freme-project / e-Entity

Apache License 2.0

Very obvious wrong spotted entities #48

Closed jnehring closed 8 years ago

jnehring commented 8 years ago

FREME NER detects ( as an entity in this text:

Madrid (/məˈdrɪd/, Spanish: [maˈðɾið], locally: [maˈðɾiθ, -ˈðɾi]) is a south-western European city and the
        capital and largest municipality of Spain. The population of the city is almost 3.2 million[4] and that of
        the Madrid metropolitan area, around 7 million. It is the third-largest city in the European Union, after
        London and Berlin, and its metropolitan area is the third-largest in the European Union after Paris and
        London.[5][6][7][8] The city spans a total of 604.3 km2 (233.3 sq mi).[9]
        The city is located on the Manzanares River in the centre of both the country and the Community of Madrid
        (which comprises the city of Madrid, its conurbation and extended suburbs and villages); this community
        is bordered by the autonomous communities of Castile and León and Castile-La Mancha. As the capital city of
        Spain, seat of government, and residence of the Spanish monarch, Madrid is also the political, economic and
        cultural centre of Spain.[10] The current mayor is Manuela Carmena from Ahora Madrid.
        The Madrid urban agglomeration has the third-largest GDP[11] in the European Union and its influences
        in politics, education, entertainment, environment, media, fashion, science, culture, and the arts all
        contribute to its status as one of the world's major global cities.[12][13] Due to its economic output,
        high standard of living, and market size, Madrid is considered the major financial centre of Southern
        Europe[14][15] and the Iberian Peninsula; it hosts the head offices of the vast majority of the major
        Spanish companies, such as Telefónica, Iberia and Repsol. Madrid is the 17th most livable city in the
        world according to Monocle magazine, in its 2014 index.[16][17]
        Madrid houses the headquarters of the World Tourism Organization (WTO), belonging to the United Nations
        Organization (UN), the SEGIB, the Organization of Ibero-American States (OEI), and the Public Interest
        Oversight Board (PIOB). It also hosts major international regulators of Spanish: the Standing Committee
        of the Association of Spanish Language Academies, headquarters of the Royal Spanish Academy (RAE), the
        Cervantes Institute and the Foundation of Urgent Spanish (Fundéu BBVA). Madrid organizes fairs such as
        FITUR,[18] ARCO,[19] SIMO TCI[20] and the Cibeles Madrid Fashion Week.[21]
        While Madrid possesses a modern infrastructure, it has preserved the look and feel of many of its historic
        neighbourhoods and streets. Its landmarks include the Royal Palace of Madrid; the Royal Theatre with its
        restored 1850 Opera House; the Buen Retiro Park, founded in 1631; the 19th-century National Library building
        (founded in 1712) containing some of Spain's historical archives; a large number of national museums,[22]
        and the Golden Triangle of Art, located along the Paseo del Prado and comprising three art museums:
        Prado Museum, the Reina Sofía Museum, a museum of modern art, and the Thyssen-Bornemisza Museum, which
        completes the shortcomings of the other two museums.[23] Cibeles Palace and Fountain have become the
        monument symbol of the city.[24][25][26]
        Madrid is home to two world-famous football clubs, Real Madrid and Atlético de Madrid.

Excerpt from the response:

<http://freme-project.eu/#char=1950,1951>
        a                     nif:Word , nif:RFC5147String , nif:Phrase , nif:String ;
        nif:anchorOf          "("^^xsd:string ;
        nif:beginIndex        "1950"^^xsd:int ;
        nif:endIndex          "1951"^^xsd:int ;
        nif:referenceContext  <http://freme-project.eu/#char=0,3388> ;
        itsrdf:taClassRef     []  ;
        itsrdf:taConfidence   "1.0"^^xsd:double ;
        itsrdf:taIdentRef     <http://dbpedia.org/resource/United_States> .

I wonder how that can happen. How is ( related to the United States?

m1ci commented 8 years ago

1) Can you put the text in a doc and share the exact cURL request? 2) Is this issue about spotting or linking?

jnehring commented 8 years ago

1) Can you put the text in a doc and share the exact cURL request?

I do not use cURL. You can reproduce the bug through the API tester: copy the above text into the body, set informat=text, dataset=dbpedia, language=en and click "try it out". This also produces a cURL request.

2) Is this issue about spotting or linking?

Actually there are three issues

I guess that you cannot control spotting because you use Stanford NER, so I suggest focusing on the wrong link.

x-fran commented 8 years ago

Please run this

curl -X POST --header "Content-Type: text/plain" --header "Accept: text/n3" -d "Eddie Hearn has revealed that he has had another offer to make Carl Frampton-Scott Quigg turned down by Cyclone Promotions.

As has previously been said ad nauseam, Hearn has had three different purse splits rejected by Frampton and Cyclone – 60/40 to the winner, 50-50, and a flat £1.5million purse.

Cyclone, on the other hand, believe that Frampton, as a legitimate World Champion, unlike Quigg, deserves a guaranteed 60-40 share of the pure at the very least.

However, since their respective victories on July 18th, Hearn has submitted an improved offer which gave Frampton a guaranteed majority split of the purse, but less than 60/40 – and these terms have been rejected.

Detailing the tortuous negotiations to YouTube channel iFilm London TV, a somewhat frustrated Hearn explained that “we have gone back and given them a bigger percentage, but we need them to also move on their percentage.”

“We swallowed some humble pie and we need them to move just a touch, but unfortunately not.”

“I’m desperate for this fight, so is Scott Quigg – that’s why, even after the last performance, we’ve gone back and given more, but we can’t get any given back in return.”

Refusing to offer the 60/40 demanded by Cyclone, the Matchroom head honcho argues that “I know people say ‘why don’t you move?’ but we’ve already moved.”

“There has to be a balance in every deal”

Hearn feels that Cyclone’s reluctance to budge on their demands “is because they don’t want the fight.”

“You’re talking about a couple of percent, it might be £150,000, so don’t tell me you want this fight.”

“You don’t believe you can win the fight.”

“Not Carl Frampton, I believe he does, but the people that make the decisions.”

Hearn feels that accepting his terms is the best possible financial decision for Frampton, although he doesn’t hold out much hope, reasoning that “Mares and Santa Cruz got $1.25million each, that’s £700,000 (or €1.1million).”

“I’ve already offered you (Frampton) double that, and now you’re gonna make more than that, much more, on the split that we’ve offered.”

“But you won’t, out of stubbornness and ego, just move a couple of percent like we did.”" "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?informat=text&outformat=json-ld&language=en&dataset=dbpedia"

In the response you will have


"nif:anchorOf": "“You",
      "beginIndex": "1578",
      "endIndex": "1582",
      "referenceContext": "http://freme-project.eu/#char=0,2156",
      "taClassRef": "http://www.w3.org/2002/07/owl#Thing",
      "itsrdf:taConfidence": 1,
      "taIdentRef": "http://dbpedia.org/resource/You"

The content is coming from this link: http://www.irish-boxing.com/hearn-improved-offer-frampton-rejected/

m1ci commented 8 years ago

@xFran Regarding the issue of "You" being marked as an entity: I see this is text with "conversation". Unfortunately, we trained on general text, not "conversation" texts, which are specific. On the other hand, note that there are also many correct entities: Carl Frampton-Scott Quigg, YouTube, Hearn, Scott Quigg, etc.

For recurring mistakes, one solution might be to maintain a blacklist of entities and simply remove them from the output. This way we will get higher precision but lower recall.
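The blacklist idea can be sketched in a few lines. This is only an illustration, not FREME NER's actual code; the entity dictionaries and the blacklist contents are assumptions made up for the example:

```python
# Illustrative sketch of a post-processing blacklist for spotted entities.
# The entity structure and blacklist contents are assumptions, not FREME
# NER's actual data model.
BLACKLIST = {"(", ")", "\u201cYou"}  # surface forms known to be wrongly spotted

def filter_entities(entities):
    """Drop any spotted entity whose anchor text is on the blacklist."""
    return [e for e in entities if e["anchorOf"] not in BLACKLIST]

entities = [
    {"anchorOf": "Madrid", "taIdentRef": "http://dbpedia.org/resource/Madrid"},
    {"anchorOf": "(", "taIdentRef": "http://dbpedia.org/resource/United_States"},
]
filtered = filter_entities(entities)
print([e["anchorOf"] for e in filtered])
```

As noted, this trades recall for precision: a blacklisted surface form is dropped unconditionally, even in contexts where it might have been correct.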

m1ci commented 8 years ago

@jnehring I put the text in a .txt doc and ran the following cURL command, but I don't get ( spotted as an entity.

curl -v -d @madrid.txt "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?informat=text&outformat=turtle&dataset=dbpedia&language=en" -H "Content-Type:"

The document is on Google Drive: https://drive.google.com/open?id=0BxeQvF3BluZUVGZFZXBPdzNyT0U

Am I missing something here?

x-fran commented 8 years ago

@m1ci The problem is that when you write an application that uses FREME NER, you don't..., better said, I don't copy the content I get from our users, paste it into a .txt file, copy it again, and send it to FREME NER. Should I have to do that? I don't know.

Please take a look at this:

screenshot from 2015-09-18 16 04 00

"“You"^^xsd:string ; The entity is not just "You", it is "“You" (with the opening curly quote). It is also true that in my code, where I clean up the content, I forgot about "“". I don't even have that character on my keyboard.

m1ci commented 8 years ago

better said, I don't copy the content I get from our users, paste it into a .txt file, copy it again, and send it to FREME NER. Should I have to do that?

We are talking about debugging here, not about how to consume FREME NER in production. We are also talking about reproducing problems. Screenshots are not enough to reproduce problems! We need concrete examples.

What I was asking is that you put your content in a document and share it! So here it is (30 seconds of work at most): https://drive.google.com/open?id=0BxeQvF3BluZUOWVfclFrNzlRTmc and here is the cURL command to process this content:

curl -v -d @test-1.txt "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?informat=text&outformat=turtle&dataset=dbpedia&language=en" -H "Content-Type:"

... finally, I managed to reproduce the problem: I get "You (including the opening quote) marked as an entity. However, if you remove the quotes, it is not marked as an entity.
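A possible client-side workaround, sketched here as an assumption rather than anything FREME NER provides, is to normalize typographic quotes to plain ASCII quotes before sending the text, so that tokens like "You are not glued to an opening curly quote:

```python
# Sketch: map typographic quotation marks to plain ASCII quotes before
# sending text to the NER service. Purely illustrative client-side cleanup.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})

def normalize_quotes(text: str) -> str:
    return text.translate(QUOTE_MAP)

print(normalize_quotes("\u201cYou don\u2019t believe you can win the fight.\u201d"))
```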

x-fran commented 8 years ago

We are talking about debugging here, not about how to consume FREME NER in production. We are also talking about reproducing problems. Screenshots are not enough to reproduce problems! We need concrete examples.

As far as I know, to reproduce an error you must use the same environment. Copying and pasting the content into a .txt or .doc file is not the same environment, so you cannot reproduce the error. That's why we provide screenshots.

m1ci commented 8 years ago

As far as I know, to reproduce an error you must use the same environment.

This is what I'm trying to do - share test data. Hey, you share screenshots! We can't help you with screenshots.

x-fran commented 8 years ago

Hey @m1ci, we might be going on the wrong track here so please allow me to ask you a few questions:

This issue has been reproduced by at least 2 people; could you agree that there's a possibility the issue is on your end? Can we all agree that the returned entity "(" is not a valid entity? Since @jnehring and I are getting the same error in different environments, would you say both of us are wrong? Wripl is using e-Entity in production; what is the percentage of your unit test code coverage?

m1ci commented 8 years ago

This issue has been reproduced by at least 2 people; could you agree that there's a possibility the issue is on your end?

Which issue? Yes, I managed to capture "You as an incorrectly spotted entity.

Can we all agree that the returned entity "(" is not a valid entity?

So far, I didn't manage to reproduce this issue. Please check my comment https://github.com/freme-project/e-Entity/issues/48#issuecomment-141472261 and provide the exact data. Otherwise we can't help.

Since @jnehring and I are getting the same error in different environments, would you say both of us are wrong?

I am not saying you are wrong or right; where did you read that? Please put your data in a .txt file and share it with us so we can reproduce the problem. I have asked several times, and you haven't done that. Sorry, but we can't reproduce the problem without concrete content (text). Sharing data via screenshots or links to Wikipedia pages is not the right way. Just copy and paste the content into a .txt file and share it with me. Can you do this?

Wripl is using e-Entity in production; what is the percentage of your unit test code coverage?

I don't understand your question. Are you asking for evaluation results of 1) entity spotting, 2) entity linking, or 3) entity classification?

Regarding 1) and 3), we are using state-of-the-art algorithms; regarding 2), this is going to be evaluated within the FREME 0.4 period.

x-fran commented 8 years ago

So far, I didn't manage to reproduce this issue. Please check my comment #48 (comment) and provide exact data. Otherwise we can't help.

You don't believe us? :) If you cannot reproduce this, does that mean it won't be fixed?

I understand you can't reproduce this issue, and it may not even be possible with your current environment setup, but the problem exists and has been reported by two developers, so in most circumstances a good developer would consider this an issue and find a way to fix it.

Like I said, you're not able to reproduce the error because it is not the same environment after you copy/paste the content and save it to a .txt doc. By doing that you may change the character encoding, the text format, the BOM, etc. You are aware of that, aren't you?

@jnehring said:

I do not use cURL. You can reproduce the bug through the API tester: copy the above text into the body, set informat=text, dataset=dbpedia, language=en and click "try it out". This also produces a cURL request.

Did you do that? Did you reproduce the bug through the API tester?

In what cases do you think FREME NER spots "(" as an entity? That would help me understand what is going wrong, so I can avoid this issue on my side.

Regarding 1) and 3), we are using state-of-the-art algorithms; regarding 2), this is going to be evaluated within the FREME 0.4 period.

A state-of-the-art algorithm is not a unit test. I mean something like JUnit.

fsasaki commented 8 years ago

Guys, I am a bit concerned about the tone in this thread. Statements like "a good developer would consider this an issue and find a way to fix it" don't help to move the issue forward. I am happy to organise a call about the issue next week. It may be easier to resolve this on the phone. Let me know if you want to do that.

x-fran commented 8 years ago

This is just usual small talk between developers. :) Nothing to be concerned about. I'm not.

You're right about the call @fsasaki. I'm sure we can all find a day in our busy agenda next week to make it happen.

fsasaki commented 8 years ago

I am actually concerned about this, @xFran . So let's find a time to talk.

jnehring commented 8 years ago

I agree. We should find a time to talk about this. Let's continue via email to find a date. Meanwhile @m1ci and I will work on reproducing the issue on Milan's computer.

koidl commented 8 years ago

Hi

Kevin here this time. We have similar issues now popping up in different dashboards, for example this one (below) indicating 'B' as a label. I am not sure if the right strategy is to pick out every one and send them over to you. There seems to be a pattern, however, e.g. exclude all labels that are not 100% alphanumeric, exclude all labels that have only one character...? Happy to look for this one; it's hard on our side, though. The WP plugin sends the text, and we don't store it: it goes into Solr, and only the link is stored in the db. Pulling content out of Solr is possible but difficult (haven't done that in years; I will have to check how to do it).

Should I look for the page related to the issue below? Although I have the feeling that we will be chasing these bugs for a long time if we don't start focusing on patterns.

We will also start using categories, which might reduce some of these issues.

screen shot 2015-09-21 at 10 13 52

jnehring commented 8 years ago

In this thread we are talking about ~4 different wrongly spotted entities, so now, 17 comments later, I don't understand what exactly we are talking about. Each of the wrongly spotted entities might originate from a different problem and therefore deserves its own issue.

So my suggestion: let's open a new GitHub issue for each wrong entity, and then we can organize these with GitHub's label system. We should also agree on the format, e.g. paste the cURL command. And try to use a short cURL command: my example above has 3420 characters, but the error can be reproduced with one sentence. So it is better to paste short examples that are easy to debug.

This is just some input for the call tomorrow; we can also agree on a different procedure, but the current procedure does not work.

koidl commented 8 years ago

@jnehring I agree this is not working very well. It's way too complicated on our side. I will try to explain. I am now also wondering if this is fixable at all, and whether DBpedia Spotlight has the same issues?

  1. I need to go to the dashboard of a website (at the moment we have over 200)
  2. There I need to actually spot a fault (such as the 'B' example above)
  3. The labels in the dashboard above are only the top 10, therefore I have to be lucky to even see one
  4. When a faulty label name is spotted, I need to go to the database and search over all entries of that website ID with a %LIKE% command, due to the labels being stored in an array associated with the page.
  5. A %LIKE% for 'B', for example, brings back all the labels with 'B' in them, such as 'Boxing', 'Boxers', etc.
  6. I can then keep looking/guessing or redesign a query that spots this, which is a lot of work and time for one possible wrong label name....
  7. I then pull out the URL and try to pull the text from Solr, where we store the sent content

It's a bit like looking for a needle in a haystack, after being lucky enough to find out which haystack to look in.

Lets think of a solution here, maybe:

  1. We only use the new categories (which are very general but maybe okay for the moment) - this won't fix anything though and only buys time
  2. We keep going like this and open a new issue for each wrong label that we find by chance, in the hope that it all gets fixed over time.
  3. We drop FREME NER, switch back to OpenCalais and work from there - the dashboard is still in ALPHA, but customers can now see it, therefore we need to find some fix.
  4. We design smart filters where only valid words pass (nothing with one letter or non-alphanumeric characters)?

Let me know what you think. We will support any solution-finding path needed. Kevin

fsasaki commented 8 years ago

Hi Kevin,

the current approach seems to be complicated and very time-consuming for both sides. There are still some misunderstandings about how to parametrize FREME NER. Indeed, let's discuss on the phone. In Turin we considered an interim technical f2f; it may indeed be needed.


jnehring commented 8 years ago

@koidl wrote

Lets think of a solution here, maybe:

  1. We only use the new categories (which are very general but maybe okay for the moment) - this wont fix anything though and only buys time

What else do you use right now? I thought you used only the new categories.

  1. We keep going like this and open a new issue for each wrong label that we find by chance in the hope that it fixes all over time.

State-of-the-art precision of named entity recognition / linking is far from 100%; I think it's more like 80% on "good" text, and probably less on noisy web text. So putting each wrong label in an issue will not help us either. We should discuss today which types of errors make sense to report.

  1. We drop FREME NER and switch back to opencalais and work from there - the dashboard is still on ALPHA but customers can now see the dashboard therefore we need to find some fix.

Please don't do that.

  1. We design smart filters where only valid words pass (nothing with one letter or non Alphanumerical)?

If such a filter improves the performance of FREME NER, then we should integrate it into the API, not on the client side. Of course you can implement it on the client side too. We need some time here: Nilesh has many ideas for improving FREME NER and cannot do everything at the same time.
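A minimal sketch of such a filter, whichever side it ends up on. The exact rules here (minimum length two; only word characters, spaces, hyphens and periods) are an assumption based on the "one letter or non-alphanumeric" criterion discussed above, not an agreed specification:

```python
import re

# Sketch of the proposed "smart filter": a label passes only if it is at
# least two characters long and consists of word characters, spaces,
# hyphens or periods. The rules are illustrative assumptions.
VALID_LABEL = re.compile(r"^[\w][\w .\-]+$")

def is_valid_label(label: str) -> bool:
    return len(label) >= 2 and VALID_LABEL.match(label) is not None

for label in ["Madrid", "Real Madrid", "B", "(", "\u201cYou"]:
    print(label, is_valid_label(label))
```

This would reject 'B', '(' and '“You' while letting multi-word labels like 'Real Madrid' pass; as with the blacklist idea, it buys precision at the cost of recall.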

koidl commented 8 years ago

Sounds good. This stuff is really hard, and we really appreciate everyone's hard work on this. Pressure is high on all sides, and we all want to make this a success story.

Unfortunately I won't be able to attend the meeting today. I have a workshop-style meeting from 9:30 - 2:30 today that I can't push. It's related to Trinity's digital library repository (over 100,000 documents, including the Book of Kells). They are currently testing Wripl and want to integrate.

Maybe the idea of a smart filter (possibly based on regular expressions) on the FREME NER side could work. The filter might also be used to identify faults, e.g. by logging all the entities that don't pass and then investigating why. On our side we can track the ones that pass the filter, and in case they are wrong, we can extend the filter accordingly.

It may also bring back the idea that we (Wripl) write an interface that allows us to see the entities and approve or reject them. This won't work at scale, but it could help short term (over the next 3 months, for example).

We are testing the categories and want to use both (entities and categories) depending on the case. Analysing the categories before deciding whether to use an entity might also be a solution, but I need to look into it more.

Kevin

jnehring commented 8 years ago

To summarize and close the issue:

The issue of ( being linked to the United States is solved. @m1ci could reproduce it and @nilesh-c could fix the bug.

"You" being linked to http://dbpedia.org/resource/You: I would not consider this a bug, so I am not raising a new issue. I think that quotes are a strong indicator that something is an entity, and putting "you" in quotes in this context counts to me as noisy text.

koidl commented 8 years ago

@jnehring I would need a quick update on what the strategy is. Are we still on the path of reporting each wrong label, or are we looking into smart filters? Are we moving this to email now that it's closed?

m1ci commented 8 years ago

@jnehring I would need a quick update on what the strategy is. Are we still on the path of reporting each wrong label, or are we looking into smart filters? Are we moving this to email now that it's closed?

IMO, feel free to report wrongly spotted or linked entities that are causing you problems, or in other words, entities that occur very often in your data and are wrongly spotted or linked.

Also, in our telco on Sep 22nd we agreed that such issues will be reported, with an appropriate cURL command that reproduces the problem. If the problem occurs in long texts, please also share the text: upload it somewhere like Dropbox or Google Drive so we can download it.

@jnehring please correct me if I missed something. Thanks!

jnehring commented 8 years ago

@jnehring I would need a quick update on what the strategy is. Are we still on the path of reporting each wrong label, or are we looking into smart filters? Are we moving this to email now that it's closed?

I agree with Milan; you can still report wrongly spotted entities, because it might help us find bugs. Just create a separate GitHub issue for each wrong spotting, please.

Thanks for reminding me about the smart filters, I created #51 for that.