freme-project / e-Entity

Apache License 2.0

Feedback on topic detection #50

Closed jnehring closed 8 years ago

jnehring commented 8 years ago

@m1ci wrote

@koidl the topics and associated weights and labels are already provided as part of the output. See http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=Berlin+is+in+Germany.&outformat=turtle&language=en&dataset=dbpedia&enrichement=dbpedia-categories

The category information is not included by default. You need to add the enrichement=dbpedia-categories parameter to also include the topics in the output.

Any feedback is more than welcome.

fsasaki commented 8 years ago

What is the meaning of freme-onto in "freme-onto:info"? See below:

{
  "@id": "dbpedia:Category:Western_Europe",
  "freme-onto:info": "14.821275727763961",
  "rdfs:label": {
    "@language": "en",
    "@value": "Western Europe"
  }
}

m1ci commented 8 years ago

This is the "informativeness" score for the category, computed from DBpedia. Categories that are assigned to fewer entities in DBpedia are considered more informative; those assigned to more entities are considered less informative. To compute the informativeness values we use formulas from information theory. See https://en.wikipedia.org/wiki/Self-information

The dataset with the topic counts and topic informativeness values can be downloaded from http://rv2622.1blu.de/datasets/dbpedia-categories/dbpedia-categories-counts.ttl
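
For readers unfamiliar with self-information, here is a minimal Python sketch of the idea, I(c) = -log2 p(c). It assumes p(c) is the category's entity count divided by some total count; the helper name `informativeness` and the counts are made up for illustration and are not taken from the FREME code or dataset.

```python
import math

def informativeness(category_count: int, total_count: int) -> float:
    """Self-information in bits: -log2(p), with p = category_count / total_count."""
    if category_count <= 0 or total_count <= 0 or category_count > total_count:
        raise ValueError("need 0 < category_count <= total_count")
    return -math.log2(category_count / total_count)

# Hypothetical counts, for illustration only.
rare = informativeness(2, 1_000_000)          # category assigned to few entities
common = informativeness(500_000, 1_000_000)  # category assigned to many entities

print(rare > common)  # rarer categories get the higher score
```

The fewer entities a category covers, the smaller p(c) and the larger the score, which matches the description above.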

m1ci commented 8 years ago

> What is the meaning of freme-onto in "freme-onto:info"?

It is a prefix for a predicate.

fsasaki commented 8 years ago

Ah, sorry, I was unclear. My question was: "What is the meaning of freme-onto:info?"

koidl commented 8 years ago

Thanks for this. We will look at this later this week and start integrating. Thanks a million.

kevin

jnehring commented 8 years ago

I was playing around with topic detection. I sent a news article about boxing to FREME NER and then wrote a script that extracts all categories from the output and sorts them by informativeness. Here is a sorted list of all 121 categories with their informativeness values:

1   dbc:British_Islands 15.94680660984782
2   dbc:Slavic_countries_and_territories    15.821275727763961
3   dbc:Boxing_venues_in_the_United_Kingdom 15.705798510344026
4   dbc:Free_labor  15.705798510344026
5   dbc:Member_states_of_the_Council_of_Europe  15.705798510344026
6   dbc:London_sub_regions  15.705798510344026
7   dbc:People_from_Scarborough,_North_Yorkshire_(district) 15.705798510344026
8   dbc:Jarrow  15.598883306427513
9   dbc:Port_cities_and_towns_in_England    15.598883306427513
10  dbc:Military_units_and_formations_of_Great_Britain_in_the_American_Revolutionary_War    15.499347632876601
11  dbc:Alumni_of_the_English_Martyrs_School_and_Sixth_Form_College 15.499347632876601
12  dbc:Staple_ports    15.406238228485117
13  dbc:Towns_in_West_Sussex    15.406238228485117
14  dbc:Sport_in_Tower_Hamlets  15.31877538723478
15  dbc:Post_towns_in_the_RH_postcode_area  15.31877538723478
16  dbc:Germanic_countries_and_territories  15.236313227042807
17  dbc:British_capitals    15.236313227042807
18  dbc:Local_government_in_West_Sussex 15.236313227042807
19  dbc:G20_nations 15.158310715041532
20  dbc:Altruism    15.158310715041532
21  dbc:1936_establishments_in_Australia    15.084310133597755
22  dbc:Areas_of_Rochdale_Borough   15.013920805706357
23  dbc:Recurring_sporting_events_established_in_1930   14.946806609847819
24  dbc:British_Armed_Forces    14.882676272428105
25  dbc:Western_Europe  14.821275727763961
26  dbc:Central_Europe  14.821275727763961
27  dbc:Australian_art_awards   14.821275727763961
28  dbc:Commonwealth_of_Nations 14.705798510344026
29  dbc:Populated_places_established_in_the_1st_century 14.705798510344026
30  dbc:Awards_established_in_1936  14.705798510344026
31  dbc:Ice_hockey_leagues_in_Ontario   14.651350726321649
32  dbc:Towns_in_Tyne_and_Wear  14.651350726321649
33  dbc:Commonwealth_sport  14.598883306427513
34  dbc:Crawley 14.598883306427513
35  dbc:Member_states_of_NATO   14.598883306427513
36  dbc:Visitor_attractions_in_Tower_Hamlets    14.598883306427513
37  dbc:Sports_originating_in_England   14.499347632876601
38  dbc:Royal_Marines   14.452041918098244
39  dbc:Summer_Olympic_sports   14.452041918098244
40  dbc:Northern_Europe 14.361844109126665
41  dbc:Geography_of_Tower_Hamlets  14.361844109126665
42  dbc:Post_towns_in_the_NE_postcode_area  14.318775387234778
43  dbc:Commonwealth_Games  14.318775387234778
44  dbc:Boxing  14.318775387234778
45  dbc:New_towns_in_England    14.276955211540152
46  dbc:Member_states_of_the_European_Union 14.276955211540152
47  dbc:Sports_organisations_of_the_United_Kingdom  14.236313227042807
48  dbc:Social_ethics   14.196784862856168
49  dbc:Member_states_of_the_Union_for_the_Mediterranean    14.158310715041532
50  dbc:Constitutional_monarchies   14.158310715041532
51  dbc:Politics_and_sports 14.12083600962287
52  dbc:English_geneticists 14.084310133597755
53  dbc:European_martial_arts   14.048686223867035
54  dbc:Fullerian_Professors_of_Physiology  13.914385132155443
55  dbc:Orthography 13.882676272428105
56  dbc:Arthurian_locations 13.821275727763961
57  dbc:People_from_Consett 13.791528384369908
58  dbc:Boxers_at_the_2014_Commonwealth_Games   13.762382038710392
59  dbc:Volunteerism    13.762382038710392
60  dbc:British_Commandos   13.762382038710392
61  dbc:Member_states_of_the_Commonwealth_of_Nations    13.67831777392192
62  dbc:Countries_in_Europe 13.651350726321649
63  dbc:Newcastle_United_F.C._non-playing_staff 13.573348214320376
64  dbc:Junior_ice_hockey_leagues_in_Canada 13.523595179123276
65  dbc:Civil_society   13.499347632876601
66  dbc:Capitals_in_Europe  13.499347632876601
67  dbc:Sports_venues_in_London 13.452041918098244
68  dbc:States_and_territories_established_in_1918  13.428958304985203
69  dbc:Local_government_districts_of_South_East_England    13.340149038027347
70  dbc:Fire    13.297713771706949
71  dbc:Giving  13.196784862856168
72  dbc:Multi-sport_events  13.139451687790215
73  dbc:History_of_Tower_Hamlets    13.102457480308015
74  dbc:Military_of_the_United_Kingdom  13.084310133597755
75  dbc:Basic_financial_concepts    13.084310133597755
76  dbc:Presidents_of_the_British_Science_Association   13.066388225600493
77  dbc:Combat_sports   13.048686223867035
78  dbc:Philanthropy    12.97997347378302
79  dbc:Writing_systems 12.93050479751872
80  dbc:Liberal_democracies 12.882676272428105
81  dbc:English-speaking_countries_and_territories  12.821275727763961
82  dbc:Reading_(process)   12.776881608405509
83  dbc:People_from_Hartlepool  12.733812886513624
84  dbc:Public_finance  12.719737701301899
85  dbc:Olympic_boxers_of_Great_Britain 12.638053903708192
86  dbc:Applied_linguistics 12.624878514960459
87  dbc:Buildings_and_structures_in_Tower_Hamlets   12.475500890922232
88  dbc:Public_administration   12.463723723145877
89  dbc:Republics   12.463723723145877
90  dbc:British_Empire  12.395010973061865
91  dbc:Island_countries    12.395010973061865
92  dbc:Individual_sports   12.361844109126665
93  dbc:Articles_including_recorded_pronunciations_(UK_English) 12.361844109126665
94  dbc:Team_sports 12.048686223867035
95  dbc:Taxation    11.938632678402122
96  dbc:Member_states_of_the_United_Nations 11.813781191217037
97  dbc:Social_philosophy   11.469600289482548
98  dbc:Educational_psychology  11.345542296797564
99  dbc:Boxers_at_the_2012_Summer_Olympics  11.246366891706728
100 dbc:Commonwealth_Games_gold_medallists_for_England  11.005358792202934
101 dbc:Finance 10.882676272428105
102 dbc:Royal_Medal_winners 10.765993292262772
103 dbc:English_boxers  10.598883306427513
104 dbc:Commonwealth_Games_competitors_for_England  10.340149038027345
105 dbc:People_educated_at_Rugby_School 10.251420119433014
106 dbc:Scunthorpe_United_F.C._players  10.21888615528462
107 dbc:Middleweight_boxers 10.209021535375065
108 dbc:Carlisle_United_F.C._players    10.0117755338748
109 dbc:Areas_of_London 9.969526686347905
110 dbc:Barnsley_F.C._players   9.89448557471774
111 dbc:Darlington_F.C._players 9.522067709376683
112 dbc:Alumni_of_St_John's_College,_Cambridge  9.073082878174501
113 dbc:1926_deaths 8.743569852967576
114 dbc:1861_births 8.426813033153644
115 dbc:Fellows_of_the_Royal_Society    7.1139165956830785
116 dbc:Articles_containing_video_clips 6.744905475689931
117 dbc:1961_births 6.23537427807319
118 dbc:1991_births 6.231624581249177
119 dbc:Association_football_defenders  5.542342354790045
120 dbc:English_footballers 5.408794089514977
121 dbc:Living_people   -0.0

Out of the first 10, only "dbc:Boxing_venues_in_the_United_Kingdom" is related to the topic of the text. "dbc:British_Islands" might also be OK because the article is about boxing in the UK. The category dbc:Boxing is ranked 44. There are many sport-related categories, but they are spread across the ranks, mostly in the middle and lower ranks. A category named Sports is not in the list.

Actually the word "boxing" appears in almost every sentence of the document. Maybe we should not take only informativeness into account but also how frequently an entity appears?

The last informativeness value is 0 / negative ? Is this possible? Negative informativeness might indicate a bug.

I can share my script if anyone is interested. It's written in PHP.
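
For anyone who wants to repeat this without the PHP script, a rough Python sketch of the same extract-and-sort step follows. It is not the script mentioned above. It assumes the JSON-LD output keeps its nodes in a top-level "@graph" list and that category nodes carry "freme-onto:info" as in the snippet quoted earlier in this issue; the function name `rank_categories` and the hand-built sample are illustrative.

```python
import json

def rank_categories(jsonld_text: str) -> list[tuple[str, float]]:
    """Return (category id, informativeness) pairs, most informative first."""
    doc = json.loads(jsonld_text)
    # Assumption: nodes live in a top-level "@graph" list.
    nodes = doc.get("@graph", [])
    scored = [
        (node["@id"], float(node["freme-onto:info"]))
        for node in nodes
        if "freme-onto:info" in node  # skip entity nodes and other non-category nodes
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hand-built sample that reuses three values from the list above.
sample = json.dumps({"@graph": [
    {"@id": "dbpedia:Category:Living_people", "freme-onto:info": "-0.0"},
    {"@id": "dbpedia:Category:Western_Europe", "freme-onto:info": "14.821275727763961"},
    {"@id": "dbpedia:Category:Boxing", "freme-onto:info": "14.318775387234778"},
    {"@id": "http://freme-project.eu/#char=10,17", "nif:anchorOf": "Germany"},
]})

for rank, (category, info) in enumerate(rank_categories(sample), start=1):
    print(rank, category, info)
```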

m1ci commented 8 years ago

> The last informativeness value is 0 / negative ? Is this possible? Negative informativeness might indicate a bug.

Yes, it seems to be a bug. Actually, the Living_people category is the least informative category: it is the category with the most DBpedia resources.
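
As a side note on how a value of -0.0 can appear at all: if the score is computed as a negated logarithm (an assumption, since the FREME code is not shown in this issue) and the probability is exactly 1.0, the logarithm is 0.0 and negating it yields IEEE 754 negative zero. The sketch below only demonstrates that floating-point behaviour.

```python
import math

p = 1.0                # hypothetical: the category covers the whole reference set
score = -math.log2(p)  # log2(1.0) is 0.0; negating it yields negative zero

print(score)           # -0.0
print(score == 0.0)    # True: negative zero compares equal to zero

# Adding 0.0 normalises negative zero to positive zero for display.
print(score + 0.0)     # 0.0
```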

> Actually the word "boxing" appears in almost every sentence of the document. Maybe we should not take only informativeness into account but also how frequently an entity appears?

Maybe a combination of both.
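
To make the "combination" idea concrete, here is one possible weighting as a Python sketch. It is purely hypothetical and has not been evaluated: the score multiplies informativeness by a log-scaled count of how often the entities carrying that category are mentioned. The function name `combined_score` and the mention counts are made up; only the informativeness values come from the list above.

```python
import math

def combined_score(informativeness: float, mentions: int) -> float:
    """Hypothetical ranking score: informativeness scaled by log2(1 + mentions)."""
    return informativeness * math.log2(1 + mentions)

# Informativeness values from the list above; mention counts are made up.
candidates = {
    "dbc:British_Islands": (15.94680660984782, 1),
    "dbc:Boxing": (14.318775387234778, 12),
}

ranked = sorted(candidates, key=lambda c: combined_score(*candidates[c]), reverse=True)
print(ranked)
```

With these made-up counts, a frequently mentioned category such as dbc:Boxing moves ahead of a rarely mentioned one despite its slightly lower informativeness.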

m1ci commented 8 years ago

... waiting for feedback from @koidl and/or @xFran

fsasaki commented 8 years ago

@koidl and/or @xFran, if you have feedback on this could you please provide it in this issue? Thanks.

x-fran commented 8 years ago

I'm busy right now. I have to write an app for Trinity College Dublin/FALCON Project. As soon as I have time to implement topic detection in Wripl's application, or Kevin has time to test it, we will provide feedback. Sorry about this, guys.

x-fran commented 8 years ago

OK. My other project is going well, so I may be able to take a look at this on Monday. Is there anything I should focus on in particular, or should I just use some random text and check whether the topics make sense? Does topic detection work only for English? Maybe the last question is a silly one. But hey, it's mine! :-)

Any suggestions please? Thanks in advance.

jnehring commented 8 years ago

I would suggest that you take text that is actually used by WRIPL and comes from one of the WRIPL domains, e.g. finance. It should be text where you have clear expectations of what the topics are. I suggest you start with English text. I think it should work for all languages that FREME NER supports, so you can also test other languages as long as you understand them well enough to judge whether the topic detection is right.

x-fran commented 8 years ago

OK @jnehring. I will try to do my best following your recommendations. Regarding languages, I can test in English, Spanish and Romanian (though I believe Romanian is not supported by the NER engine; it would be nice to have it in the future), and Kevin can test in German. Those are the languages we manage here.

Thank you.

m1ci commented 8 years ago

Currently, entities can be enriched with categories only if they are spotted in English texts. Note that the categories are retrieved from DBpedia. You can check all available categories for an entity by 1) visiting the DBpedia page for the entity and looking at dcterms:subject, or 2) querying them via our FREME LDF endpoint at rv2622.1blu.de:5000/dbpedia-types
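
A further option, not mentioned above, is to ask a SPARQL endpoint for the dcterms:subject values directly. The Python sketch below only builds the request URL. It assumes the public DBpedia endpoint at https://dbpedia.org/sparql and its `query` and `format` parameters; the helper name `subject_query_url` is made up, and the FREME LDF endpoint is not covered here.

```python
from urllib.parse import urlencode

def subject_query_url(resource_uri: str,
                      endpoint: str = "https://dbpedia.org/sparql") -> str:
    """Build a GET URL asking the endpoint for the dcterms:subject values of a resource."""
    query = (
        "PREFIX dcterms: <http://purl.org/dc/terms/> "
        f"SELECT ?category WHERE {{ <{resource_uri}> dcterms:subject ?category }}"
    )
    # Assumption: the endpoint accepts `query` and `format` as GET parameters.
    return endpoint + "?" + urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )

url = subject_query_url("http://dbpedia.org/resource/Germany")
print(url)
```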

x-fran commented 8 years ago

Thank you for the note @m1ci.

x-fran commented 8 years ago

Hi all

I'm running some tests here. Please see the attached text file.

test.txt

First cURL command on FREME's production server

curl -v -d @test.txt "http://api.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=&outformat=json-ld&language=en&dataset=dbpedia"

Response:

{
  "timestamp": 1445250576139,
  "error": "Bad Gateway",
  "status": 502,
  "exception": "eu.freme.broker.exception.ExternalServiceFailedException",
  "path": "//e-entity/freme-ner/documents/"
}
* Closing connection 0

Second cURL command on FREME's dev server

curl -v -d @test.txt "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=&outformat=json-ld&language=en&dataset=dbpedia" 

Response:

HTTP/1.1 200 OK

But the strange part here is that even though the response is 200 OK, no entities were spotted.

And no categories, of course.

I really hope I did something wrong, because if not, this can affect Wripl a lot.

jnehring commented 8 years ago

@xFran

You discovered a bug in FREME NER and I created a bug report for this: https://github.com/freme-project/freme-ner/issues/47

The problem lies within the &input=&outformat=... part of your request. The input parameter overwrites the POST body, so you submit an empty string to enrichment. I could reproduce the bug, and when I remove the &input= bit, it works.
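
If the request URL is assembled in code, the workaround can be sketched as follows: drop the `input` parameter before building the query string so that the POST body is used as the text. The helper name `build_url` is made up for illustration; the base URL and the other parameters are the ones used in this issue.

```python
from urllib.parse import urlencode

BASE = "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/"

def build_url(params: dict) -> str:
    """Build the request URL, dropping `input` so the POST body is used as the text."""
    cleaned = {key: value for key, value in params.items() if key != "input"}
    return BASE + "?" + urlencode(cleaned)

url = build_url({
    "informat": "text",
    "input": "",  # would override the POST body, so it is removed
    "outformat": "json-ld",
    "language": "en",
    "dataset": "dbpedia",
})
print(url)
```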

x-fran commented 8 years ago

Second test

I'm using the same boxing article as @jnehring used before.

On FREME's production server I get the same error. No changes here.

On FREME's development server

curl -v -d @test.txt "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=&outformat=json-ld&language=en&dataset=dbpedia&enrichement=dbpedia-categories"

I get 3 entities and 11 categories. Not all are related to boxing or sports, but they seem to make sense.

jnehring commented 8 years ago

The category system is not deployed on production. Please use dev for testing.

x-fran commented 8 years ago

For the categories I used dev. I just wanted to see if there were any differences in spotting entities between dev and prod.

m1ci commented 8 years ago

@xFran Your cURL call is not OK. With -d, cURL by default sets the Content-Type header to application/x-www-form-urlencoded, which is then incorrectly interpreted by the API. When calling the service, also set the Content-Type header to an empty value using -H "Content-Type:"

curl -v -d @test-2.txt "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=&outformat=json-ld&language=en&dataset=dbpedia&enrichement=dbpedia-categories" -H "Content-Type:"
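
Other HTTP clients have the same default. Python's urllib.request, for example, also adds application/x-www-form-urlencoded when a request has a body and no Content-Type header. The sketch below sets the header explicitly and only builds the request without sending it. Using text/plain is an assumption: this issue only confirms that clearing the header works with cURL. The helper name `build_request` is made up.

```python
import urllib.request

URL = ("http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/"
       "?informat=text&outformat=json-ld&language=en&dataset=dbpedia"
       "&enrichement=dbpedia-categories")

def build_request(text: str) -> urllib.request.Request:
    """Prepare a POST with the text as body and an explicit Content-Type."""
    return urllib.request.Request(
        URL,
        data=text.encode("utf-8"),
        # Assumption: text/plain is accepted; it replaces the form-encoded default.
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

req = build_request("Berlin is in Germany.")
print(req.get_method(), req.get_header("Content-type"))
# To actually send it: urllib.request.urlopen(req).read()
```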

x-fran commented 8 years ago

@m1ci Still no entities.

curl -v -d @test.txt "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&outformat=json-ld&language=en&dataset=dbpedia&enrichement=dbpedia-categories" -H "Content-Type:"

cURL fixed.

x-fran commented 8 years ago

Removing the first paragraph I get:

One entity

 {
    "@id" : "http://freme-project.eu/#char=10,17",
    "@type" : [ "nif:Phrase", "nif:Word", "nif:RFC5147String", "nif:String" ],
    "nif:anchorOf" : "Germany",
    "beginIndex" : "10",
    "endIndex" : "17",
    "referenceContext" : "http://freme-project.eu/#char=0,2889",
    "taClassRef" : "http://nerd.eurecom.fr/ontology#Location",
    "itsrdf:taConfidence" : 0.5323128034518319,
    "taIdentRef" : "dbpedia:Germany"
  }

and 11 categories.

The categories are not really relevant for this article.

Germany/German is mentioned 13 times in the article, so this entity is OK. China/China’s is mentioned 4 times and was not spotted as an entity.

m1ci commented 8 years ago

> @m1ci Still no entities.

I believe this is related to issue https://github.com/freme-project/freme-ner/issues/41

jnehring commented 8 years ago

Can you please move the discussion about getting no entities to a new GitHub issue? We should not mix up different topics in the same discussion.

x-fran commented 8 years ago

I can move it to another GitHub issue, no problem. This issue is about "Feedback on topic detection". I don't know whether this is a bug, whether there is any relation between the issues, or whether it is just my fault because I messed things up. Nothing is clear yet, and I don't know the code or the workflow of your application either. If you don't mind, could you or @m1ci open the GitHub issue if needed? You know better than I do what is going on.

fsasaki commented 8 years ago

Hi @xFran, I think this issue is about whether this type of output

1 dbc:British_Islands 15.94680660984782
2 dbc:Slavic_countries_and_territories 15.821275727763961

(see top of the issue) is relevant for you, and if not, what could be changed. This type of output is not about entity identification itself but about topics. I assume this is why @jnehring proposed keeping only topic detection discussion here and moving the discussion of entity detection to a different issue.

jnehring commented 8 years ago

Actually, I think this issue is too long and we should close it. I created #53 to collect information about the quality of topic detection. @xFran, please open new issues for the bugs you discovered. The problem looks very similar to https://github.com/freme-project/freme-ner/issues/46, so you may want to link to it from your new issue.