Closed jnehring closed 8 years ago
What is the meaning of freme-onto in "freme-onto:info"? See below:
{
  "@id": "dbpedia:Category:Western_Europe",
  "freme-onto:info": "14.821275727763961",
  "rdfs:label": {
    "@language": "en",
    "@value": "Western Europe"
  }
}
This is an "informativeness" score for the category, computed from DBpedia. Categories that are assigned to fewer entities in DBpedia are considered more informative; those that are assigned to more entities are considered less informative. To compute the informativeness values we use formulas from information theory. See https://en.wikipedia.org/wiki/Self-information
The dataset with the topic counts and topic informativeness values can be downloaded from http://rv2622.1blu.de/datasets/dbpedia-categories/dbpedia-categories-counts.ttl
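To make the idea concrete, here is a minimal sketch of a self-information score in Python. It is only an illustration of the formula from the linked article, not the actual FREME implementation: the function name, the choice of normalizing total, and the example counts are all assumptions.

```python
import math


def informativeness(category_count, total_count):
    """Self-information, in bits, of a category.

    category_count: number of entities assigned to the category.
    total_count: normalizing total (assumption: how FREME normalizes
    the counts is not stated in this thread).
    """
    if not 0 < category_count <= total_count:
        raise ValueError("need 0 < category_count <= total_count")
    return -math.log2(category_count / total_count)


# Hypothetical counts, not taken from the DBpedia dataset.
counts = {"dbc:Rare_category": 1, "dbc:Common_category": 1024}
total = 4096
for name, n in counts.items():
    print(name, informativeness(n, total))
```

With such a formula, a category covering fewer entities gets a higher score, and a category whose count equals the normalizing total gets a score of zero.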
What is the meaning of freme-onto in "freme-onto:info"?
It is a prefix for a predicate.
ah, sorry, I was unclear. My question was: "what is the meaning of freme-onto:info ?"
Thanks for this. We will be looking at this later this week and start integrating. Thanks a million
kevin
I was playing around with topic detection. I sent a news article about boxing to FREME NER and then wrote a script that extracts all categories from the output and sorts them by informativeness. Here is a sorted list of all 121 categories with their informativeness values:
1 dbc:British_Islands 15.94680660984782
2 dbc:Slavic_countries_and_territories 15.821275727763961
3 dbc:Boxing_venues_in_the_United_Kingdom 15.705798510344026
4 dbc:Free_labor 15.705798510344026
5 dbc:Member_states_of_the_Council_of_Europe 15.705798510344026
6 dbc:London_sub_regions 15.705798510344026
7 dbc:People_from_Scarborough,_North_Yorkshire_(district) 15.705798510344026
8 dbc:Jarrow 15.598883306427513
9 dbc:Port_cities_and_towns_in_England 15.598883306427513
10 dbc:Military_units_and_formations_of_Great_Britain_in_the_American_Revolutionary_War 15.499347632876601
11 dbc:Alumni_of_the_English_Martyrs_School_and_Sixth_Form_College 15.499347632876601
12 dbc:Staple_ports 15.406238228485117
13 dbc:Towns_in_West_Sussex 15.406238228485117
14 dbc:Sport_in_Tower_Hamlets 15.31877538723478
15 dbc:Post_towns_in_the_RH_postcode_area 15.31877538723478
16 dbc:Germanic_countries_and_territories 15.236313227042807
17 dbc:British_capitals 15.236313227042807
18 dbc:Local_government_in_West_Sussex 15.236313227042807
19 dbc:G20_nations 15.158310715041532
20 dbc:Altruism 15.158310715041532
21 dbc:1936_establishments_in_Australia 15.084310133597755
22 dbc:Areas_of_Rochdale_Borough 15.013920805706357
23 dbc:Recurring_sporting_events_established_in_1930 14.946806609847819
24 dbc:British_Armed_Forces 14.882676272428105
25 dbc:Western_Europe 14.821275727763961
26 dbc:Central_Europe 14.821275727763961
27 dbc:Australian_art_awards 14.821275727763961
28 dbc:Commonwealth_of_Nations 14.705798510344026
29 dbc:Populated_places_established_in_the_1st_century 14.705798510344026
30 dbc:Awards_established_in_1936 14.705798510344026
31 dbc:Ice_hockey_leagues_in_Ontario 14.651350726321649
32 dbc:Towns_in_Tyne_and_Wear 14.651350726321649
33 dbc:Commonwealth_sport 14.598883306427513
34 dbc:Crawley 14.598883306427513
35 dbc:Member_states_of_NATO 14.598883306427513
36 dbc:Visitor_attractions_in_Tower_Hamlets 14.598883306427513
37 dbc:Sports_originating_in_England 14.499347632876601
38 dbc:Royal_Marines 14.452041918098244
39 dbc:Summer_Olympic_sports 14.452041918098244
40 dbc:Northern_Europe 14.361844109126665
41 dbc:Geography_of_Tower_Hamlets 14.361844109126665
42 dbc:Post_towns_in_the_NE_postcode_area 14.318775387234778
43 dbc:Commonwealth_Games 14.318775387234778
44 dbc:Boxing 14.318775387234778
45 dbc:New_towns_in_England 14.276955211540152
46 dbc:Member_states_of_the_European_Union 14.276955211540152
47 dbc:Sports_organisations_of_the_United_Kingdom 14.236313227042807
48 dbc:Social_ethics 14.196784862856168
49 dbc:Member_states_of_the_Union_for_the_Mediterranean 14.158310715041532
50 dbc:Constitutional_monarchies 14.158310715041532
51 dbc:Politics_and_sports 14.12083600962287
52 dbc:English_geneticists 14.084310133597755
53 dbc:European_martial_arts 14.048686223867035
54 dbc:Fullerian_Professors_of_Physiology 13.914385132155443
55 dbc:Orthography 13.882676272428105
56 dbc:Arthurian_locations 13.821275727763961
57 dbc:People_from_Consett 13.791528384369908
58 dbc:Boxers_at_the_2014_Commonwealth_Games 13.762382038710392
59 dbc:Volunteerism 13.762382038710392
60 dbc:British_Commandos 13.762382038710392
61 dbc:Member_states_of_the_Commonwealth_of_Nations 13.67831777392192
62 dbc:Countries_in_Europe 13.651350726321649
63 dbc:Newcastle_United_F.C._non-playing_staff 13.573348214320376
64 dbc:Junior_ice_hockey_leagues_in_Canada 13.523595179123276
65 dbc:Civil_society 13.499347632876601
66 dbc:Capitals_in_Europe 13.499347632876601
67 dbc:Sports_venues_in_London 13.452041918098244
68 dbc:States_and_territories_established_in_1918 13.428958304985203
69 dbc:Local_government_districts_of_South_East_England 13.340149038027347
70 dbc:Fire 13.297713771706949
71 dbc:Giving 13.196784862856168
72 dbc:Multi-sport_events 13.139451687790215
73 dbc:History_of_Tower_Hamlets 13.102457480308015
74 dbc:Military_of_the_United_Kingdom 13.084310133597755
75 dbc:Basic_financial_concepts 13.084310133597755
76 dbc:Presidents_of_the_British_Science_Association 13.066388225600493
77 dbc:Combat_sports 13.048686223867035
78 dbc:Philanthropy 12.97997347378302
79 dbc:Writing_systems 12.93050479751872
80 dbc:Liberal_democracies 12.882676272428105
81 dbc:English-speaking_countries_and_territories 12.821275727763961
82 dbc:Reading_(process) 12.776881608405509
83 dbc:People_from_Hartlepool 12.733812886513624
84 dbc:Public_finance 12.719737701301899
85 dbc:Olympic_boxers_of_Great_Britain 12.638053903708192
86 dbc:Applied_linguistics 12.624878514960459
87 dbc:Buildings_and_structures_in_Tower_Hamlets 12.475500890922232
88 dbc:Public_administration 12.463723723145877
89 dbc:Republics 12.463723723145877
90 dbc:British_Empire 12.395010973061865
91 dbc:Island_countries 12.395010973061865
92 dbc:Individual_sports 12.361844109126665
93 dbc:Articles_including_recorded_pronunciations_(UK_English) 12.361844109126665
94 dbc:Team_sports 12.048686223867035
95 dbc:Taxation 11.938632678402122
96 dbc:Member_states_of_the_United_Nations 11.813781191217037
97 dbc:Social_philosophy 11.469600289482548
98 dbc:Educational_psychology 11.345542296797564
99 dbc:Boxers_at_the_2012_Summer_Olympics 11.246366891706728
100 dbc:Commonwealth_Games_gold_medallists_for_England 11.005358792202934
101 dbc:Finance 10.882676272428105
102 dbc:Royal_Medal_winners 10.765993292262772
103 dbc:English_boxers 10.598883306427513
104 dbc:Commonwealth_Games_competitors_for_England 10.340149038027345
105 dbc:People_educated_at_Rugby_School 10.251420119433014
106 dbc:Scunthorpe_United_F.C._players 10.21888615528462
107 dbc:Middleweight_boxers 10.209021535375065
108 dbc:Carlisle_United_F.C._players 10.0117755338748
109 dbc:Areas_of_London 9.969526686347905
110 dbc:Barnsley_F.C._players 9.89448557471774
111 dbc:Darlington_F.C._players 9.522067709376683
112 dbc:Alumni_of_St_John's_College,_Cambridge 9.073082878174501
113 dbc:1926_deaths 8.743569852967576
114 dbc:1861_births 8.426813033153644
115 dbc:Fellows_of_the_Royal_Society 7.1139165956830785
116 dbc:Articles_containing_video_clips 6.744905475689931
117 dbc:1961_births 6.23537427807319
118 dbc:1991_births 6.231624581249177
119 dbc:Association_football_defenders 5.542342354790045
120 dbc:English_footballers 5.408794089514977
121 dbc:Living_people -0.0
Out of the first 10, only "dbc:Boxing_venues_in_the_United_Kingdom" is related to the topic of the text. "dbc:British_Islands" might also be OK because the text is about boxing in the UK. The category dbc:Boxing is ranked 44. There are many sport-related categories, but they are spread across the ranks, mostly in the middle and lower ranks. A general "sports" category is not in the category list.
Actually the word "boxing" appears in almost every sentence of the document. Maybe we should not take only informativeness into account but also how frequently an entity appears?
The last informativeness value is 0 / negative? Is this possible? Negative informativeness might indicate a bug.
I can share my script if anyone is interested. It's written in PHP.
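For readers who want to reproduce such a ranking without the PHP script, the following Python sketch shows one way to do it. It is not the script mentioned above. It assumes that the JSON-LD output has a top-level "@graph" list and that category nodes carry a "freme-onto:info" value as in the example at the top of this issue; the helper name rank_categories is hypothetical.

```python
import json


def rank_categories(document):
    """Return (category id, informativeness) pairs, most informative first."""
    ranked = []
    # Assumption: nodes are listed under a top-level "@graph" key.
    for node in document.get("@graph", []):
        if "freme-onto:info" not in node:
            continue  # not a category node
        ranked.append((node["@id"], float(node["freme-onto:info"])))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Tiny sample built from values quoted in this issue.
sample = json.loads("""
{
  "@graph": [
    {"@id": "dbpedia:Category:Living_people", "freme-onto:info": "-0.0"},
    {"@id": "dbpedia:Category:Western_Europe", "freme-onto:info": "14.821275727763961"},
    {"@id": "dbpedia:Category:Boxing", "freme-onto:info": "14.318775387234778"},
    {"@id": "http://freme-project.eu/#char=10,17", "nif:anchorOf": "Germany"}
  ]
}
""")

for rank, (category, info) in enumerate(rank_categories(sample), start=1):
    print(rank, category, info)
```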
The last informativeness value is 0 / negative? Is this possible? Negative informativeness might indicate a bug.
Yes, it seems to be a bug. Actually, the Living_people category is the least informative category: it is the category with the most DBpedia resources.
Actually the word "boxing" appears in almost every sentence of the document. Maybe we should not take only informativeness into account but also how frequently an entity appears?
Maybe a combination of both.
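One possible way to combine the two signals is sketched below in Python. This is only an illustration of the idea, not something FREME implements: multiplying informativeness by a mention count is an assumption, and the mention counts are hypothetical. The informativeness values are taken from the list above.

```python
def combined_scores(informativeness, mentions):
    """Weight each category's informativeness by how often it was mentioned.

    Categories without a mention count get a score of 0.
    """
    return {
        category: info * mentions.get(category, 0)
        for category, info in informativeness.items()
    }


informativeness = {
    "dbc:British_Islands": 15.94680660984782,
    "dbc:Boxing": 14.318775387234778,
}
# Hypothetical counts of how often entities linked to each category occur.
mentions = {"dbc:British_Islands": 1, "dbc:Boxing": 20}

scores = combined_scores(informativeness, mentions)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these made-up counts, dbc:Boxing moves ahead of dbc:British_Islands even though its informativeness alone is lower.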
... waiting for feedback from @koidl and/or @xFran
@koidl and/or @xFran, if you have feedback on this could you please provide it in this issue? Thanks.
I'm busy right now. I have to write an app for the Trinity College Dublin/FALCON project. As soon as I have some time to implement topic detection in Wripl's application, or Kevin has time to test it, we will provide feedback. Sorry about this, guys.
Ok. My other project is going well, so I may be able to take a look at this on Monday. Is there anything I should focus on in particular, or should I just use some random text and check whether the topics make sense? Does topic detection work only for English? Maybe the last question is a silly one. But hey! It's mine! :-)
Any suggestions please? Thanks in advance.
I would suggest that you take text that is actually used by WRIPL and comes from one of the WRIPL domains, e.g. finance. It should be text for which you have clear expectations of what the topics are. I suggest you start with English text. I think it should work for all languages that FREME NER supports, so you can also test other languages, as long as you understand them well enough to judge whether the topic detection is right.
Ok @jnehring. I will try to do my best following your recommendations. Regarding languages, I can test in English, Spanish and Romanian (though I believe Romanian is not supported by the NER engine; it would be nice to have it in the future), and Kevin can test in German. Those are the languages we manage here.
Thank you.
Currently, entities can be enriched with categories only when they are spotted in English texts. Note that the categories are retrieved from DBpedia. You can check all available categories for an entity by 1) visiting the DBpedia page for the entity and looking at dcterms:subject, or 2) querying them via our FREME LDF endpoint at rv2622.1blu.de:5000/dbpedia-types
Thank you for the note @m1ci.
Hi all
I'm running some tests here. Please see the attached text file.
First cURL command, on FREME's production server:
curl -v -d @test.txt "http://api.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=&outformat=json-ld&language=en&dataset=dbpedia"
Response:
{
"timestamp": 1445250576139,
"error": "Bad Gateway",
"status": 502,
"exception": "eu.freme.broker.exception.ExternalServiceFailedException",
"path": "//e-entity/freme-ner/documents/"
}
* Closing connection 0
Second cURL command, on FREME's dev server:
curl -v -d @test.txt "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=&outformat=json-ld&language=en&dataset=dbpedia"
Response:
HTTP/1.1 200 OK
But the strange part here is that even though the response is 200 OK, no entities are spotted.
And no categories, of course.
I really hope I did something wrong, because if not, this can affect Wripl a lot.
@xFran
You discovered a bug in FREME NER and I created a bug report for it: https://github.com/freme-project/freme-ner/issues/47
The problem lies within the &input=&outformat=... part of your request. The input parameter overwrites the POST body, so you submit an empty string for enrichment. I could reproduce the bug, and when I remove the &input= bit it works.
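As an illustration of the workaround, this Python sketch builds the same request without the input parameter, so that the POST body carries the text. It only constructs the request and does not send it. The helper name build_request is hypothetical, and the "text/plain" Content-Type value is an assumption that has not been checked against the API.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/"


def build_request(text):
    """Build a POST request whose body is the text to enrich.

    The "input" query parameter is deliberately left out, because it
    would overwrite the POST body.
    """
    params = {
        "informat": "text",
        "outformat": "json-ld",
        "language": "en",
        "dataset": "dbpedia",
    }
    url = BASE + "?" + urlencode(params)
    return Request(
        url,
        data=text.encode("utf-8"),
        # Assumption: the API accepts a plain-text content type.
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


request = build_request("Berlin is in Germany.")
print(request.full_url)
```

The request could then be sent with urllib.request.urlopen(request).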
Second test
I'm using the same boxing article as @jnehring used before.
On FREME's production server I get the same error. No changes here.
On FREME's development server:
curl -v -d @test.txt "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=&outformat=json-ld&language=en&dataset=dbpedia&enrichement=dbpedia-categories"
I get 3 entities and 11 categories. Not all of them are related to boxing or sports, but they seem to make sense.
The category system is not deployed on production. Please use dev for testing.
For the categories I used dev. I just wanted to see whether there are any differences in spotting entities between dev and prod.
@xFran
Your cURL command is not OK. With -d, cURL by default sets the Content-Type header to application/x-www-form-urlencoded, which is then incorrectly interpreted by the API. When calling the service, also pass -H "Content-Type:", which makes cURL send the request without a Content-Type header:
curl -v -d @test-2.txt "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=&outformat=json-ld&language=en&dataset=dbpedia&enrichement=dbpedia-categories" -H "Content-Type:"
@m1ci Still no entities.
curl -v -d @test.txt "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&outformat=json-ld&language=en&dataset=dbpedia&enrichement=dbpedia-categories" -H "Content-Type:"
cURL fixed.
Removing the first paragraph I get:
One entity
{
"@id" : "http://freme-project.eu/#char=10,17",
"@type" : [ "nif:Phrase", "nif:Word", "nif:RFC5147String", "nif:String" ],
"nif:anchorOf" : "Germany",
"beginIndex" : "10",
"endIndex" : "17",
"referenceContext" : "http://freme-project.eu/#char=0,2889",
"taClassRef" : "http://nerd.eurecom.fr/ontology#Location",
"itsrdf:taConfidence" : 0.5323128034518319,
"taIdentRef" : "dbpedia:Germany"
}
and 11 categories.
The categories are not really relevant for this article.
Germany/German is mentioned 13 times in the article, so this entity is OK. China/China’s is mentioned 4 times and was not spotted as an entity.
@m1ci Still no entities.
I believe this is related to the https://github.com/freme-project/freme-ner/issues/41 issue
Can you please move the discussion about getting no entities to a new GitHub issue? We should not mix up different topics in the same discussion.
I can move it into another GitHub issue, no problem. This issue is about "Feedback on topic detection", and I don't know whether this is a bug, whether there is any relation between the issues, or whether it is just my fault because I messed things up. Nothing is clear yet, and I don't know the code or the workflow of your application either. If you don't mind, could you or @m1ci open the GitHub issue if needed? You know better than I do what is going on.
Hi @xFran, I think this issue is about whether this type of output (see top of the issue)
1 dbc:British_Islands 15.94680660984782
2 dbc:Slavic_countries_and_territories 15.821275727763961
is relevant for you, and if not, what could be changed. This type of output is not about entity identification itself but about the topic. I assume this is why @jnehring proposed keeping only discussion related to topic detection here and moving the discussion about entity detection to a different issue.
Actually, I think this issue is too long. We should close it. I created #53 to collect information about the quality of topic detection. @xFran, please start new issues for the bugs you discovered. The problem looks very similar to https://github.com/freme-project/freme-ner/issues/46, so maybe you can link to it in your new issue.
@m1ci wrote
@koidl the topics and associated weights and labels are already provided as part of the output. See http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=Berlin+is+in+Germany.&outformat=turtle&language=en&dataset=dbpedia&enrichement=dbpedia-categories
The category information is not included by default. You need to add the enrichement=dbpedia-categories parameter to include the topics in the output.
Any feedback is more than welcome.