Closed jnehring closed 8 years ago
I've tested the following simple text:
http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/?informat=text&input=IBM shares fell more than 5 percent in after-hours trading Monday after the company announced third-quarter revenues that missed expectations, and offered weaker than expected guidance for the year. The tech giant reported earnings of $3.34 per share on $19.28 billion in revenue. Analysts had expected IBM to report earnings of about $3.30 a share on $19.62 billion in revenue, according to a consensus estimate from Thomson Reuters. IBM's revenue figure came in below the Wall Street low: The lowest revenue estimate of 17 analysts for the quarter was $19.292 billion. Monday's announcement marked the 14th straight quarter that IBM's revenues fell. Still, the company said its revenue change — a 19 percent decline from the same time last year — was only a 1 percent fall adjusting for currency and other factors. A few things to note...within the revenue we report we have a pretty substantial currency headwind we continue to deal with and as we transform our business we continue to move out of areas where we don't see long term value, so across our revenue base that's about 13 points of impact so excluding those two in the third quarter we reported a down about 1 percent revenue," IBM CFO Martin Schroeter told CNBC after the earnings report&outformat=turtle&language=en&dataset=dbpedia&enrichement=dbpedia-categories
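The input text in this request contains spaces, `$` signs, and quotation marks, which generally need percent-encoding in a query string. A minimal sketch of building the same request URL with the Python standard library follows; the endpoint and parameter names (including the `enrichement` spelling) are copied from the URL above, while the helper name `build_request_url` is only illustrative. The sketch builds the URL and does not send it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint and parameter names are copied from the request in this comment.
BASE_URL = "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents/"


def build_request_url(text: str) -> str:
    """Return the request URL with every query parameter percent-encoded."""
    params = {
        "informat": "text",
        "input": text,
        "outformat": "turtle",
        "language": "en",
        "dataset": "dbpedia",
        "enrichement": "dbpedia-categories",  # spelling as used in the request above
    }
    return BASE_URL + "?" + urlencode(params)


# One sentence from the test document, used as a short example input.
text = "The tech giant reported earnings of $3.34 per share on $19.28 billion in revenue."
url = build_request_url(text)

# Decoding the query string again should give back the original text.
decoded = parse_qs(urlsplit(url).query)
print(url)
```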
As output I get:
@prefix dbpedia-fr: http://fr.dbpedia.org/resource/ .
@prefix dbc: http://dbpedia.org/resource/Category: .
@prefix dbpedia-es: http://es.dbpedia.org/resource/ .
@prefix xsd: http://www.w3.org/2001/XMLSchema# .
@prefix itsrdf: http://www.w3.org/2005/11/its/rdf# .
@prefix dbpedia: http://dbpedia.org/resource/ .
@prefix rdfs: http://www.w3.org/2000/01/rdf-schema# .
@prefix nif: http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core# .
@prefix dbpedia-de: http://de.dbpedia.org/resource/ .
@prefix dbpedia-ru: http://ru.dbpedia.org/resource/ .
@prefix freme-onto: http://freme-project.eu/ns# .
@prefix dbpedia-nl: http://nl.dbpedia.org/resource/ .
@prefix dcterms: http://purl.org/dc/terms/ .
@prefix dbpedia-it: http://it.dbpedia.org/resource/ .
dbc:Thomson_Reuters rdfs:label "Thomson Reuters"@en ; freme-onto:info "13.705798510344026" .
dbc:1911_establishments_in_the_United_States rdfs:label "1911 establishments in the United States"@en ; freme-onto:info "13.914385132155443" .
http://freme-project.eu/#char=474,485 a nif:RFC5147String , nif:Phrase , nif:Word , nif:String ; nif:anchorOf "Wall Street"^^xsd:string ; nif:beginIndex "474"^^xsd:int ; nif:endIndex "485"^^xsd:int ; nif:referenceContext http://freme-project.eu/#char=0,1252 ; itsrdf:taClassRef http://nerd.eurecom.fr/ontology#Location ; itsrdf:taConfidence "0.6404511836029554"^^xsd:double ; itsrdf:taIdentRef dbpedia:Wall_Street .
dbc:Forts_of_New_Netherland rdfs:label "Forts of New Netherland"@en ; freme-onto:info "15.406238228485117" .
http://dbpedia.org/resource/Category:S&P/TSX_60_Index rdfs:label "S&P/TSX 60 Index"@en ; freme-onto:info "13.475500890922232" .
dbc:Computer_storage_companies rdfs:label "Computer storage companies"@en ; freme-onto:info "12.624878514960459" .
http://freme-project.eu/#char=1222,1226 a nif:String , nif:RFC5147String , nif:Word , nif:Phrase ; nif:anchorOf "CNBC"^^xsd:string ; nif:beginIndex "1222"^^xsd:int ; nif:endIndex "1226"^^xsd:int ; nif:referenceContext http://freme-project.eu/#char=0,1252 ; itsrdf:taClassRef http://nerd.eurecom.fr/ontology#Organization ; itsrdf:taConfidence "0.9769373255376529"^^xsd:double ; itsrdf:taIdentRef dbpedia:CNBC .
http://freme-project.eu/#char=0,1252 a nif:String , nif:Context , nif:RFC5147String ; nif:beginIndex "0"^^xsd:int ; nif:endIndex "1252"^^xsd:int ; nif:isString "IBM shares fell more than 5 percent in after-hours trading Monday after the company announced third-quarter revenues that missed expectations, and offered weaker than expected guidance for the year. The tech giant reported earnings of $3.34 per share on $19.28 billion in revenue. Analysts had expected IBM to report earnings of about $3.30 a share on $19.62 billion in revenue, according to a consensus estimate from Thomson Reuters. IBM's revenue figure came in below the Wall Street low: The lowest revenue estimate of 17 analysts for the quarter was $19.292 billion. Monday's announcement marked the 14th straight quarter that IBM's revenues fell. Still, the company said its revenue change — a 19 percent decline from the same time last year — was only a 1 percent fall adjusting for currency and other factors. A few things to note...within the revenue we report we have a pretty substantial currency headwind we continue to deal with and as we transform our business we continue to move out of areas where we don't see long term value, so across our revenue base that's about 13 points of impact so excluding those two in the third quarter we reported a down about 1 percent revenue,\" IBM CFO Martin Schroeter told CNBC after the earnings report"^^xsd:string .
dbc:Electronics_companies_of_the_United_States rdfs:label "Electronics companies of the United States"@en ; freme-onto:info "10.795213431177766" .
dbpedia:Thomson_Reuters dcterms:subject dbc:Companies_established_in_2008 , dbc:Companies_based_in_Manhattan , dbc:Media_companies_based_in_New_York_City , dbc:Companies_listed_on_the_Toronto_Stock_Exchange , dbc:Bibliographic_database_providers , dbc:2008_establishments_in_Ontario , dbc:Media_companies_of_Canada , dbc:Thomson_Reuters , http://dbpedia.org/resource/Category:S&P/TSX_60_Index , dbc:Companies_listed_on_the_New_York_Stock_Exchange , dbc:Financial_data_vendors , dbc:Publicly_traded_companies_based_in_New_York_City , dbc:Multinational_companies_based_in_New_York_City .
http://freme-project.eu/#char=0,3 a nif:RFC5147String , nif:Phrase , nif:Word , nif:String ; nif:anchorOf "IBM"^^xsd:string ; nif:beginIndex "0"^^xsd:int ; nif:endIndex "3"^^xsd:int ; nif:referenceContext http://freme-project.eu/#char=0,1252 ; itsrdf:taClassRef http://nerd.eurecom.fr/ontology#Organization ; itsrdf:taConfidence "0.9955540309317645"^^xsd:double ; itsrdf:taIdentRef dbpedia:IBM .
http://freme-project.eu/#char=418,433 a nif:String , nif:RFC5147String , nif:Word , nif:Phrase ; nif:anchorOf "Thomson Reuters"^^xsd:string ; nif:beginIndex "418"^^xsd:int ; nif:endIndex "433"^^xsd:int ; nif:referenceContext http://freme-project.eu/#char=0,1252 ; itsrdf:taClassRef http://nerd.eurecom.fr/ontology#Organization ; itsrdf:taConfidence "0.8978024142458758"^^xsd:double ; itsrdf:taIdentRef dbpedia:Thomson_Reuters .
dbc:Digital-only_radio_stations rdfs:label "Digital-only radio stations"@en ; freme-onto:info "11.784186409028742" .
dbc:U.S._Route_9W rdfs:label "U.S. Route 9W"@en ; freme-onto:info "14.598883306427513" .
dbc:Semiconductor_companies rdfs:label "Semiconductor companies"@en ; freme-onto:info "13.102457480308015" .
dbc:Publicly_traded_companies_based_in_New_York_City rdfs:label "Publicly traded companies based in New York City"@en ; freme-onto:info "14.821275727763961" .
dbc:Media_companies_of_Canada rdfs:label "Media companies of Canada"@en ; freme-onto:info "15.598883306427513" .
dbc:Companies_based_in_Manhattan rdfs:label "Companies based in Manhattan"@en ; freme-onto:info "12.452041918098242" .
dbc:Companies_listed_on_the_New_York_Stock_Exchange rdfs:label "Companies listed on the New York Stock Exchange"@en ; freme-onto:info "8.830698981650588" .
dbc:Media_companies_based_in_New_York_City rdfs:label "Media companies based in New York City"@en ; freme-onto:info "13.406238228485119" .
dbc:Cloud_computing_providers rdfs:label "Cloud computing providers"@en ; freme-onto:info "11.733812886513624" .
dbc:Television_stations_in_New_Jersey rdfs:label "Television stations in New Jersey"@en ; freme-onto:info "13.406238228485119" .
dbpedia:Wall_Street dcterms:subject dbc:Colonial_forts_in_New_York , dbc:Wall_Street , dbc:Occupy_Wall_Street , dbc:Forts_of_New_Netherland , http://dbpedia.org/resource/Category:Financial_District,_Manhattan , dbc:Streets_in_Manhattan .
dbc:Peabody_Award_winners rdfs:label "Peabody Award winners"@en ; freme-onto:info "10.878761222424721" .
dbc:Computer_hardware_companies rdfs:label "Computer hardware companies"@en ; freme-onto:info "12.102457480308015" .
dbc:Companies_listed_on_the_Toronto_Stock_Exchange rdfs:label "Companies listed on the Toronto Stock Exchange"@en ; freme-onto:info "10.918398194662066" .
dbc:Foundry_semiconductor_companies rdfs:label "Foundry semiconductor companies"@en ; freme-onto:info "15.158310715041532" .
dbc:Companies_established_in_2008 rdfs:label "Companies established in 2008"@en ; freme-onto:info "10.726758128979672" .
dbpedia:CNBC dcterms:subject dbc:CNBC_global_channels , dbc:Digital-only_radio_stations , dbc:English-language_television_stations_in_the_United_States , dbc:24-hour_television_news_channels_in_the_United_States , dbc:NBCUniversal_networks , dbc:Business-related_television_channels , dbc:Peabody_Award_winners , dbc:Television_stations_in_New_Jersey , dbc:U.S._Route_9W , dbc:NBCUniversal , dbc:Television_channels_and_stations_established_in_1989 .
dbc:Point_of_sale_companies rdfs:label "Point of sale companies"@en ; freme-onto:info "14.158310715041532" .
dbc:Colonial_forts_in_New_York rdfs:label "Colonial forts in New York"@en ; freme-onto:info "14.705798510344026" .
dbc:Financial_data_vendors rdfs:label "Financial data vendors"@en ; freme-onto:info "14.276955211540152" .
dbc:Companies_established_in_1896 rdfs:label "Companies established in 1896"@en ; freme-onto:info "13.256491108980436" .
dbc:Occupy_Wall_Street rdfs:label "Occupy Wall Street"@en ; freme-onto:info "14.361844109126665" .
http://dbpedia.org/resource/Category:Companies_based_in_Westchester_County,_New_York rdfs:label "Companies based in Westchester County, New York"@en ; freme-onto:info "14.361844109126665" .
http://dbpedia.org/resource/Category:Financial_District,_Manhattan rdfs:label "Financial District, Manhattan"@en ; freme-onto:info "12.821275727763961" .
dbc:1896_establishments_in_the_United_States rdfs:label "1896 establishments in the United States"@en ; freme-onto:info "14.452041918098244" .
http://freme-project.eu/#char=435,438 a nif:String , nif:RFC5147String , nif:Phrase , nif:Word ; nif:anchorOf "IBM"^^xsd:string ; nif:beginIndex "435"^^xsd:int ; nif:endIndex "438"^^xsd:int ; nif:referenceContext http://freme-project.eu/#char=0,1252 ; itsrdf:taClassRef http://nerd.eurecom.fr/ontology#Organization ; itsrdf:taConfidence "0.9960635494338839"^^xsd:double ; itsrdf:taIdentRef dbpedia:IBM .
dbc:CNBC_global_channels rdfs:label "CNBC global channels"@en ; freme-onto:info "15.499347632876601" .
dbc:NBCUniversal rdfs:label "NBCUniversal"@en ; freme-onto:info "12.97997347378302" .
dbc:News rdfs:label "News"@en ; freme-onto:info "15.406238228485117" .
http://freme-project.eu/#char=1192,1195 a nif:RFC5147String , nif:Word , nif:String , nif:Phrase ; nif:anchorOf "IBM"^^xsd:string ; nif:beginIndex "1192"^^xsd:int ; nif:endIndex "1195"^^xsd:int ; nif:referenceContext http://freme-project.eu/#char=0,1252 ; itsrdf:taClassRef http://nerd.eurecom.fr/ontology#Organization ; itsrdf:taConfidence "0.9729965656552476"^^xsd:double ; itsrdf:taIdentRef dbpedia:IBM .
dbc:NBCUniversal_networks rdfs:label "NBCUniversal networks"@en ; freme-onto:info "13.882676272428105" .
dbc:Collier_Trophy_recipients rdfs:label "Collier Trophy recipients"@en ; freme-onto:info "12.97997347378302" .
dbc:Display_technology_companies rdfs:label "Display technology companies"@en ; freme-onto:info "12.97997347378302" .
http://freme-project.eu/#char=303,306 a nif:RFC5147String , nif:Word , nif:String , nif:Phrase ; nif:anchorOf "IBM"^^xsd:string ; nif:beginIndex "303"^^xsd:int ; nif:endIndex "306"^^xsd:int ; nif:referenceContext http://freme-project.eu/#char=0,1252 ; itsrdf:taClassRef http://nerd.eurecom.fr/ontology#Organization ; itsrdf:taConfidence "0.9918151685372821"^^xsd:double ; itsrdf:taIdentRef dbpedia:IBM .
dbc:Bibliographic_database_providers rdfs:label "Bibliographic database providers"@en ; freme-onto:info "14.048686223867035" .
dbc:Business-related_television_channels rdfs:label "Business-related television channels"@en ; freme-onto:info "14.762382038710392" .
dbc:American_brands rdfs:label "American brands"@en ; freme-onto:info "11.748026745733323" .
http://freme-project.eu/#char=1200,1216 a nif:RFC5147String , nif:String , nif:Phrase , nif:Word ; nif:anchorOf "Martin Schroeter"^^xsd:string ; nif:beginIndex "1200"^^xsd:int ; nif:endIndex "1216"^^xsd:int ; nif:referenceContext http://freme-project.eu/#char=0,1252 ; itsrdf:taClassRef http://nerd.eurecom.fr/ontology#Person ; itsrdf:taConfidence "0.8597548963180837"^^xsd:double .
dbc:Wall_Street rdfs:label "Wall Street"@en ; freme-onto:info "14.361844109126665" .
dbc:Streets_in_Manhattan rdfs:label "Streets in Manhattan"@en ; freme-onto:info "12.158310715041532" .
dbpedia:IBM dcterms:subject dbc:Cloud_computing_providers , dbc:Point_of_sale_companies , dbc:Computer_storage_companies , dbc:Collier_Trophy_recipients , dbc:Software_companies_based_in_New_York , dbc:UML_Partners , dbc:1911_establishments_in_the_United_States , dbc:Outsourcing_companies , dbc:IBM , dbc:Semiconductor_companies , dbc:Foundry_semiconductor_companies , dbc:News , dbc:Electronics_companies_of_the_United_States , dbc:Computer_companies_of_the_United_States , dbc:Companies_established_in_1896 , dbc:Companies_listed_on_the_New_York_Stock_Exchange , dbc:Display_technology_companies , dbc:Multinational_companies_headquartered_in_the_United_States , dbc:National_Medal_of_Technology_recipients , dbc:1896_establishments_in_the_United_States , dbc:Computer_hardware_companies , dbc:American_brands , http://dbpedia.org/resource/Category:Companies_based_in_Westchester_County,_New_York , dbc:Companies_in_the_Dow_Jones_Industrial_Average .
dbc:National_Medal_of_Technology_recipients rdfs:label "National Medal of Technology recipients"@en ; freme-onto:info "12.361844109126665" .
dbc:Software_companies_based_in_New_York rdfs:label "Software companies based in New York"@en ; freme-onto:info "13.276955211540152" .
dbc:UML_Partners rdfs:label "UML Partners"@en ; freme-onto:info "15.406238228485117" .
dbc:Television_channels_and_stations_established_in_1989 rdfs:label "Television channels and stations established in 1989"@en ; freme-onto:info "11.798907914735507" .
http://freme-project.eu/#char=631,634 a nif:String , nif:RFC5147String , nif:Phrase , nif:Word ; nif:anchorOf "IBM"^^xsd:string ; nif:beginIndex "631"^^xsd:int ; nif:endIndex "634"^^xsd:int ; nif:referenceContext http://freme-project.eu/#char=0,1252 ; itsrdf:taClassRef http://nerd.eurecom.fr/ontology#Organization ; itsrdf:taConfidence "0.9915816574782368"^^xsd:double ; itsrdf:taIdentRef dbpedia:IBM .
dbc:Outsourcing_companies rdfs:label "Outsourcing companies"@en ; freme-onto:info "12.361844109126665" .
dbc:IBM rdfs:label "IBM"@en ; freme-onto:info "13.428958304985203" .
dbc:2008_establishments_in_Ontario rdfs:label "2008 establishments in Ontario"@en ; freme-onto:info "13.499347632876601" .
dbc:English-language_television_stations_in_the_United_States rdfs:label "English-language television stations in the United States"@en ; freme-onto:info "11.651350726321649" .
dbc:Companies_in_the_Dow_Jones_Industrial_Average rdfs:label "Companies in the Dow Jones Industrial Average"@en ; freme-onto:info "14.499347632876601" .
dbc:Computer_companies_of_the_United_States rdfs:label "Computer companies of the United States"@en ; freme-onto:info "11.053091402987036" .
dbc:Multinational_companies_headquartered_in_the_United_States rdfs:label "Multinational companies headquartered in the United States"@en ; freme-onto:info "12.791528384369911" .
dbc:24-hour_television_news_channels_in_the_United_States rdfs:label "24-hour television news channels in the United States"@en ; freme-onto:info "13.475500890922232" .
dbc:Multinational_companies_based_in_New_York_City rdfs:label "Multinational companies based in New York City"@en ; freme-onto:info "14.882676272428105"
This is an earnings report from IBM. I have the following questions:
dbc:Thomson_Reuters rdfs:label "Thomson Reuters"@en ; freme-onto:info "13.705798510344026" .
dbc:1911_establishments_in_the_United_States rdfs:label "1911 establishments in the United States"@en ; freme-onto:info "13.914385132155443" .
Are these topics meant to express the whole document, based on their occurrence in the topics of each entity?
http://freme-project.eu/#char=474,485 a nif:RFC5147String , nif:Phrase , nif:Word , nif:String ; nif:anchorOf "Wall Street"^^xsd:string ; nif:beginIndex "474"^^xsd:int ; nif:endIndex "485"^^xsd:int ; nif:referenceContext http://freme-project.eu/#char=0,1252 ; itsrdf:taClassRef http://nerd.eurecom.fr/ontology#Location ; itsrdf:taConfidence "0.6404511836029554"^^xsd:double ; itsrdf:taIdentRef dbpedia:Wall_Street .
dbc:Forts_of_New_Netherland rdfs:label "Forts of New Netherland"@en ; freme-onto:info "15.406238228485117" .
http://dbpedia.org/resource/Category:S&P/TSX_60_Index rdfs:label "S&P/TSX 60 Index"@en ; freme-onto:info "13.475500890922232" .
dbc:Computer_storage_companies rdfs:label "Computer storage companies"@en ; freme-onto:info "12.624878514960459" .
@m1ci @jnehring quick question: which GitHub issue relates to the idea of using the TYPES to filter the results a bit more? We still get far too many entities. They may all be correct, but depending on the use case they are not all needed.
My idea here is to hit it with all we have: the general-purpose categories (as now), domain-specific taxonomies (which look the same as the Wikipedia categories, except that the terminology comes from the taxonomy, not from Wikipedia), and finally TYPE.
The end user can set flags to either see everything or just some combination of the three options above.
Let me know what you think.
This is an earnings report from IBM. I have the following questions:
The topics listed on the top: dbc:Thomson_Reuters rdfs:label "Thomson Reuters"@en ; freme-onto:info "13.705798510344026" .
dbc:1911_establishments_in_the_United_States rdfs:label "1911 establishments in the United States"@en ; freme-onto:info "13.914385132155443" .
dbc:Thomson_Reuters is a category assigned to the dbpedia:Thomson_Reuters entity. dbc:1911_establishments_in_the_United_States is a category assigned to the dbpedia:IBM entity.
Are these topics meant to express the whole document, based on their occurrence in the topics of each entity?
No, they relate to the entities occurring in the document. If an entity is central to the document, then with high probability its categories will also be relevant for the document. Note that these are categories assigned to the entities occurring in the document.
I haven't looked at the others yet, but I assume this will be similar in most cases, most likely because the question is domain specific. If I am looking for historical topics related to 'Wall Street', the first topic might be spot on. I will send some ideas in the folder Jan prepared.
Let me stress one important point once more: topic detection (or document classification) won't work if you rely only on entities. You also need to consider terms. My suggestion is that you build a document classification system on top of the results provided by e-Entity and e-Terminology. To implement this, you'll need to 1) obtain training data, that is, documents with assigned topics, and 2) decide what information from e-Entity (e.g., entity types or entity categories) and e-Terminology you are going to use for training. IMO, this is the only and the right way to go.
Thanks for your reply @m1ci
Regarding the document classification system: how much data is required, and what kind of data?
In this finance example we get over 1000 topics in a hierarchical list. I assume we need to add training data for each topic? Related to this: how much training data do we need, and of what quality (can we use web links, for example to the article I am using above)?
thanks
@m1ci @jnehring quick question: which GitHub issue relates to the idea of using the TYPES to filter the results a bit more? We still get far too many entities. They may all be correct, but depending on the use case they are not all needed.
My idea here is to hit it with all we have: the general-purpose categories (as now), domain-specific taxonomies (which look the same as the Wikipedia categories, except that the terminology comes from the taxonomy, not from Wikipedia), and finally TYPE.
How will you link an entity with a taxonomy entry? Via the Wikipedia categories? Note that there are more than 900K Wikipedia categories.
What do you mean by TYPE? Filtering out entities from a specific domain? For example, spot entities only from the Sports or Politics domain?
There is no such issue. After 0.4 is done, we will continue working on that.
Yes, filter by a specific domain. In the DBpedia Spotlight demo there is a list one can select from.
Regarding the document classification system: how much data is required, and what kind of data?
Very valid question.
how much data is required
Hard to say; it depends on the data. Maybe you can first train with, let's say, a few hundred docs, see how it performs, and then increase the training size.
what kind of data.
Well, for training you'll need some hundreds of documents which will be processed with e-entity and e-terminology. Also you will need topics assigned to each document. The results from e-entity and e-terminology will be used as features for training, while the topics will be used as classes for learning.
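To make the features/classes split concrete, here is a minimal sketch of such a classifier: a small multinomial Naive Bayes written with the standard library only. Naive Bayes is just one possible choice, not something this thread prescribes. The feature strings (`type:...`, `category:...`, `term:...`), the topic labels, and the four training documents are invented for illustration; real features would come from e-Entity and e-Terminology output, and a real training set would need the few hundred documents mentioned above.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTopicClassifier:
    """Multinomial Naive Bayes over bags of annotation features."""

    def __init__(self):
        self.doc_counts = Counter()                 # topic -> number of training docs
        self.feature_counts = defaultdict(Counter)  # topic -> feature -> count
        self.vocabulary = set()                     # all features seen in training

    def train(self, documents):
        """documents: iterable of (features, topic) pairs."""
        for features, topic in documents:
            self.doc_counts[topic] += 1
            self.feature_counts[topic].update(features)
            self.vocabulary.update(features)

    def predict(self, features):
        """Return the topic with the highest log-probability for the features."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_topic, best_score = None, float("-inf")
        for topic, n_docs in self.doc_counts.items():
            counts = self.feature_counts[topic]
            total = sum(counts.values())
            score = math.log(n_docs / total_docs)  # class prior
            for feature in features:
                # Add-one smoothing so unseen features do not zero out a topic.
                score += math.log((counts[feature] + 1) / (total + vocab_size))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


# Hypothetical training data: features per document plus the assigned topic.
training = [
    (["type:Organization", "category:Companies_listed_on_the_New_York_Stock_Exchange",
      "term:revenue", "term:earnings"], "finance"),
    (["type:Organization", "category:Financial_data_vendors", "term:revenue"], "finance"),
    (["type:Person", "category:Football_clubs", "term:match", "term:goal"], "sports"),
    (["type:Organization", "type:Person", "term:goal", "term:season"], "sports"),
]

classifier = NaiveBayesTopicClassifier()
classifier.train(training)
predicted = classifier.predict(["type:Organization", "term:revenue", "term:earnings"])
print(predicted)
```

With only four toy documents this says nothing about real accuracy; it only shows where the annotations enter as features and where the assigned topics enter as classes.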
Yes, filter by a specific domain. In the DBpedia Spotlight demo there is a list one can select from.
With DBpedia Spotlight you can specify entity types and not domains. Just to have it clear.
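Until type filtering exists on the service side, one option is to filter the returned entities by type on the client. The sketch below assumes the NIF output has already been parsed into plain dictionaries; the dictionary layout and the `filter_by_type` helper are assumptions for illustration. The anchors, types, and confidences are taken from the example output earlier in this issue.

```python
NERD = "http://nerd.eurecom.fr/ontology#"

# Entities from the example output (anchor text, itsrdf:taClassRef, itsrdf:taConfidence).
annotations = [
    {"anchor": "IBM", "type": NERD + "Organization", "confidence": 0.9955540309317645},
    {"anchor": "Thomson Reuters", "type": NERD + "Organization", "confidence": 0.8978024142458758},
    {"anchor": "Wall Street", "type": NERD + "Location", "confidence": 0.6404511836029554},
    {"anchor": "Martin Schroeter", "type": NERD + "Person", "confidence": 0.8597548963180837},
    {"anchor": "CNBC", "type": NERD + "Organization", "confidence": 0.9769373255376529},
]


def filter_by_type(entities, allowed_types):
    """Keep only entities whose type (the part after '#') is in allowed_types."""
    return [
        entity for entity in entities
        if entity["type"].rsplit("#", 1)[-1] in allowed_types
    ]


organizations = filter_by_type(annotations, {"Organization"})
print([entity["anchor"] for entity in organizations])
```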
Ah okay, that's what I mean: entity types.
We then get three enrichments (please correct me if I am wrong):
I have to confess the third is not clear to me exactly. Especially the latter part. My understanding currently:
What then? Do we then get document classification and NER? I am still not sure how this will connect.
I am worried about the data requirement. Is there a simpler (but not as precise) approach, such as semantic matching, which only counts the words in a document that are similar to the terms in a taxonomy and takes the term with the most hits (not sure if that exists)?
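A rough sketch of the counting idea, for illustration: it only counts exact, case-insensitive matches of each taxonomy term, so it does not cover "similar" words (stemming or synonyms would be needed for that). The taxonomy terms are invented; the text is three sentences from the test document.

```python
import re
from collections import Counter


def count_taxonomy_hits(text, taxonomy_terms):
    """Count whole-word, case-insensitive occurrences of each taxonomy term."""
    lowered = text.lower()
    hits = Counter()
    for term in taxonomy_terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        hits[term] = len(re.findall(pattern, lowered))
    return hits


# Hypothetical taxonomy terms, not taken from any real taxonomy.
taxonomy_terms = ["revenue", "earnings", "currency", "dividend"]

text = (
    "The tech giant reported earnings of $3.34 per share on $19.28 billion in revenue. "
    "Analysts had expected IBM to report earnings of about $3.30 a share on $19.62 billion "
    "in revenue, according to a consensus estimate from Thomson Reuters. "
    "IBM's revenue figure came in below the Wall Street low"
)

hits = count_taxonomy_hits(text, taxonomy_terms)
best_term, best_count = hits.most_common(1)[0]
print(best_term, best_count)
```

This needs no training data, but it cannot tell a passing mention from a central topic.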
It seems to me that none of the three categories is very useful. How do they relate to Wall Street?
As I understand the algorithm, you sort all categories by the score in freme-onto:info and take the top n. With this algorithm, the top 3 categories for your example text are:
dbc:Media_companies_of_Canada
rdfs:label "Media companies of Canada"@en ;
freme-onto:info "15.598883306427513" .
dbc:CNBC_global_channels
rdfs:label "CNBC global channels"@en ;
freme-onto:info "15.499347632876601" .
dbc:Forts_of_New_Netherland
rdfs:label "Forts of New Netherland"@en ;
freme-onto:info "15.406238228485117" .
But I still cannot see how the categories relate to the text.
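The sort-and-take-top-n step described above can be sketched as follows. The regular expression assumes the one-line layout in which the category statements appear in this issue; it is not a general Turtle parser. The five sample lines are copied from the output above, and ranking the highest score first is the reading of freme-onto:info used in this comment, not something confirmed elsewhere in the thread.

```python
import re

# A few category statements copied from the example output.
TURTLE_LINES = [
    'dbc:Companies_listed_on_the_New_York_Stock_Exchange rdfs:label "Companies listed on the New York Stock Exchange"@en ; freme-onto:info "8.830698981650588" .',
    'dbc:Media_companies_of_Canada rdfs:label "Media companies of Canada"@en ; freme-onto:info "15.598883306427513" .',
    'dbc:American_brands rdfs:label "American brands"@en ; freme-onto:info "11.748026745733323" .',
    'dbc:CNBC_global_channels rdfs:label "CNBC global channels"@en ; freme-onto:info "15.499347632876601" .',
    'dbc:Forts_of_New_Netherland rdfs:label "Forts of New Netherland"@en ; freme-onto:info "15.406238228485117" .',
]

# Matches: <category> rdfs:label "<label>"@en ; freme-onto:info "<score>"
CATEGORY_PATTERN = re.compile(
    r'^(\S+) rdfs:label "([^"]*)"@en ; freme-onto:info "([^"]*)"'
)


def top_categories(lines, n):
    """Return the n (category, score) pairs with the highest freme-onto:info."""
    scored = []
    for line in lines:
        match = CATEGORY_PATTERN.match(line)
        if match:
            scored.append((match.group(1), float(match.group(3))))
    # Highest score first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


top3 = top_categories(TURTLE_LINES, 3)
for category, score in top3:
    print(category, score)
```

In the full output, dbc:News and dbc:UML_Partners have the same score as dbc:Forts_of_New_Netherland, so which of the three lands in third place depends on how ties are broken.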
This issue is no longer up to date and can be closed.
This issue is about collecting feedback on the quality of topic detection. We want to find out if the topic detection implemented by InfAI can solve the requirements of WRIPL when WRIPL asked for a topic detection.
This issue is a fresh start of the long discussion in #50.