freme-project / freme-ner

entities not recognized for the same sentence in different input #48

Closed: fsasaki closed this issue 9 years ago

fsasaki commented 9 years ago

I submitted the following sentence to FREME NER twice:

Hundreds of thousands of migrants, many from Syria, Africa and Afghanistan, have been making their way from Turkey to the Balkans in recent months, in a bid to reach Germany, Sweden and other EU states.

The first request (see attachment request1.txt) contained this sentence as part of an HTML file taken from http://www.bbc.com/news/world-europe-34576045 . In that case FREME NER recognized only three entities in the whole text and none in the sentence above; see out1.txt.

Submitting the same sentence to FREME NER on its own, without any other content, leads to many more recognized entities. See the attachments for the requests and outputs: request1.txt, out2.txt, request2.txt, out1.txt

jnehring commented 9 years ago

Looks similar to #46

m1ci commented 9 years ago

Thanks for reporting! Fixed with https://github.com/freme-project/freme-ner/commit/eae252d765f680c7622dcda953e6cf69371cd3ac

Note that the entity spotting models were trained on, and work well with, "clean" content: content without markup, without runs of blank spaces between the tokens, etc. We assume that FREME NER clients clean their content before they submit it to FREME NER. This is already understood and taken into account by Wripl: they clean the content before it is submitted to FREME NER.
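
For illustration, a minimal sketch of the kind of client-side cleanup described above, using only the Python standard library. The function name `clean_for_ner` and the set of block-level tags are assumptions made for this example; they are not part of FREME NER, and this is not the cleanup that Wripl or e-Internationalisation actually performs.

```python
import re
from html.parser import HTMLParser

# Hypothetical choice of tags that should act as word boundaries.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "tr", "td", "th",
              "h1", "h2", "h3", "h4", "h5", "h6", "section", "article"}
# Elements whose text content is never natural-language content.
SKIPPED_TAGS = {"script", "style"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops markup, scripts and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")  # keep adjacent blocks from fusing

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def clean_for_ner(raw):
    """Strip markup and collapse whitespace runs to single spaces."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    return re.sub(r"\s+", " ", parser.text()).strip()
```

A client would call `clean_for_ner(html_or_text)` and submit the returned string as plain text. Inline tags such as `<a>` or `<b>` add no spaces here, so punctuation that follows a link stays attached to the preceding word.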

fsasaki commented 9 years ago

Great, thank you! About the cleanup: that is fine for request2, since the content type is text/html. That invokes e-Internationalisation (Okapi), and in that way FREME NER receives the clean content. I'll close this bug then.