jnehring closed this issue 8 years ago
Shell commands to reproduce the issue:
wget http://www.visitdublin.com/rugby-in-dublin/#ru
curl -X POST --header "Content-Type: text/html" --data "@index.html" --header "Accept: text/turtle" "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia" > out.txt
Open out.txt and look for the line containing "Main image". The line is:
Main image via Irish IndependentThe Butcher Grill
The original HTML has a newline between "Independent" and "The Butcher".
The solution for this is to preserve newlines when converting from HTML to NIF.
Pushed changes to the remote repository.
Problem solved. Related issue: https://github.com/freme-project/e-Internationalization/issues/39
Thank you @katia-vistatec
When sending this homepage to FREME NER via e-Internationalization, we discovered the following behaviour:
Part of the input HTML:
gets converted to
Note that the newline between "Almost..." and "Main image" is gone. This confuses FREME NER and leads to "... Main" being detected as an entity.
The quality of NER gets better when the text is
Open question: What is the best way to add newlines? I think inline tags should not produce a newline, but all other tags should.
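To make the proposal concrete, here is a minimal Python sketch of the rule "inline tags produce no newline, every other tag does". It is not the e-Internationalization implementation and it produces plain text rather than NIF. The `INLINE_TAGS` set, the class name `NewlinePreservingExtractor`, the function `html_to_text`, and the sample HTML are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Assumption: an illustrative subset of inline tags, not an official or complete list.
INLINE_TAGS = {"a", "abbr", "b", "code", "em", "i", "small",
               "span", "strong", "sub", "sup"}


class NewlinePreservingExtractor(HTMLParser):
    """Collects text content and emits a newline at every non-inline tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _maybe_break(self, tag):
        # Any tag outside INLINE_TAGS is treated as a line boundary.
        if tag not in INLINE_TAGS:
            self.parts.append("\n")

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Convert HTML to plain text, keeping line breaks at non-inline tags."""
    parser = NewlinePreservingExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Normalise whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    # Hypothetical markup, loosely modelled on the page described in this issue.
    sample = ('<p>Main image via <a href="#">Irish Independent</a></p>'
              '<h3>The Butcher Grill</h3>')
    print(html_to_text(sample))
```

With this rule the link text stays on the same line as "Main image via", while the heading that follows starts a new line, so the two phrases are no longer glued together. The sketch does not skip the contents of script or style elements, which a real converter would have to handle.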