freme-project / e-Internationalization

Apache License 2.0

More newlines in conversion html -> plaintext #38

Closed jnehring closed 8 years ago

jnehring commented 8 years ago

When sending this homepage to FREME NER via e-Internationalization, we discovered the following behaviour:

Part of the input HTML:

It&rsquo;s almost as intense as facing down the Irish team on the pitch. Almost&hellip;<br />
&nbsp;<br />
Main image via <a href="http://www.independent.ie/sport/rugby/six-nations/six-nations-2015-six-reasons-why-ireland-will-win-the-championship-31074759.html">Irish Independent</a><br />

gets converted to

It’s almost as intense as facing down the Irish team on the pitch. Almost…   Main image via Irish Independent

Note that the newline between "Almost..." and "Main image" is gone. This confuses FREME NER and leads to "... Main" being detected as an entity.

The quality of NER improves when the text is

It’s almost as intense as facing down the Irish team on the pitch. Almost…   \n
Main image via Irish Independent

Open question: What is the best way to add newlines? I think inline tags should not produce a newline, but all other tags should.
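
To make the proposed rule concrete, here is a minimal sketch in Python using the standard library's `html.parser`. It is not the e-Internationalization converter; it only illustrates the idea. The `INLINE_TAGS` set, the class name `NewlinePreservingExtractor`, and the function `html_to_text` are hypothetical names chosen for this example, and treating `<br>` as a newline-producing tag is an assumption.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked set of inline tags. A real converter may use a
# different list. <br> is deliberately left out so that it produces a newline.
INLINE_TAGS = {"a", "abbr", "b", "code", "em", "i", "small",
               "span", "strong", "sub", "sup", "u"}


class NewlinePreservingExtractor(HTMLParser):
    """Collects text and emits a newline at every non-inline tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &rsquo;, &hellip;, ...
        self.parts = []

    def _break_unless_inline(self, tag):
        if tag not in INLINE_TAGS:
            self.parts.append("\n")

    def handle_starttag(self, tag, attrs):
        self._break_unless_inline(tag)

    def handle_endtag(self, tag):
        self._break_unless_inline(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> get one newline, not two.
        self._break_unless_inline(tag)

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of `html` with newlines at non-inline tags."""
    parser = NewlinePreservingExtractor()
    parser.feed(html)
    parser.close()  # flush any text buffered after the last tag
    return "".join(parser.parts)
```

With this rule the `<a>` element stays on the same line as its surrounding text, while each `<br />` ends the line, so "Almost…" and "Main image via Irish Independent" end up on separate lines. Consecutive newlines are not collapsed here; whether they should be is left open.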

jnehring commented 8 years ago

Shell commands to reproduce the issue:

wget http://www.visitdublin.com/rugby-in-dublin/#ru

curl -X POST --header "Content-Type: text/html" --data "@index.html" --header "Accept: text/turtle" "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia" > out.txt

Open out.txt and look for the line containing "Main image". The line is

Main image via Irish IndependentThe Butcher Grill

The original HTML has a newline between "Independent" and "The Butcher".

The solution is to preserve newlines when converting from HTML to NIF.

katia-vistatec commented 8 years ago

Pushed changes to remote repository.

jnehring commented 8 years ago

Problem solved. Related issue: https://github.com/freme-project/e-Internationalization/issues/39

Thank you @katia-vistatec