Closed jnehring closed 8 years ago
I could not find a list of all HTML tags that produce newlines. Considering the HTML5 specification:
I do not know where to find an authoritative list of HTML tags that should produce newlines. Apache Tika does this well, maybe we can find inspiration there.
Via email we agreed with @katia-vistatec that she will consider converting either all of these tags to newlines or, if handling all of them is too much work, only the most important ones.
Hi, and what about the tag title? Should it produce new lines also?
Yes, good idea!
This is the list of HTML tags I am considering for producing new lines: h1, h2, h3, h4, h5, h6, p, div, pre, title, article, section, address, footer, table, tr, td, thead, tbody, th, caption, ul, li, ol, dl, dt, dd, br, hr, img, iframe, canvas.
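To make the list above concrete, here is a minimal sketch of how such a tag list could drive a text extractor. It is written in Python with the standard library's html.parser and is not the actual NIF Converter code; the names NEWLINE_TAGS, NewlineTextExtractor and html_to_text are hypothetical. It assumes that a newline is emitted for both the opening and the closing tag of a listed element, and only once for a self-closing tag such as <br/>.

```python
from html.parser import HTMLParser

# Tags taken from the list in this thread; every other tag is dropped silently.
NEWLINE_TAGS = {
    "h1", "h2", "h3", "h4", "h5", "h6", "p", "div", "pre", "title",
    "article", "section", "address", "footer", "table", "tr", "td",
    "thead", "tbody", "th", "caption", "ul", "li", "ol", "dl", "dt",
    "dd", "br", "hr", "img", "iframe", "canvas",
}


class NewlineTextExtractor(HTMLParser):
    """Collects text content and inserts '\n' at listed tags (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _maybe_break(self, tag):
        if tag in NEWLINE_TAGS:
            self.parts.append("\n")

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit a single newline, not two.
        self._maybe_break(tag)

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the plain text of `html` with newlines at the listed tags."""
    parser = NewlineTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(repr(html_to_text("hello<br/>world")))
    print(repr(html_to_text("Highlight <strong>in</strong> text.")))
```

With this sketch, inline tags such as strong leave the text untouched, while a paragraph like <p>hello world</p> is wrapped in a leading and a trailing newline. Whether leading newlines are desirable is a separate design question for the real converter.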
Pushed change to repository
I submitted a small input:
curl -X POST --header "Content-Type: text/html" --data "<p>hello world</p><div>abc</div><div>xyz</div>" "http://api-dev.freme-project.eu/current/toolbox/nif-converter?outformat=turtle"
It produced
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
<http://freme-project.eu/#char=14,19>
a nif:String , nif:RFC5147String , nif:Phrase ;
nif:anchorOf "\nabc\n"@en ;
nif:beginIndex "14"^^xsd:nonNegativeInteger ;
nif:endIndex "19"^^xsd:nonNegativeInteger ;
nif:referenceContext <http://freme-project.eu/#char=0,25> ;
dc:identifier "2" .
<http://freme-project.eu/#char=0,13>
a nif:String , nif:RFC5147String , nif:Phrase ;
nif:anchorOf "\nhello world\n"@en ;
nif:beginIndex "0"^^xsd:nonNegativeInteger ;
nif:endIndex "13"^^xsd:nonNegativeInteger ;
nif:referenceContext <http://freme-project.eu/#char=0,25> ;
dc:identifier "1" .
<http://freme-project.eu/#char=20,25>
a nif:String , nif:RFC5147String , nif:Phrase ;
nif:anchorOf "\nxyz\n"@en ;
nif:beginIndex "20"^^xsd:nonNegativeInteger ;
nif:endIndex "25"^^xsd:nonNegativeInteger ;
nif:referenceContext <http://freme-project.eu/#char=0,25> ;
dc:identifier "3" .
<http://freme-project.eu/#char=0,25>
a nif:String , nif:Context , nif:RFC5147String ;
nif:beginIndex "0"^^xsd:nonNegativeInteger ;
nif:endIndex "25"^^xsd:nonNegativeInteger ;
nif:isString "\nhello world\n \nabc\n \nxyz\n"@en .
I wonder why it produces three annotations. Is this correct?
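A quick way to sanity-check the output is to verify that each annotation's beginIndex and endIndex really select its anchorOf text from the context string. The sketch below does that in Python with the values copied from the Turtle output above; the helper name check_offsets is hypothetical. The three annotations line up with the three elements in the input (one p and two div), which suggests, but does not confirm, that the converter emits one nif:Phrase per text-bearing element.

```python
# Values copied from the Turtle output in this comment.
context = "\nhello world\n \nabc\n \nxyz\n"  # nif:isString of the context

# (beginIndex, endIndex, anchorOf) for each nif:Phrase.
phrases = [
    (0, 13, "\nhello world\n"),
    (14, 19, "\nabc\n"),
    (20, 25, "\nxyz\n"),
]


def check_offsets(context, phrases):
    """Return True if every (begin, end, anchor) triple matches context[begin:end]."""
    return all(context[begin:end] == anchor for begin, end, anchor in phrases)


if __name__ == "__main__":
    print(len(context))                     # 25, the endIndex of the context
    print(check_offsets(context, phrases))  # True
```

The offsets are consistent with the context string, so the annotations themselves look well-formed; the open question is only whether one phrase per element is the intended behaviour.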
When I try to submit a longer request with SB50.html, the request never finishes. This seems to be a new bug:
curl -X POST --header "Content-Type: text/html" --data "@SB50.html" "http://api-dev.freme-project.eu/current/toolbox/nif-converter?outformat=turtle"
Hi, if SB50.html is the same file used to reproduce issue #35, I think there is either a bug in internationalization that prevents this file from being processed correctly, or something that is not implemented yet. When I used this file (see issue 35) for testing with the e-tilde translation service, the service returned an error response.
There is a new bug in e-Terminology caused by leading newlines: https://github.com/freme-project/e-services/issues/17#issuecomment-230841515
For the request that does not finish I created this bug report: https://github.com/freme-project/basic-services/issues/59 I think the rest of the issue is solved.
Some HTML tags should be converted to newlines, others not. This is important to have ideal input for underlying tokenizers / line splitters / sentence detectors. Some examples I created using the NIF Converter:
"hello<br/>world" gets converted to "helloworld", but should get converted to "hello\nworld". So in this example the <br> tag should get converted into a newline.
"Highlight <strong>in</strong> text." gets converted to "Highlight in text" without newlines, which is totally ok.
Question: How can we create a list of tags that should produce newlines? I think a newline should be generated by each HTML tag that produces a line break when the HTML is rendered, like p, div, br. But there might be others. This can also be influenced by CSS.