Conversion of HTML tags to newlines

jnehring commented 8 years ago

Some HTML tags should be converted to newlines, others should not. This matters because the underlying tokenizers / line splitters / sentence detectors need well-formed input. Some examples I created using the NIF Converter:

hello<br/>world gets converted to helloworld, but should be converted to hello\nworld. So in this example the <br> tag should be converted into a newline.

Highlight <strong>in</strong> text. gets converted to Highlight in text. without any newlines, which is totally ok.

Question: How can we create a list of tags that should produce newlines? I think a newline should be generated for each HTML tag that produces a line break when rendered, like p, div, br. But there might be others. This can also be influenced by CSS.

jnehring commented 8 years ago

I could not find a list of all HTML tags that produce newlines. Considering the HTML5 specification:

All elements from "3.2.4.1.3 Sectioning content" and "3.2.4.1.4 Heading content" should produce newlines.
These tags should also produce newlines:
- article
- aside
- br
- canvas
- code
- fieldset
- hr
- iframe
- img
- ol
- p
- pre
- table
- ul
- li
- tr
- td

I do not know where to find an authoritative list of HTML tags that should produce newlines. Apache Tika does this well, maybe we can find inspiration there.

jnehring commented 8 years ago

Via email we agreed with @katia-vistatec that she will consider either converting all of these tags to newlines, or only the most important ones if handling all of them is too much work.

katia-vistatec commented 8 years ago

Hi, and what about the tag title? Should it produce new lines also?

jnehring commented 8 years ago

> and what about the tag title? Should it produce new lines also?

Yes, good idea!

katia-vistatec commented 8 years ago

This is the list of HTML tags I am considering for producing new lines: h1, h2, h3, h4, h5, h6, p, div, pre, title, article, section, address, footer, table, tr, td, thead, tbody, th, caption, ul, li, ol, dl, dt, dd, br, hr, img, iframe, canvas.
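
To make the intended behaviour concrete, here is a minimal sketch in Python using the standard library's html.parser. It is not the converter's actual implementation: the names NEWLINE_TAGS, NewlineTextExtractor and html_to_text are made up for illustration, and collapsing adjacent newlines is an assumption of this sketch (the deployed service may behave differently).

```python
from html.parser import HTMLParser

# Tag list copied from the comment above. Treating it as a flat set is an
# assumption of this sketch, not necessarily how the converter stores it.
NEWLINE_TAGS = {
    "h1", "h2", "h3", "h4", "h5", "h6", "p", "div", "pre", "title",
    "article", "section", "address", "footer", "table", "tr", "td",
    "thead", "tbody", "th", "caption", "ul", "li", "ol", "dl", "dt",
    "dd", "br", "hr", "img", "iframe", "canvas",
}


class NewlineTextExtractor(HTMLParser):
    """Collects text content and emits a newline at listed tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def _newline(self):
        # Do not stack newlines for adjacent tags such as </p><div> or <br/>.
        if not self.chunks or self.chunks[-1] != "\n":
            self.chunks.append("\n")

    def handle_starttag(self, tag, attrs):
        if tag in NEWLINE_TAGS:
            self._newline()

    def handle_endtag(self, tag):
        if tag in NEWLINE_TAGS:
            self._newline()

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = NewlineTextExtractor()
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    return "".join(parser.chunks)
```

With this sketch, html_to_text("hello<br/>world") yields hello\nworld, while inline tags such as strong leave the text on one line.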

katia-vistatec commented 8 years ago

Pushed change to repository

jnehring commented 8 years ago

I submitted a small input:

curl -X POST --header "Content-Type: text/html" --data "<p>hello world</p><div>abc</div><div>xyz</div>" "http://api-dev.freme-project.eu/current/toolbox/nif-converter?outformat=turtle"

It produced:

@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif:   <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix dc:    <http://purl.org/dc/elements/1.1/> .

<http://freme-project.eu/#char=14,19>
        a                     nif:String , nif:RFC5147String , nif:Phrase ;
        nif:anchorOf          "\nabc\n"@en ;
        nif:beginIndex        "14"^^xsd:nonNegativeInteger ;
        nif:endIndex          "19"^^xsd:nonNegativeInteger ;
        nif:referenceContext  <http://freme-project.eu/#char=0,25> ;
        dc:identifier         "2" .

<http://freme-project.eu/#char=0,13>
        a                     nif:String , nif:RFC5147String , nif:Phrase ;
        nif:anchorOf          "\nhello world\n"@en ;
        nif:beginIndex        "0"^^xsd:nonNegativeInteger ;
        nif:endIndex          "13"^^xsd:nonNegativeInteger ;
        nif:referenceContext  <http://freme-project.eu/#char=0,25> ;
        dc:identifier         "1" .

<http://freme-project.eu/#char=20,25>
        a                     nif:String , nif:RFC5147String , nif:Phrase ;
        nif:anchorOf          "\nxyz\n"@en ;
        nif:beginIndex        "20"^^xsd:nonNegativeInteger ;
        nif:endIndex          "25"^^xsd:nonNegativeInteger ;
        nif:referenceContext  <http://freme-project.eu/#char=0,25> ;
        dc:identifier         "3" .

<http://freme-project.eu/#char=0,25>
        a               nif:String , nif:Context , nif:RFC5147String ;
        nif:beginIndex  "0"^^xsd:nonNegativeInteger ;
        nif:endIndex    "25"^^xsd:nonNegativeInteger ;
        nif:isString    "\nhello world\n \nabc\n \nxyz\n"@en .

I wonder why it produces three annotations. Is this correct?
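
Independent of whether three annotations is the intended result, the offsets can at least be sanity-checked against the context string. The sketch below copies the nif:isString, nif:beginIndex, nif:endIndex and nif:anchorOf values from the output above by hand; check_offsets is a hypothetical helper name, and it assumes the indices count characters the same way Python string slicing does.

```python
# Values copied by hand from the Turtle output above.
context = "\nhello world\n \nabc\n \nxyz\n"

# (beginIndex, endIndex, anchorOf) for each of the three annotations.
annotations = [
    (0, 13, "\nhello world\n"),
    (14, 19, "\nabc\n"),
    (20, 25, "\nxyz\n"),
]


def check_offsets(context, annotations):
    """Return True for each annotation whose offsets select its anchor text."""
    return [context[begin:end] == anchor for begin, end, anchor in annotations]


print(len(context))                         # length of the context string
print(check_offsets(context, annotations))  # one boolean per annotation
```

All three slices match their anchors and the context length is 25, so the offsets are consistent with each other. The single space at positions 13 and 19 belongs to the context only and is not covered by any annotation.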

When I try to submit a longer request with SB50.html, the request never finishes. This seems to be a new bug:

curl -X POST --header "Content-Type: text/html" --data "@SB50.html" "http://api-dev.freme-project.eu/current/toolbox/nif-converter?outformat=turtle"

katia-vistatec commented 8 years ago

Hi, if SB50.html is the same file used to reproduce issue #35, I think there is some bug in internationalization that prevents this file from being processed correctly, or maybe something that is not implemented yet. When I used this file (see issue #35) for testing with the e-tilde translation service, the e-tilde service returned an error response.

ArneBinder commented 8 years ago

There is a new bug in e-Terminology caused by leading newlines: https://github.com/freme-project/e-services/issues/17#issuecomment-230841515

jnehring commented 8 years ago

For the request that does not finish I created this bug report: https://github.com/freme-project/basic-services/issues/59. I think the rest of the issue is solved.

freme-project / e-Internationalization

Conversion of HTML tags to newlines #39