freme-project / freme-ner

Apache License 2.0

About titles, headings and newlines. #138

Closed ghsnd closed 8 years ago

ghsnd commented 8 years ago

Hi,

I'm not entirely sure whether this is a real issue, or whether it belongs to FREME NER or to e-Internationalization...

I have an HTML page here: history.html.txt.

It starts with

<!DOCTYPE HTML>
<html>
    <head>
        <title>Origins</title>
        <meta charset="UTF-8">
        <link rel="stylesheet" type="text/css" href="media/style.css">
    </head>
    <body>
        <main>
        <h1>Origins</h1>
        <figure><img src="media/Leipzig_1632.jpg" alt="Leipzig in 1632"></figure>
        <p>Leipzig was first documented in 1015...
...

I apply FREME NER (dev) with the following command:

curl -X POST --header 'Content-Type: text/html' --header 'Accept: text/turtle' -d '@history.html.txt' 'http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&mode=spot%2Clink'

Because the word Origins appears in both the <title> and an <h1>, the output contains, among other things:

<http://freme-project.eu/#char=0,1579>
        a               nif:String , nif:Context , nif:RFC5147String ;
        nif:beginIndex  "0"^^xsd:int ;
        nif:endIndex    "1579"^^xsd:int ;
        nif:isString    "\nOrigins\n \nOrigins\n \n\n \nLeipzig was first documented in 1015 ... "^^xsd:string .

which, IMO, seems good (correct me if I'm wrong here), though the first Origins comes from the HTML title, which is only displayed as the title of the browser window.

FREME NER detects Origins Origins Leipzig as an entity:

<http://freme-project.eu/#char=1,31>
        a                     nif:RFC5147String , nif:String , nif:Phrase , nif:Word ;
        nif:anchorOf          "Origins Origins Leipzig"^^xsd:string ;
        nif:beginIndex        "1"^^xsd:int ;
        nif:endIndex          "31"^^xsd:int ;
        nif:referenceContext  <http://freme-project.eu/#char=0,1579> ;
        itsrdf:taConfidence   "0.8093460384423357"^^xsd:double .

What I actually expect is that only Leipzig is detected as an entity.
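
To make the offsets concrete, here is a small sketch that slices the visible prefix of the nif:isString value with the reported nif:beginIndex and nif:endIndex. The variable names are made up for illustration, the context string is only the part shown above (the rest is elided in the output), and collapsing whitespace to reproduce nif:anchorOf is an assumption about how the anchor text was derived, not something confirmed by FREME NER.

```python
# Visible prefix of nif:isString from the output above; the remainder is elided there.
context_prefix = "\nOrigins\n \nOrigins\n \n\n \nLeipzig was first documented in 1015"

# nif:beginIndex and nif:endIndex of the detected entity
begin, end = 1, 31

# Character span covered by the entity, including the newlines between text units
span = context_prefix[begin:end]
print(repr(span))

# Assumption: the anchor text is the span with whitespace runs collapsed to single spaces
anchor = " ".join(span.split())
print(anchor)
```

The span covers the title, the heading and the first word of the paragraph, which is consistent with the entity crossing three separate text units.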

So my questions are:

  1. Should data in the <title> section appear in the input for NER?
  2. Should text separated by a few newlines (\nOrigins\n \nOrigins\n \n\n \nLeipzig in this case) be detectable as one entity?

m1ci commented 8 years ago

Your questions:

  1. Should data in the <title> section appear in the input for NER?
  2. Should text separated by a few newlines (\nOrigins\n \nOrigins\n \n\n \nLeipzig in this case) be detectable as one entity?

...relate to e-Internationalization (@katia-vistatec). I can perhaps respond to 2): the text is quite irregular, and FREME NER was trained on "normal" texts.

katia-vistatec commented 8 years ago

Hi. The title appears in the NIF file because it is a text unit, just like the text in paragraphs or headings. I don't know whether there is a particular reason to have it in the NIF.
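
If the title is not wanted in a particular use case, one possible client-side workaround is to collect the text units before calling the service and leave out anything inside <title>. The sketch below uses only Python's standard html.parser. The class name TextUnits and the shortened sample page are hypothetical, and this is not how e-Internationalization itself extracts text.

```python
from html.parser import HTMLParser


class TextUnits(HTMLParser):
    """Collects non-empty text units, ignoring text inside skipped elements."""

    def __init__(self, skip=("title",)):
        super().__init__()
        self.skip = set(skip)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.units = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.units.append(text)


# Shortened stand-in for the page in this issue (the real file is longer).
sample = (
    "<!DOCTYPE HTML><html><head><title>Origins</title>"
    '<meta charset="UTF-8"></head><body><main><h1>Origins</h1>'
    "<p>Leipzig was first documented in 1015</p></main></body></html>"
)

parser = TextUnits()
parser.feed(sample)
parser.close()
print(parser.units)
```

With the title skipped, only the heading and the paragraph remain as text units. Whether that is desirable depends on the use case, since a title can also carry meaningful content.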

ghsnd commented 8 years ago

Thanks for the answers. I think this is more a philosophical discussion than a technical one, so let's close it.