freme-project / e-Entity

e-Entity NIF of HTML does not return correct whitespace #69

Open bjdmeest opened 8 years ago

bjdmeest commented 8 years ago

When I send the following request with the HTML below, the whitespace in the nif:anchorOf values of the returned NIF is incorrect. For example, nif:anchorOf Greater Athens is returned instead of nif:anchorOf Greater\nAthens.

curl -X POST --header "Content-Type: text/html" --header "Accept: application/ld+json" -d "in.html" "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&mode=all"

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>
<p>For the Athenians the most popular way of dividing the City proper is through its neighbourhoods such as Pagkrati, Ambelokipi, Exarcheia, Patissia, Ilissia, Petralona, Koukaki and Kypseli, each with its own distinct history and characteristics.</p>
<p>The Athens municipality also forms the core and center of Greater
Athens which consists of the Athens municipality and 34 more
municipalities, which are divided in the four regional units (North,
West, Central and South Athens) mentioned above.</p>
</body>
</html>

fsasaki commented 8 years ago

This may be caused by the underlying e-Internationalisation service. Adding @borriellom to see what she thinks. If you want to keep line breaks inside a <p> HTML element, adding a <br> element may help.

bjdmeest commented 8 years ago

The problem is not HTML-specific: the result is the same for plain-text input. Greater Athens is detected, which is good, but the nif:anchorOf value does not match the original text.

For the Athenians the most popular way of dividing the City proper is through its neighbourhoods such as Pagkrati, Ambelokipi, Exarcheia, Patissia, Ilissia, Petralona, Koukaki and Kypseli, each with its own distinct history and characteristics.
The Athens municipality also forms the core and center of Greater
Athens which consists of the Athens municipality and 34 more
municipalities, which are divided in the four regional units (North,
West, Central and South Athens) mentioned above.
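
The mismatch described above can be checked mechanically. In NIF, nif:anchorOf is expected to equal the substring of the input selected by nif:beginIndex and nif:endIndex. The following is a minimal sketch of that check, not FREME code: `checkAnchor` is a hypothetical helper, and the annotation object is illustrative, with offsets computed locally rather than taken from a service response. For this ASCII-only sample, JavaScript string indices and character offsets coincide.

```javascript
// Sample input from this thread, with its line breaks preserved.
const text =
  "The Athens municipality also forms the core and center of Greater\n" +
  "Athens which consists of the Athens municipality and 34 more\n" +
  "municipalities";

// Hypothetical helper (not part of FREME): compares nif:anchorOf with the
// substring of the input selected by nif:beginIndex and nif:endIndex.
function checkAnchor(input, annotation) {
  const expected = input.slice(annotation.beginIndex, annotation.endIndex);
  return { expected, matches: expected === annotation.anchorOf };
}

// Illustrative annotation: offsets are computed here, and anchorOf is the
// whitespace-normalized value reported in this issue.
const beginIndex = text.indexOf("Greater");
const annotation = {
  beginIndex,
  endIndex: beginIndex + "Greater\nAthens".length,
  anchorOf: "Greater Athens",
};

const result = checkAnchor(text, annotation);
// JSON.stringify makes the line break visible as \n in the printed output.
console.log(result.matches, JSON.stringify(result.expected));
```

Printing the expected value through `JSON.stringify` keeps the newline visible, which is easy to miss when the two strings are compared by eye.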

fsasaki commented 8 years ago

Thanks for pointing this out, @bjdmeest, so this is indeed a different issue.

jnehring commented 8 years ago

The same problem occurs in DBpedia Spotlight.

m1ci commented 8 years ago

@bjdmeest this is related to https://github.com/freme-project/freme-ner/issues/59. See the discussion and the solution there.

If that does not solve the issue, feel free to reopen it so we can further investigate.

bjdmeest commented 8 years ago

Actually, the curl request below (so without --data or --data-binary) also does not return the correct result: it returns nif:anchorOf "Greater Athens"^^xsd:string ; instead of nif:anchorOf """Greater
Athens"""^^xsd:string ;.

curl -X POST --header "Content-Type: text/html" --header "Accept: text/n3" "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?input=The%20Athens%20municipality%20also%20forms%20the%20core%20and%20center%20of%20Greater%0AAthens%20which%20consists%20of%20the%20Athens%20municipality%20and%2034%20more%0Amunicipalities&informat=text&outformat=turtle&language=en&dataset=dbpedia&mode=all"
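
To confirm that the line breaks really are part of this request, the query string can be decoded locally. The sketch below uses the standard `URL` API and only inspects the URL quoted above; it does not contact the service. The variable names are illustrative.

```javascript
// The request URL from the curl command above, unchanged.
const requestUrl =
  "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents" +
  "?input=The%20Athens%20municipality%20also%20forms%20the%20core%20and%20center%20of%20Greater%0AAthens" +
  "%20which%20consists%20of%20the%20Athens%20municipality%20and%2034%20more%0Amunicipalities" +
  "&informat=text&outformat=turtle&language=en&dataset=dbpedia&mode=all";

// searchParams percent-decodes the value, so %0A becomes a newline character.
const input = new URL(requestUrl).searchParams.get("input");

// JSON.stringify shows the decoded newlines as \n.
console.log(JSON.stringify(input));
```

Since `%0A` decodes to a line feed, the text sent to the service contains "Greater" and "Athens" separated by a newline, not a space.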

Neither does the AJAX request below:

$.ajax('http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&mode=all', {
    method: 'POST',
    headers: {
        'Content-Type': 'text/html'
    },
    data: '<p>The Athens municipality also forms the core and center of Greater\nAthens which consists of the Athens municipality and 34 more\nmunicipalities</p>',
    success: function (data) {
        console.log(data);
    },
    crossDomain: true
});

m1ci commented 8 years ago

Thanks, we will investigate this and get back to you.

m1ci commented 8 years ago

@sandroacoelho can you look at it? See the explanation below.

The following request:

curl -v "http://rv2622.1blu.de:8081/api/entities?format=TTL&language=en&dataset=dbpedia" --data-binary @doc.txt

where doc.txt is the document.

In the results we get nif:anchorOf "Greater Athens"^^xsd:string ;

but it should be nif:anchorOf "Greater\nAthens"^^xsd:string ;
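
Until the service preserves whitespace, a client could map a normalized anchor back onto the original text. This is only a client-side workaround sketch, not a fix for the service and not something proposed in this thread: `findOriginalSpan` is a hypothetical helper that treats every whitespace run in the returned anchor as matching any whitespace run in the input.

```javascript
// Hypothetical helper: finds the first span in `text` that equals `anchor`
// up to differences in whitespace, and returns it with its offsets.
function findOriginalSpan(text, anchor) {
  // Escape regex metacharacters so each token is matched literally.
  const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Join the tokens with \s+ so a space in the anchor can match a newline.
  const pattern = anchor.trim().split(/\s+/).map(escapeRegExp).join("\\s+");
  const match = new RegExp(pattern).exec(text);
  if (match === null) {
    return null;
  }
  return {
    beginIndex: match.index,
    endIndex: match.index + match[0].length,
    anchorOf: match[0],
  };
}

const sample = "the core and center of Greater\nAthens which consists of";
const span = findOriginalSpan(sample, "Greater Athens");
console.log(JSON.stringify(span));
```

Note that this returns the first match only, so repeated mentions of the same entity would still need the offsets from the service to disambiguate.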