freme-project / e-Entity

Apache License 2.0

http://dbpedia.org/page/Not detected as entity #49

Closed jnehring closed 8 years ago

jnehring commented 8 years ago

"NOT" is detected as an entity. For example, this call

curl -X POST --header "Content-Type: text/n3" --header "Accept: text/n3" "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?input=This%20is%20a%20FREE%20report%20from%20Insider%20Monkey.%20Credit%20Card%20is%20NOT%20required.&informat=text&outformat=turtle&language=en&dataset=dbpedia"

produces this NIF:

<http://freme-project.eu/#char=58,61>
        a                     nif:RFC5147String , nif:String , nif:Phrase , nif:Word ;
        nif:anchorOf          "NOT"^^xsd:string ;
        nif:beginIndex        "58"^^xsd:int ;
        nif:endIndex          "61"^^xsd:int ;
        nif:referenceContext  <http://freme-project.eu/#char=0,71> ;
        itsrdf:taClassRef     <http://www.w3.org/2002/07/owl#Thing> ;
        itsrdf:taConfidence   "1.0"^^xsd:double ;
        itsrdf:taIdentRef     <http://dbpedia.org/resource/Not> .

When I look at http://dbpedia.org/page/Not and the corresponding Wikipedia page http://en.wikipedia.org/wiki/Not, I wonder what kind of named entity this is. It seems that "Not" is a disambiguation page.

How about excluding disambiguation pages in general from the named entity detection? I think they will produce bad entities in every case.

I just don't know how to identify a page as a disambiguation page. Maybe disambiguation pages have the property dbo:wikiPageDisambiguates?
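One way to test the dbo:wikiPageDisambiguates idea is a SPARQL ASK query per resource. The following is a minimal sketch, not something taken from the FREME code base: the endpoint URL (the public DBpedia SPARQL endpoint), the function names, and the assumption that a disambiguation page appears as the subject of dbo:wikiPageDisambiguates triples are all illustrative.

```python
import json
import urllib.parse
import urllib.request

# Full IRI of the property mentioned above (dbo: = http://dbpedia.org/ontology/).
DBO_DISAMBIGUATES = "http://dbpedia.org/ontology/wikiPageDisambiguates"
# Assumption: the public DBpedia SPARQL endpoint; swap in a local mirror if needed.
ENDPOINT = "https://dbpedia.org/sparql"


def build_ask_query(resource_uri: str) -> str:
    """Build an ASK query that is true if the resource disambiguates any page."""
    return f"ASK {{ <{resource_uri}> <{DBO_DISAMBIGUATES}> ?target }}"


def is_disambiguation_page(resource_uri: str, endpoint: str = ENDPOINT) -> bool:
    """Send the ASK query to the endpoint (requires network access)."""
    params = urllib.parse.urlencode({
        "query": build_ask_query(resource_uri),
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(f"{endpoint}?{params}", timeout=30) as response:
        # ASK results in SPARQL JSON carry a top-level "boolean" field.
        return bool(json.load(response)["boolean"])
```

Calling `is_disambiguation_page("http://dbpedia.org/resource/Not")` would then indicate whether the linked resource should be discarded, provided the endpoint exposes the property for that resource.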

johnmcauley commented 8 years ago

Thanks Jan,

I have added the file of examples to Gdrive - https://drive.google.com/open?id=0B1v6TnDXhoIbVXVEMHE1ZnliQW8

Let me know if you need anything else.

m1ci commented 8 years ago

> How about excluding disambiguation pages in general from the named entity detection?

Nice catch. Our training data contains surface forms pointing to disambiguation pages. We will remove such cases from the training data and re-index DBpedia.

> I just don't know how to identify a page as a disambiguation page. Maybe disambiguation pages have the property dbo:wikiPageDisambiguates?

There is a DBpedia dataset partition containing only disambiguation pages: http://downloads.dbpedia.org/2015-04/core/disambiguations_en.nt.bz2

We will use it to clean our training data. Note that this dataset is not 100% accurate. It is created using heuristics, since Wikipedia has no syntax to distinguish disambiguation links from ordinary links. But IMO it is good enough for our case.
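The cleaning step could look roughly like the sketch below. It is only an illustration under stated assumptions: that the dump is N-Triples with the disambiguation page as the subject of dbo:wikiPageDisambiguates triples, and that the training data can be viewed as (surface form, DBpedia URI) pairs. The function names and the pair format are hypothetical and do not describe the actual FREME NER training pipeline.

```python
import bz2
import re
from typing import Iterable, Iterator, Set, Tuple

# Captures the subject and predicate IRIs at the start of an N-Triples line.
TRIPLE_START = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+")
DBO_DISAMBIGUATES = "http://dbpedia.org/ontology/wikiPageDisambiguates"


def disambiguation_uris(lines: Iterable[str]) -> Set[str]:
    """Collect subjects of dbo:wikiPageDisambiguates triples."""
    uris = set()
    for line in lines:
        match = TRIPLE_START.match(line)
        # Comment lines and other predicates are skipped.
        if match and match.group(2) == DBO_DISAMBIGUATES:
            uris.add(match.group(1))
    return uris


def load_disambiguation_uris(path: str) -> Set[str]:
    """Read the bz2-compressed dump, e.g. a local copy of disambiguations_en.nt.bz2."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return disambiguation_uris(handle)


def clean_training_pairs(
    pairs: Iterable[Tuple[str, str]], blocked: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Yield only the (surface form, URI) pairs whose URI is not a disambiguation page."""
    for surface_form, uri in pairs:
        if uri not in blocked:
            yield surface_form, uri
```

Holding the blocked URIs in a set keeps each lookup constant time, which matters when the training data has many surface forms. Because the dump is heuristic, a few legitimate entities may be dropped and a few disambiguation pages may slip through.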

jnehring commented 8 years ago

I am closing this issue because it will be solved by freme-project/freme-ner#34.

m1ci commented 8 years ago

Yes, "NOT" will not be linked to http://dbpedia.org/resource/Not, but "NOT" will still be spotted as an entity.