dice-group / AGDISTIS

AGDISTIS - Agnostic Named Entity Disambiguation
http://aksw.org/Projects/AGDISTIS.html
GNU Affero General Public License v3.0

Handling entities with punctuations #53

Closed RicardoUsbeck closed 6 years ago

RicardoUsbeck commented 7 years ago

When I run a Chinese webservice using Java and query it with:

curl --data-urlencode "text='<entity>???</entity>.'" -d type='agdistis' http://localhost:8080/AGDISTIS

I get:

[{"disambiguatedURL":"http:\/\/aksw.org\/notInWiki\/???","offset":3,"namedEntity":"???","start":1}]

And in the terminal window where the webservice is running I see an error:

17:31:08,377 ERROR [org.aksw.agdistis.util.TripleIndex] 143 - <Cannot parse '': Encountered "<EOF>" at line 1, column 0.
Was expecting one of:
   <NOT> ...
   "+" ...
   "-" ...
   <BAREOPER> ...
   "(" ...
   "*" ...
   <QUOTED> ...
   <TERM> ...
   <PREFIXTERM> ...
   <WILDTERM> ...
   <REGEXPTERM> ...
   "[" ...
   "{" ...
   <NUMBER> ...
   <TERM> ...
   "*" ...
    -> null>
Oct 05, 2017 5:31:08 PM org.restlet.engine.log.LogFilter afterHandle
INFO: 2017-10-05    17:31:08    0:0:0:0:0:0:0:1 -   0:0:0:0:0:0:0:18080 POST    /AGDISTIS   -   200 111 80  31  http://localhost:8080   curl/7.54.0 -

lguillou commented 7 years ago

@RicardoUsbeck Thank you for logging this bug.

I've experimented a little more. I set the following properties in agdistis.properties to point to Chinese DBpedia:

nodeType=http://zh.dbpedia.org/resource/
edgeType=http://zh.dbpedia.org/ontology/
baseURI=http://zh.dbpedia.org

If I run the example from the Wiki:

curl --data-urlencode "text='The shanghai in 北京市.'" -d type='agdistis' http://localhost:8080/AGDISTIS

the webservice returns the expected results:

[{"disambiguatedURL":"http:\/\/zh.dbpedia.org\/resource\/Shanghai","offset":8,"namedEntity":"shanghai","start":5},{"disambiguatedURL":"http:\/\/zh.dbpedia.org\/resource\/北京市","offset":3,"namedEntity":"北京市","start":17}]

However, examples that contain only punctuation (??? or .) or no entity at all still trigger the error in the terminal window. Examples that contain an entity plus punctuation, e.g. 北京市., do not trigger the error. Rather than the presence of punctuation, could the problem be the absence of an entity-like string?

DiegoMoussallem commented 7 years ago

@RicardoUsbeck @lguillou This error comes from our stemming step (see https://github.com/dice-group/AGDISTIS/blob/master/src/main/java/org/aksw/agdistis/util/Stemming.java#L91). Our preprocessing is actually able to deal with punctuation. However, when AGDISTIS cannot find any candidates in the main search, it first looks for more using the surface-form search. If nothing is found again, it tries to stem the label, and besides stemming the label, this step also removes all punctuation.
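
A minimal sketch of one possible guard, assuming the failure is that a punctuation-only label becomes an empty string once punctuation is removed (which would be consistent with the `Cannot parse ''` message in the log above). The names `LabelGuard`, `stripPunctuation` and `toSearchTerm` are hypothetical and not part of AGDISTIS, and the regex is only an assumption about what "removes all punctuation" means, not the project's actual implementation.

```java
import java.util.Optional;

// Hypothetical helper, not part of AGDISTIS: shows how an empty search term
// could be detected before it is handed to an index query parser.
public class LabelGuard {

    // Assumption: "removing punctuation" means dropping every character in
    // the Unicode punctuation category (\p{P}), then trimming whitespace.
    public static String stripPunctuation(String label) {
        return label.replaceAll("\\p{P}+", "").trim();
    }

    // Returns the cleaned label, or Optional.empty() when nothing searchable
    // is left (null input, or a label made up only of punctuation).
    public static Optional<String> toSearchTerm(String label) {
        if (label == null) {
            return Optional.empty();
        }
        String cleaned = stripPunctuation(label);
        return cleaned.isEmpty() ? Optional.empty() : Optional.of(cleaned);
    }

    public static void main(String[] args) {
        String[] labels = {"???", ".", "北京市."};
        for (String label : labels) {
            String action = toSearchTerm(label)
                    .map(term -> "search index for '" + term + "'")
                    .orElse("skip index lookup");
            System.out.println("'" + label + "' -> " + action);
        }
    }
}
```

With a check like this, a caller could return no candidates for labels such as ??? or . instead of passing an empty string to the query parser, while labels such as 北京市. would still be searched.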

RicardoUsbeck commented 7 years ago

I guess we will write a unit test for it and then try to find the bug. Thanks for the additional test.