dice-group / AGDISTIS

AGDISTIS - Agnostic Named Entity Disambiguation
http://aksw.org/Projects/AGDISTIS.html
GNU Affero General Public License v3.0

Failing disambiguation with german index #75

redadmiral closed this issue 5 years ago

redadmiral commented 5 years ago

I've experienced some problems using the German index data.

First, eight unit tests fail when I run Maven with the clean package arguments:

  testSurfaceForm(TripleIndexTest)
  testRedirects(TripleIndexTest)
  testRdfsLabel(TripleIndexTest)
  testType(TripleIndexTest)
  testDisambiguation(TripleIndexTest)
  testdirectLink(TripleIndexTest)
  testUmlaute(AGDISTISTest)
  testMinimalExample(AGDISTISTest)

Second, if I start the web service without unit testing and query it with

curl --data-urlencode "text='Die <entity>Freie Universität Berlin</entity> in <entity>Newcastle</entity>.'" -d type=agdistis localhost:8080/AGDISTIS

AGDISTIS returns only identifiers with the notInWiki prefix:

[{"disambiguatedURL":"http:\/\/aksw.org\/notInWiki\/FreieUniversitätBerlin","offset":24,"namedEntity":"Freie Universität Berlin","start":5},{"disambiguatedURL":"http:\/\/aksw.org\/notInWiki\/Newcastle","offset":9,"namedEntity":"Newcastle","start":33}]

With the English index data everything works like a charm and entities are resolved to the correct DBpedia entries. The output of the failing unit tests can be found here.
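
A quick way to spot this failure mode in a script is to check each returned disambiguatedURL for the notInWiki prefix. The snippet below is only an illustrative sketch: the function name unresolved_entities is hypothetical, and the sample is modeled on the response quoted in this report.

```python
import json

# Prefix seen in the response above for entities AGDISTIS could not link.
NOT_IN_WIKI_PREFIX = "http://aksw.org/notInWiki/"


def unresolved_entities(response_text):
    """Return the surface forms whose URL carries the notInWiki prefix."""
    results = json.loads(response_text)
    return [
        item["namedEntity"]
        for item in results
        if item["disambiguatedURL"].startswith(NOT_IN_WIKI_PREFIX)
    ]


# Sample modeled on the response in this report; the service escapes "/" as "\/",
# which json.loads decodes back to a plain "/".
sample = r'''[
 {"disambiguatedURL":"http:\/\/aksw.org\/notInWiki\/FreieUniversitätBerlin","offset":24,"namedEntity":"Freie Universität Berlin","start":5},
 {"disambiguatedURL":"http:\/\/aksw.org\/notInWiki\/Newcastle","offset":9,"namedEntity":"Newcastle","start":33}
]'''

print(unresolved_entities(sample))
```

If every entity in a response shows up in this list, as it does here, the likely culprit is the configuration rather than the input text.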

DiegoMoussallem commented 5 years ago

Hi @redadmiral, could you paste your agdistis.properties file here?

Best

redadmiral commented 5 years ago

Hi @DiegoMoussallem, thanks for the quick answer!

#path to decompressed lucene 4.4 index

index=index/de

index_bycontext=index/de/context

#used to prune edges
nodeType=http://dbpedia.org/resource/
edgeType=http://dbpedia.org/ontology/
baseURI =http://dbpedia.org
#SPARQL endpoint to retrieve domain and range information
endpoint=http://dbpedia.org/sparql
#this is the trigram distance between words, default = 3
ngramDistance=3
#exploration depth of semantic disambiguation graph
maxDepth=2
#threshold for cutting of similar strings
threshholdTrigram=0.87
#heuristicExpansionOn explains whether simple coocurence resolution is done or not, e.g., Barack => Barack Obama if both are in the same text
heuristicExpansionOn=true
#list of entity domains and corporationAffixes
whiteList=/config/whiteList.txt
corporationAffixes=/config/corporationAffixes.txt

#Active popularity
popularity=false

#Choose an graph-based algorithm "hits" or "pagerank"
algorithm=hits

#Enable search by context
context=false

#Enable search by acronym
acronym=false

#Enable to find common entities
commonEntities=false

# IMPORTANT for creating an own index
folderWithTTLFiles=data/en
surfaceFormTSV=data/en/surface/en_surface_forms.tsv

DiegoMoussallem commented 5 years ago

Hi @redadmiral,

Please change the following settings in this file and see if it helps.

from:

nodeType=http://dbpedia.org/resource/
edgeType=http://dbpedia.org/ontology/
baseURI =http://dbpedia.org
#SPARQL endpoint to retrieve domain and range information
endpoint=http://dbpedia.org/sparql

to:

nodeType=http://de.dbpedia.org/resource/
edgeType=http://dbpedia.org/ontology/
baseURI =http://de.dbpedia.org
#SPARQL endpoint to retrieve domain and range information
endpoint=http://de.dbpedia.org/sparql

redadmiral commented 5 years ago

Thanks a lot, this did the trick! The unit tests are still failing, but the queries are now disambiguated correctly. Thanks again for the help!
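
The mismatch behind this issue, a German index combined with English DBpedia URIs, can be caught before starting the service. The following is a minimal sketch and not part of AGDISTIS: the helper names are hypothetical, and the rule that index/<lang> implies a <lang>.dbpedia.org host (with en mapping to dbpedia.org) is an assumption drawn from this thread only.

```python
from urllib.parse import urlparse

# Keys whose host should match the index language. edgeType is left out
# because it stays on dbpedia.org in the configuration suggested above.
LANGUAGE_KEYS = ("nodeType", "baseURI", "endpoint")


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def mismatched_keys(props):
    """Return the language-specific keys whose host does not fit the index."""
    lang = props["index"].rstrip("/").split("/")[-1]  # "index/de" -> "de"
    # Assumption: English lives on dbpedia.org, others on <lang>.dbpedia.org.
    expected = "dbpedia.org" if lang == "en" else lang + ".dbpedia.org"
    return [
        key
        for key in LANGUAGE_KEYS
        if key in props and urlparse(props[key]).hostname != expected
    ]


# The relevant lines of the configuration as originally posted.
original = """
index=index/de
nodeType=http://dbpedia.org/resource/
edgeType=http://dbpedia.org/ontology/
baseURI =http://dbpedia.org
endpoint=http://dbpedia.org/sparql
"""

print(mismatched_keys(parse_properties(original)))
```

Run against the originally posted settings, the check flags nodeType, baseURI and endpoint, which are exactly the three lines changed in the fix.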

DiegoMoussallem commented 5 years ago

Nice :)

The tests currently cover only English; we are considering creating tests for all languages.