dice-group / AGDISTIS

AGDISTIS - Agnostic Named Entity Disambiguation
http://aksw.org/Projects/AGDISTIS.html
GNU Affero General Public License v3.0

Wikidata #51

Closed by RicardoUsbeck 6 years ago

RicardoUsbeck commented 6 years ago

Create an index for disambiguation against the English Wikidata.

DiegoMoussallem commented 6 years ago

Done. Please check my folder on the Hobbit server.

RicardoUsbeck commented 6 years ago

See https://github.com/dice-group/AGDISTIS/wiki/6-Using-Wikidata-as-KB

RicardoUsbeck commented 6 years ago

However, the current index contains some bugs.

DiegoMoussallem commented 6 years ago

The bugs occurred because the server did not accept the complete transfer of the index files the first time. Please try again: http://hobbitdata.informatik.uni-leipzig.de/agdistis/wikidata/index_wikidata_en.zip

DiegoMoussallem commented 6 years ago

The index needs to be redone, according to Andriy:

The dumps must be taken from here:

https://dumps.wikimedia.org/wikidatawiki/entities/

For instance, the latest one is:

https://dumps.wikimedia.org/wikidatawiki/entities/20171120/wikidata-20171120-all-BETA.ttl.gz

We previously got the dump from http://tools.wmflabs.org/wikidata-exports/rdf/exports/20160801/dump_download.html. Although that is the official link on the Wikidata website, it does not reflect the online version.
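
As a small sketch of how the dump location could be derived for a given date: the helper below rebuilds the URL above from the date string. The function name `dump_url` is hypothetical, and the file name pattern is only inferred from the single example in this thread, so other dump dates may use different file names. Check the directory listing before relying on it.

```python
# Directory that hosts the Wikidata entity dumps (taken from the comment above).
BASE = "https://dumps.wikimedia.org/wikidatawiki/entities/"


def dump_url(date: str) -> str:
    """Build the Turtle dump URL for a date given as YYYYMMDD.

    Assumption: the file name follows the pattern of the 20171120 example.
    """
    if len(date) != 8 or not (date.isascii() and date.isdigit()):
        raise ValueError("date must have the form YYYYMMDD")
    return f"{BASE}{date}/wikidata-{date}-all-BETA.ttl.gz"
```

Calling `dump_url("20171120")` reproduces the link given above.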

RicardoUsbeck commented 6 years ago

Will you redo it?

DiegoMoussallem commented 6 years ago

ofc. ;)

RicardoUsbeck commented 6 years ago

Is the new index ready for testing?

DiegoMoussallem commented 6 years ago

Not yet.

DiegoMoussallem commented 6 years ago

I have tried to create the Wikidata index. However, the file is really big, more than 200 GB. I tried splitting it into smaller parts, but that did not work. I also parsed the big file as a whole, but Lucene seems to have a limit on the number of documents. Look at the error:

01:44:41,737 ERROR [org.aksw.agdistis.util.TripleIndexCreator] 133 - <Error while creating TripleIndex.>
java.lang.IllegalArgumentException: Too many documents, composite IndexReaders cannot exceed 2147483647
    at org.apache.lucene.index.BaseCompositeReader.<init>(BaseCompositeReader.java:77)
    at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:368)
    at org.apache.lucene.index.StandardDirectoryReader.<init>(StandardDirectoryReader.java:42)
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:71)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:812)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
    at org.aksw.agdistis.util.TripleIndexCreator.createIndex(TripleIndexCreator.java:131)
    at org.aksw.agdistis.util.TripleIndexCreator.main(TripleIndexCreator.java:98)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
    at java.lang.Thread.run(Thread.java:748)

Any idea how to handle it? Any suggestions? @yamalight @RicardoUsbeck
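
One possible direction, sketched here and not taken from the AGDISTIS code base, is to spread the documents over several smaller indexes so that each one stays below the ceiling reported in the error (2147483647). The sketch assumes that every triple becomes one Lucene document and that all triples of a subject should land in the same index. The names `shards_needed` and `shard_for` are hypothetical.

```python
import zlib

# Document ceiling reported in the error message above (2**31 - 1).
LUCENE_MAX_DOCS = 2_147_483_647


def shards_needed(total_docs: int, max_docs: int = LUCENE_MAX_DOCS) -> int:
    """Smallest number of indexes for which an even split stays within max_docs."""
    if total_docs < 0 or max_docs <= 0:
        raise ValueError("total_docs must be >= 0 and max_docs must be > 0")
    return max(1, -(-total_docs // max_docs))  # ceiling division


def shard_for(subject: str, num_shards: int) -> int:
    """Stable shard number for a subject URI: the same subject always maps to the same shard."""
    return zlib.crc32(subject.encode("utf-8")) % num_shards
```

Hashing does not split the documents perfectly evenly, so in practice it would be safer to use more shards than `shards_needed` returns. At query time a lookup by subject only needs the shard given by `shard_for`, while a lookup by label or object would have to consult every shard.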

vdanielupb commented 6 years ago

I get the following error with the Wikidata dump. The dump seems to contain invalid URLs such as /ww.nasa.gov/centers/kennedy/about/history/mercury7.html

org.openrdf.rio.RDFParseException: Not a valid (absolute) URI: /ww.nasa.gov/centers/kennedy/about/history/mercury7.html [line 24317426]
        at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:623)
        at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1114)
        at org.openrdf.rio.helpers.RDFParserBase.createURI(RDFParserBase.java:341)
        at org.openrdf.rio.helpers.RDFParserBase.resolveURI(RDFParserBase.java:328)
        at org.openrdf.rio.turtle.TurtleParser.parseURI(TurtleParser.java:855)
        at org.openrdf.rio.turtle.TurtleParser.parseValue(TurtleParser.java:525)
        at org.openrdf.rio.turtle.TurtleParser.parseObject(TurtleParser.java:413)
        at org.openrdf.rio.turtle.TurtleParser.parseObjectList(TurtleParser.java:339)
        at org.openrdf.rio.turtle.TurtleParser.parsePredicateObjectList(TurtleParser.java:332)
        at org.openrdf.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:301)
        at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:208)
        at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:186)
        at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:131)
        at org.aksw.agdistis.util.TripleIndexCreator.indexTTLFile(TripleIndexCreator.java:154)
        at org.aksw.agdistis.util.TripleIndexCreator.createIndex(TripleIndexCreator.java:130)
        at org.aksw.agdistis.util.TripleIndexCreator.main(TripleIndexCreator.java:103)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Not a valid (absolute) URI: /ww.nasa.gov/centers/kennedy/about/history/mercury7.html
        at org.openrdf.model.impl.URIImpl.setURIString(URIImpl.java:68)
        at org.openrdf.model.impl.URIImpl.<init>(URIImpl.java:57)
        at org.openrdf.model.impl.ValueFactoryImpl.createURI(ValueFactoryImpl.java:38)
        at org.openrdf.rio.helpers.RDFParserBase.createURI(RDFParserBase.java:338)
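
To find such entries before the parser aborts on them, the dump could be scanned for IRI references that have no scheme. The following is a heuristic sketch, not part of AGDISTIS: the names `relative_iris` and `report` are hypothetical, the regular expression only approximates the Turtle IRI syntax, and a `<...>` sequence inside a string literal would be reported as well.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Anything that looks like a Turtle IRI reference: <...> without spaces or quotes.
IRI_REF = re.compile(r'<([^<>"\s]*)>')
# An absolute IRI starts with a scheme such as "http:".
SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.\-]*:')


def relative_iris(line: str) -> List[str]:
    """Return the IRI references on this line that lack a scheme."""
    return [iri for iri in IRI_REF.findall(line) if not SCHEME.match(iri)]


def report(lines: Iterable[str]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (line number, suspicious IRIs) for every line that contains any."""
    for number, line in enumerate(lines, start=1):
        bad = relative_iris(line)
        if bad:
            yield number, bad
```

The compressed dump can be streamed into `report` with `gzip.open(path, "rt", encoding="utf-8")`, so the file never has to be unpacked on disk. Because Turtle statements can span several lines, the reported lines should be repaired or their whole statement removed; deleting a single line may leave invalid syntax behind.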

DiegoMoussallem commented 6 years ago

Yes, @vdanielupb, the dump contains some problems. We are working on it. You can also download a newer version of the dump from https://dumps.wikimedia.org/wikidatawiki/entities/.

DiegoMoussallem commented 6 years ago

Fixed. @RicardoUsbeck and @vdanielupb, see your email.

RicardoUsbeck commented 6 years ago

@mejohnee Can you check the new index?

DiegoMoussallem commented 6 years ago

May I close this issue?