RicardoUsbeck closed this issue 6 years ago
Done. Please check my folder on the Hobbit server.
However, the current index contains some bugs.
The bugs were caused by the server not accepting the complete transfer of the index files the first time. Please try again: http://hobbitdata.informatik.uni-leipzig.de/agdistis/wikidata/index_wikidata_en.zip
Redo the index according to Andriy:
the dumps must be gathered from here:
For instance, the latest one is:
https://dumps.wikimedia.org/wikidatawiki/entities/20171120/wikidata-20171120-all-BETA.ttl.gz
We previously got the dump from http://tools.wmflabs.org/wikidata-exports/rdf/exports/20160801/dump_download.html
Although that is the official link on the Wikidata website, it does not reflect the online version.
Will you redo it?
ofc. ;)
Is the new index ready for testing?
Not yet.
I have tried to create the Wikidata index, but the dump file is really big, more than 200 GB. I tried splitting it into smaller parts, but that did not work. I also parsed the big file as a whole, but Lucene seems to have a limit on the number of documents. Here is the error:
01:44:41,737 ERROR [org.aksw.agdistis.util.TripleIndexCreator] 133 - <Error while creating TripleIndex.>
java.lang.IllegalArgumentException: Too many documents, composite IndexReaders cannot exceed 2147483647
at org.apache.lucene.index.BaseCompositeReader.<init>(BaseCompositeReader.java:77)
at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:368)
at org.apache.lucene.index.StandardDirectoryReader.<init>(StandardDirectoryReader.java:42)
at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:71)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:812)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
at org.aksw.agdistis.util.TripleIndexCreator.createIndex(TripleIndexCreator.java:131)
at org.aksw.agdistis.util.TripleIndexCreator.main(TripleIndexCreator.java:98)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
at java.lang.Thread.run(Thread.java:748)
Any idea how to handle it? Any suggestions? @yamalight @RicardoUsbeck
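One possible direction, sketched here only as an illustration: the limit in the error message, 2147483647, is `Integer.MAX_VALUE`, so a single reader cannot cover more documents than that. Assuming the index stores roughly one document per triple (an assumption about `TripleIndexCreator`, not confirmed in this thread), the triples could be routed into several smaller indexes by subject. The class name `ShardPlanner`, its methods, and the triple count below are hypothetical.

```java
// Hypothetical helper, not part of AGDISTIS: plans how many separate
// indexes ("shards") are needed and which shard a subject belongs to.
final class ShardPlanner {
    // Limit quoted in the error message above (Integer.MAX_VALUE).
    static final long MAX_DOCS_PER_READER = Integer.MAX_VALUE;

    /** Smallest shard count for which a perfectly even split fits under the limit. */
    static int minShards(long totalDocs) {
        if (totalDocs <= 0) {
            return 1;
        }
        // Ceiling division without floating point.
        return (int) ((totalDocs + MAX_DOCS_PER_READER - 1) / MAX_DOCS_PER_READER);
    }

    /** Deterministically maps a subject URI to a shard in [0, numShards). */
    static int shardFor(String subjectUri, int numShards) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(subjectUri.hashCode(), numShards);
    }

    public static void main(String[] args) {
        long estimatedTriples = 5_000_000_000L; // made-up figure for illustration
        int shards = minShards(estimatedTriples);
        System.out.println("shards needed: " + shards);
        System.out.println("Q42 goes to shard "
                + shardFor("http://www.wikidata.org/entity/Q42", shards));
    }
}
```

Hash routing is not perfectly even, so in practice one would pick more shards than the minimum. Each shard would also have to be opened and searched on its own, since combining them in one composite reader would presumably run into the same limit.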
I get the following error with the Wikidata dump. The dump seems to contain invalid URLs such as /ww.nasa.gov/centers/kennedy/about/history/mercury7.html
org.openrdf.rio.RDFParseException: Not a valid (absolute) URI: /ww.nasa.gov/centers/kennedy/about/history/mercury7.html [line 24317426]
at org.openrdf.rio.helpers.RDFParserBase.reportFatalError(RDFParserBase.java:623)
at org.openrdf.rio.turtle.TurtleParser.reportFatalError(TurtleParser.java:1114)
at org.openrdf.rio.helpers.RDFParserBase.createURI(RDFParserBase.java:341)
at org.openrdf.rio.helpers.RDFParserBase.resolveURI(RDFParserBase.java:328)
at org.openrdf.rio.turtle.TurtleParser.parseURI(TurtleParser.java:855)
at org.openrdf.rio.turtle.TurtleParser.parseValue(TurtleParser.java:525)
at org.openrdf.rio.turtle.TurtleParser.parseObject(TurtleParser.java:413)
at org.openrdf.rio.turtle.TurtleParser.parseObjectList(TurtleParser.java:339)
at org.openrdf.rio.turtle.TurtleParser.parsePredicateObjectList(TurtleParser.java:332)
at org.openrdf.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:301)
at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:208)
at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:186)
at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:131)
at org.aksw.agdistis.util.TripleIndexCreator.indexTTLFile(TripleIndexCreator.java:154)
at org.aksw.agdistis.util.TripleIndexCreator.createIndex(TripleIndexCreator.java:130)
at org.aksw.agdistis.util.TripleIndexCreator.main(TripleIndexCreator.java:103)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Not a valid (absolute) URI: /ww.nasa.gov/centers/kennedy/about/history/mercury7.html
at org.openrdf.model.impl.URIImpl.setURIString(URIImpl.java:68)
at org.openrdf.model.impl.URIImpl.<init>(URIImpl.java:57)
at org.openrdf.model.impl.ValueFactoryImpl.createURI(ValueFactoryImpl.java:38)
at org.openrdf.rio.helpers.RDFParserBase.createURI(RDFParserBase.java:338)
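To locate such entries before a long indexing run, a rough pre-check along these lines might help. It is only a sketch using the JDK's `java.net.URI`, not the Sesame parser, so its notion of "absolute" may differ slightly from the one behind the exception above. The class `IriCheck` and the sample line are hypothetical, and the regex simply looks for `<...>` tokens, so it can also flag angle brackets that occur inside string literals.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of AGDISTIS or Sesame.
final class IriCheck {
    // Matches <...> tokens that contain no whitespace or nested brackets.
    private static final Pattern IRI_REF = Pattern.compile("<([^<>\\s]*)>");

    /** True if the string parses as a URI and has a scheme (e.g. "http:"). */
    static boolean isAbsolute(String iri) {
        try {
            return new URI(iri).isAbsolute();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Returns every <...> token on the line that is not an absolute URI. */
    static List<String> suspiciousIris(String line) {
        List<String> bad = new ArrayList<>();
        Matcher m = IRI_REF.matcher(line);
        while (m.find()) {
            if (!isAbsolute(m.group(1))) {
                bad.add(m.group(1));
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Made-up Turtle line that reuses the broken URI from the error above.
        String line = "<http://example.org/s> <http://example.org/p> "
                + "</ww.nasa.gov/centers/kennedy/about/history/mercury7.html> .";
        System.out.println(suspiciousIris(line));
    }
}
```

Running this over the dump line by line with a line counter would point at positions such as line 24317426 without waiting for the parser to abort. Whether to drop or repair the affected statements is a separate decision.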
Yes, @vdanielupb, the dump contains some problems. We are working on it. You can also download a newer version of the dump from https://dumps.wikimedia.org/wikidatawiki/entities/.
Fixed. @RicardoUsbeck and @vdanielupb, see your email.
@mejohnee Can you check the new index?
May I close this issue?
Create an index for disambiguation to wikidata en