dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Weird error when running stats-extraction #370

Closed: jvican closed this issue 9 years ago

jvican commented 9 years ago

I was following this guide. After downloading the latest English Wikipedia dump, I got the following error.

WARNING: wrong redirect. page: [title=Marvin Pentz Gay Sr;ns=0/Main/;language:wiki=en,locale=en].
found by dbpedia:   [title=Marvin Gay, Sr.;ns=0/Main/;language:wiki=en,locale=en].
found by wikipedia: [title=Marvin Gay Sr.;ns=0/Main/;language:wiki=en,locale=en]
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at scala_maven_executions.MainHelper.runMain(MainHelper.java:164)
    at scala_maven_executions.MainWithArgsInFile.main(MainWithArgsInFile.java:26)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 16892167
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:874)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:933)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:228)
    at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:179)
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1762)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1435)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2807)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:558)
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.getElementText(XMLStreamReaderImpl.java:862)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readString(WikipediaDumpParser.java:395)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readRevision(WikipediaDumpParser.java:290)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readPage(WikipediaDumpParser.java:248)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readPages(WikipediaDumpParser.java:187)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.readDump(WikipediaDumpParser.java:145)
    at org.dbpedia.extraction.sources.WikipediaDumpParser.run(WikipediaDumpParser.java:116)
    at org.dbpedia.extraction.sources.XMLReaderSource.foreach(XMLSource.scala:112)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
    at org.dbpedia.extraction.sources.XMLReaderSource.flatMap(XMLSource.scala:108)
    at org.dbpedia.extraction.mappings.Redirects$.loadFromSource(Redirects.scala:171)
    at org.dbpedia.extraction.mappings.Redirects$.load(Redirects.scala:122)
    at org.dbpedia.extraction.dump.extract.ConfigLoader$$anon$1.<init>(ConfigLoader.scala:101)
    at org.dbpedia.extraction.dump.extract.ConfigLoader.org$dbpedia$extraction$dump$extract$ConfigLoader$$createExtractionJob(ConfigLoader.scala:53)
    at org.dbpedia.extraction.dump.extract.ConfigLoader$$anonfun$getExtractionJobs$1.apply(ConfigLoader.scala:40)
    at org.dbpedia.extraction.dump.extract.ConfigLoader$$anonfun$getExtractionJobs$1.apply(ConfigLoader.scala:40)
    at scala.collection.TraversableViewLike$Mapped$$anonfun$foreach$2.apply(TraversableViewLike.scala:169)
    at scala.collection.Iterator$class.foreach(Iterator.scala:743)
    at scala.collection.immutable.RedBlackTree$TreeIterator.foreach(RedBlackTree.scala:468)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:310)
    at scala.collection.TraversableViewLike$Mapped$class.foreach(TraversableViewLike.scala:168)
    at scala.collection.IterableViewLike$$anon$3.foreach(IterableViewLike.scala:113)
    at org.dbpedia.extraction.dump.extract.Extraction$.main(Extraction.scala:30)
    at org.dbpedia.extraction.dump.extract.Extraction.main(Extraction.scala)
    ... 6 more
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 20:34 min
[INFO] Finished at: 2015-03-29T21:43:12+02:00
[INFO] Final Memory: 11M/203M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:run (default-cli) on project dump: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 240 (Exit value: 240) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

This error occurred when executing:

../run stats-extraction extraction.stats.properties

And my extraction.stats.properties is:

# download and extraction target dir
base-dir=/Users/turing/extraction-framework/dump/

# Source file. If source file name ends with .gz or .bz2, it is unzipped on the fly.
# Must exist in the directory xxwiki/yyyymmdd and have the prefix xxwiki-yyyymmdd-
# where xx is the wiki code and yyyymmdd is the dump date.

# default:
source=pages-articles.xml.bz2

# alternatives:
# source=pages-articles.xml.gz
# source=pages-articles.xml

# use only directories that contain a 'download-complete' file? Default is false.
require-download-complete=true

# List of languages or article count ranges, e.g. 'en,de,fr' or '10000-20000' or '10000-', or '@mappings'
languages=en

# extractor class names starting with "." are prefixed by "org.dbpedia.extraction.mappings"

extractors=.RedirectExtractor,.InfoboxExtractor,.TemplateParameterExtractor

# Use IRIs and all local URIs (even en.dbpedia.org). Stats builder can't handle generic domains.
uri-policy.default=reject-long:*
format.ttl.gz=turtle-triples;uri-policy.default

# if ontology and mapping files are not given or do not exist, download info from mappings.dbpedia.org
ontology=../ontology.xml
mappings=../mappings

jimkont commented 9 years ago

Can you test the Wikipedia dump with bzip2 --test <dump-file>?

jvican commented 9 years ago

I ran it and, after a while, nothing appeared on the screen (bzip2 --test is silent on success). It looks like the dump is ok. What do you think? @jimkont

jimkont commented 9 years ago

The stack trace indicates a decompression error. Can you download the dump again and retry?
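The ArrayIndexOutOfBoundsException originates inside commons-compress's BZip2CompressorInputStream, which is consistent with a truncated or corrupted archive. Besides bzip2 --test, another way to check is to stream the whole file through a decompressor and see whether it reaches end-of-stream cleanly. A minimal sketch in Python using the stdlib bz2 module (the helper name and path handling are illustrative, not part of the extraction framework):

```python
import bz2


def bz2_is_intact(path: str, chunk_size: int = 1 << 20) -> bool:
    """Stream-decompress the whole file and discard the output.

    A corrupted stream raises OSError; a truncated one raises EOFError
    before the end-of-stream marker is reached.
    """
    try:
        with bz2.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError) as exc:
        print(f"decompression failed: {exc}")
        return False
```

A full pass over an English Wikipedia dump takes a while, but unlike a size or checksum comparison it exercises exactly the code path (block-by-block decompression) that the extractor failed on.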

jimkont commented 9 years ago

@jvican any progress on this?

jvican commented 9 years ago

Sorry @jimkont, I've been very busy and haven't been able to work on this. I hope to get to it this summer, and to read through the source code if I have time. But for now, please don't count on me.

jimkont commented 9 years ago

No worries, feel free to reopen this if the problem persists.