dnmilne / wikipediaminer

An open source toolkit for mining Wikipedia

Dump extractor failing on simple english #27

Closed: rom1504 closed this issue 9 years ago

rom1504 commented 9 years ago

I'm trying to follow https://github.com/dnmilne/wikipediaminer/wiki/Obtaining-wikipedia-data with the Simple English dump, and I'm getting these errors:

15/04/17 16:50:09 ERROR extraction.PageStep$Step1Mapper: Caught exception
java.lang.NullPointerException
        at org.wikipedia.miner.extraction.PageStep$Step1Mapper.map(Unknown Source)
        at org.wikipedia.miner.extraction.PageStep$Step1Mapper.map(Unknown Source)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I'm using Wikipedia Miner 1.2 and Hadoop 2.6 (with Java 7), with this command line:
hadoop jar wikipedia-miner-hadoop.jar org.wikipedia.miner.extraction.DumpExtractor input/simplewiki-latest-pages-articles.xml input/languages.xml simple input/en-sent.bin output

Is there any particular reason why I'm getting this error?

Neuw84 commented 9 years ago

Have you tested your Hadoop installation? It has been a while since I built the database from newer dumps, but I always had problems with Hadoop.

rom1504 commented 9 years ago

Yes, the first example (Standalone Operation) on http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html works.

rom1504 commented 9 years ago

Should I use Hadoop 2.6 or 1.2?

Neuw84 commented 9 years ago

So you are using Hadoop 2.6, while Wikipedia Miner uses version 1.2. That may be the problem. I followed this guide to set up Hadoop on my Linux machine, and it worked well the last time I tried: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

rom1504 commented 9 years ago

Using Hadoop 1.2 seems to fix that problem, thanks!
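
For anyone who hits the same NullPointerException, a quick version check before launching DumpExtractor might save some time. The sketch below assumes that the first line printed by `hadoop version` looks like `Hadoop 1.2.1`. The function name `is_hadoop_1_2` is a hypothetical helper for illustration, not part of Wikipedia Miner or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of Wikipedia Miner): succeeds when the
# given version banner line reports a Hadoop 1.2 or 1.2.x release.
is_hadoop_1_2() {
    case "$1" in
        "Hadoop 1.2"|"Hadoop 1.2."*) return 0 ;;
        *) return 1 ;;
    esac
}

# Query the local installation only if the hadoop command is on PATH.
if command -v hadoop >/dev/null 2>&1; then
    # Assumption: the first line of `hadoop version` is "Hadoop <version>".
    banner=$(hadoop version 2>/dev/null | head -n 1)
    if is_hadoop_1_2 "$banner"; then
        echo "OK: $banner"
    else
        echo "Warning: found '$banner'; this thread suggests Hadoop 1.2 is needed" >&2
    fi
fi
```

If the warning is printed, switching to a Hadoop 1.2 installation, as was done in this thread, is the first thing to try before digging into the stack trace.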