dnmilne / wikipediaminer

An open source toolkit for mining Wikipedia

exception from DumpExtractor with Russian or Dutch articles dump #30

Open expert-fb opened 8 years ago

expert-fb commented 8 years ago

My environment:

$ lsb_release -a
LSB Version:    core-9.20160110ubuntu0.2-amd64:core-9.20160110ubuntu0.2-noarch:security-9.20160110ubuntu0.2-amd64:security-9.20160110ubuntu0.2-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.1 LTS
Release:        16.04
Codename:       xenial

$ hadoop version
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
This command was run using /usr/local/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar


I followed the recommended steps to generate a database from a Wikipedia articles dump, but extracting a links database from the Russian Wikipedia articles dump fails with a Java exception during the page step. The invocation I used is sketched below.
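For reference, my command looked roughly like the following. I am reconstructing the argument order from the wiki instructions as I remember them, so treat it as approximate; the relative input/output paths match the ones that appear in the log, and <sentence-model> is a placeholder for the OpenNLP sentence-detection model I supplied:

$ hadoop jar wikipedia-miner-hadoop.jar org.wikipedia.miner.extraction.DumpExtractor \
      input/ruwiki-latest-pages-articles.xml input/languages.xml ru <sentence-model> output

The run then produces the following output, ending in the exception: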

Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/local/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
16/11/06 21:12:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/06 21:12:28 INFO extraction.DumpExtractor: Extracting site info
16/11/06 21:12:28 INFO extraction.DumpExtractor: Starting page step
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator
16/11/06 21:12:28 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/11/06 21:12:28 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/11/06 21:12:28 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
16/11/06 21:12:28 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/11/06 21:12:28 INFO mapred.FileInputFormat: Total input paths to process : 1
16/11/06 21:12:28 INFO mapreduce.JobSubmitter: number of splits:519
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/11/06 21:12:28 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
16/11/06 21:12:28 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
16/11/06 21:12:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1121545400_0001
16/11/06 21:12:28 WARN conf.Configuration: file:/tmp/hadoop-beil/mapred/staging/beil1121545400/.staging/job_local1121545400_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
16/11/06 21:12:28 WARN conf.Configuration: file:/tmp/hadoop-beil/mapred/staging/beil1121545400/.staging/job_local1121545400_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
16/11/06 21:12:29 INFO mapred.LocalDistributedCacheManager: Localized file:/playground/franz/projects/WPS/output/final/siteInfo.xml as file:/tmp/hadoop-beil/mapred/local/1478463148986/siteInfo.xml
16/11/06 21:12:29 INFO mapred.LocalDistributedCacheManager: Localized file:/playground/franz/projects/WPS/input/languages.xml as file:/tmp/hadoop-beil/mapred/local/1478463148987/languages.xml
16/11/06 21:12:29 INFO Configuration.deprecation: mapred.cache.localFiles is deprecated. Instead, use mapreduce.job.cache.local.files
16/11/06 21:12:29 WARN conf.Configuration: file:/tmp/hadoop-beil/mapred/local/localRunner/beil/job_local1121545400_0001/job_local1121545400_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
16/11/06 21:12:29 WARN conf.Configuration: file:/tmp/hadoop-beil/mapred/local/localRunner/beil/job_local1121545400_0001/job_local1121545400_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
16/11/06 21:12:29 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/11/06 21:12:29 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/11/06 21:12:29 INFO mapreduce.Job: Running job: job_local1121545400_0001
16/11/06 21:12:29 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
16/11/06 21:12:29 INFO mapred.LocalJobRunner: Waiting for map tasks
16/11/06 21:12:29 INFO mapred.LocalJobRunner: Starting task: attempt_local1121545400_0001_m_000000_0
16/11/06 21:12:29 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/11/06 21:12:29 INFO mapred.MapTask: Processing split: file:/playground/franz/projects/WPS/input/ruwiki-latest-pages-articles.xml:0+33554432
16/11/06 21:12:29 INFO mapred.MapTask: numReduceTasks: 1
16/11/06 21:12:29 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/11/06 21:12:29 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/11/06 21:12:29 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/11/06 21:12:29 INFO mapred.MapTask: soft limit at 83886080
16/11/06 21:12:29 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/11/06 21:12:29 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/11/06 21:12:29 ERROR extraction.PageStep$Step1Mapper: Could not configure mapper
java.io.FileNotFoundException: file:/tmp/hadoop-beil/mapred/local/1478463148986/siteInfo.xml (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileInputStream.<init>(FileInputStream.java:93)
    at java.io.FileReader.<init>(FileReader.java:58)
    at org.wikipedia.miner.extraction.SiteInfo.<init>(SiteInfo.java:30)
    at org.wikipedia.miner.extraction.PageStep$Step1Mapper.configure(PageStep.java:132)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:425)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/11/06 21:12:29 ERROR extraction.PageStep$Step1Mapper: Caught exception
java.lang.NullPointerException
    at org.wikipedia.miner.extraction.PageStep$Step1Mapper.map(PageStep.java:168)
    at org.wikipedia.miner.extraction.PageStep$Step1Mapper.map(PageStep.java:109)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
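My (possibly wrong) reading of the trace: LocalDistributedCacheManager reports siteInfo.xml as localized, but configure() then fails to open it, so the SiteInfo field presumably stays null and map() dies with the NullPointerException at PageStep.java:168. Note that the path in the FileNotFoundException still carries the file: scheme prefix, which java.io cannot open as a plain local path, so perhaps the scheme just needs stripping. Below is a minimal sketch of the lookup I would try, assuming the old-API DistributedCache.getLocalCacheFiles() call is usable in this context; SiteInfoLocator and findSiteInfo are names I made up, not part of the project:

    import java.io.File;
    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Hypothetical helper (not wikipediaminer code): resolve the localized
    // siteInfo.xml through the DistributedCache API instead of trusting a
    // symlinked path, and fail fast in configure() rather than NPE in map().
    public class SiteInfoLocator {

        public static File findSiteInfo(JobConf job) throws IOException {
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            if (cached != null) {
                for (Path p : cached) {
                    if ("siteInfo.xml".equals(p.getName())) {
                        // toUri().getPath() drops the "file:" scheme prefix
                        // seen in the FileNotFoundException above.
                        File f = new File(p.toUri().getPath());
                        if (f.exists()) {
                            return f;
                        }
                    }
                }
            }
            // Surface the problem immediately instead of leaving siteInfo null,
            // which appears to be what triggers the later NullPointerException.
            throw new IOException("siteInfo.xml not found in the distributed cache");
        }
    }

I have not tested this against the wikipediaminer source, so treat it as a guess rather than a fix.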

It seems someone ran into a similar problem with the Dutch Wikipedia articles dump: http://pastebin.com/uhpXwnTi

I'd very much appreciate it if someone could suggest a workaround.