fingltd / 4mc

4mc - splittable lz4 and zstd in hadoop/spark/flink

NullPointerException after reading first file #14

Closed mikcox closed 8 years ago

mikcox commented 8 years ago

Hello,

I'm trying to use hadoop-4mc with my AWS EMR cluster and am using a product called Hunk to interface with the cluster. Whenever I run a search job in Hunk, my results for the first file are returned fine, but at the end of reading the first file I get a NullPointerException with the stack trace shown below.

Any idea what might be causing this? Let me know if you need any additional information.

(This is with Hadoop 2.7.2-amzn-3 and hadoop-4mc-1.4.0.)

2016-08-13 15:38:38,547 FATAL [IPC Server handler 44 on 44360] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1470333745198_0062_m_000000_1 - exited : java.lang.NullPointerException
    at com.hadoop.compression.fourmc.Lz4Decompressor.reset(Lz4Decompressor.java:234)
    at org.apache.hadoop.io.compress.CodecPool.returnDecompressor(CodecPool.java:224)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.close(LineRecordReader.java:235)
    at com.splunk.mr.input.SplunkLineRecordReader.close(SplunkLineRecordReader.java:21)
    at com.splunk.mr.SplunkBaseMapper$RecReader.close(SplunkBaseMapper.java:246)
    at com.splunk.mr.SplunkBaseMapper.runImpl(SplunkBaseMapper.java:305)
    at com.splunk.mr.SplunkSearchMapper.runImpl(SplunkSearchMapper.java:419)
    at com.splunk.mr.SplunkBaseMapper.run(SplunkBaseMapper.java:164)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:796)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

carlomedas commented 8 years ago

Please try with FourMcTextInputFormat and not the Hadoop built-in one. You can find an example in the src/java/examples folder; a minimal driver along those lines is also sketched below.

I'm out on vacation, so if that doesn't fix it I'll be able to have a look at the code in a week or so.

Moreover, so far I've only tested with Hadoop up to 2.6.x, but I'm not sure that's relevant.
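For reference, here is a minimal driver sketch of that suggestion: a map-only job that reads .4mc files through FourMcTextInputFormat (package com.hadoop.mapreduce, matching the classes that appear later in this thread) instead of the built-in TextInputFormat. The class and job names are illustrative; adapt the paths and output types to your data.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.hadoop.mapreduce.FourMcTextInputFormat;

public class FourMcJobExample {

    // Identity-style mapper: keys are byte offsets, values are the decompressed lines.
    public static class LineMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "4mc input example");
        job.setJarByClass(FourMcJobExample.class);

        // The important part: use the 4mc-aware input format so splits land on
        // 4mc block boundaries and the 4mc decompressor is used for reading.
        job.setInputFormatClass(FourMcTextInputFormat.class);

        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a directory of *.4mc files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```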

mikcox commented 8 years ago

Ah, definitely.

The tool I'm using (Hunk) lets me specify the Hadoop record reader class I want to use and a regex for the file types that record reader should handle.

In order to use this, the record reader just needs a getName() method (which returns the name of the record reader) and an optional getFilePattern() method (which returns the regex pattern that the record reader will accept).

I've added a pull request dac2f3a which implements these methods on your RecordReader class and should give users more flexibility when using this record reader.
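For anyone following along, here is a rough sketch of what those two hooks might look like when added to the 4mc record reader. This is illustrative only, assuming Hunk simply calls these two methods by name; the method bodies and the .4mc pattern are my guesses, not necessarily the exact contents of the pull request.

```java
// Illustrative additions inside com.hadoop.mapreduce.FourMcLineRecordReader
// (return values and the pattern are assumptions for this sketch, not the PR contents).

/** Name Hunk displays and matches when selecting a record reader in its settings. */
public String getName() {
    return "4mc";
}

/** Optional: regex for the files this record reader should claim (here, *.4mc files). */
public String getFilePattern() {
    return "\\.4mc$";
}
```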

mikcox commented 8 years ago

Okay, so building a new jar with the above changes partially works, but it still fails when I run a MapReduce job. It looks like it's still trying to use the default org.apache.hadoop.util.LineReader:

exited : java.lang.NullPointerException
    at com.hadoop.compression.fourmc.FourMcInputStream.close(FourMcInputStream.java:341)
    at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
    at com.hadoop.mapreduce.FourMcLineRecordReader.close(FourMcLineRecordReader.java:106)
    at com.splunk.mr.input.SplunkRecordReaderWrapper.close(SplunkRecordReaderWrapper.java:58)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:532)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:800)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

I'm guessing that whatever I need to do to fix this is trivial, but I'm not familiar enough with the internals to have a good grip on what exactly is failing. My gut feeling is that I need to configure mapreduce.job.inputFormat.class, but I haven't yet found a place to do that which I have access to. Let me know if you have any thoughts.
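If it helps, Job.setInputFormatClass() is ultimately backed by a plain configuration property, so an environment that only exposes raw Hadoop settings can usually select the 4mc format that way. A small sketch, assuming the stock Hadoop 2.x key mapreduce.job.inputformat.class (all lowercase); verify the exact key against your distribution:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FourMcByProperty {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same effect as job.setInputFormatClass(FourMcTextInputFormat.class),
        // expressed as a raw key/value pair for tools where you cannot touch the driver.
        conf.set("mapreduce.job.inputformat.class",
                 "com.hadoop.mapreduce.FourMcTextInputFormat");
        Job job = Job.getInstance(conf, "4mc via configuration property");
        // ... mapper, input/output paths, etc. as in a normal driver ...
    }
}
```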

carlomedas commented 8 years ago

In the first try from your original post, it looks to me like you were trying to use the LZ4 compression codec without the 4mc format.

In the second case, following my advice, you tried the 4mc format and its related classes. In this second case, how did you compress the data?

Because of the JNI direct buffers, at close() the decompressor is forced to release its direct buffers and is then set to null. Here it looks like that is happening twice, but I'm not sure whether it's a bug in 4mc or a problem in how the Splunk wrapper handles the underlying record reader; if the wrapper calls close() twice, that would explain the issue. I'll have a look at Splunk/Hunk when I get a chance, but in the meantime you could try to debug it by adding, as the first statement of FourMcInputStream.close(): if (decompressor == null) return;

You could also add logs to check the workflow.
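For concreteness, here is a sketch of the suggested guard as it might sit at the top of FourMcInputStream.close(). Only the leading null check is the suggested change; the cleanup shown after it is a simplified illustration, not the actual body of the method.

```java
@Override
public void close() throws IOException {
    // Suggested guard: a second close() (e.g. from a wrapper that closes the
    // record reader twice) becomes a no-op instead of a NullPointerException.
    if (decompressor == null) {
        return;
    }
    try {
        super.close();          // close the underlying compressed stream
    } finally {
        decompressor.end();     // release JNI direct buffers (illustrative cleanup)
        decompressor = null;    // mark the stream closed
    }
}
```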

mikcox commented 8 years ago

In both cases, I was compressing the files with the included 4mc command-line tool (on a linux box).

After adding the decompressor == null check in FourMcInputStream.close(), I am no longer getting the error and everything seems to be working great! I can now run MapReduce jobs on .4mc compressed files.

I've added that change to the pull request that I have opened related to this issue. Let me know if that's something that makes sense to add to the core.

Cheers and thanks a ton for your help!

carlomedas commented 8 years ago

Perfect, I just accepted your pull request.

Side question: is Hunk paid software, or is there a trial so I can give it a try myself? No problem if it's only available as paid software; I was just curious to have a look.

mikcox commented 8 years ago

You can download and play with a free trial for 60 days: https://www.splunk.com/en_us/download/hunk.html (after the 60 days it becomes paid software). You should definitely be able to download it and take a look.

carlomedas commented 8 years ago

Thanks!