NativeTask is a native engine for Hadoop MapReduce (MR) tasks, written in C++. It focuses on task performance optimization and leaves scheduling and communication to the MR framework.
NativeTask can be used in two modes:
In the first mode, users only need to turn on an option and can then run their Java MapReduce jobs transparently. In the second mode, users need to write their MapReduce jobs in C/C++.
NativeTask feature list:
We found MapReduce slow for the following reasons:
NativeTask solves the above issues and is faster because:
The diagram shows NativeTask's performance improvement (native MapOutputCollector mode) over stock Hadoop.
In full native mode, NativeTask is a further 2x faster.
In MRv1, set mapreduce.map.output.collector.delegator.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator
in JobConf. For example, to run Pi with the native MapOutputCollector:
hadoop jar hadoop-examples.jar pi -D mapreduce.map.output.collector.delegator.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator 10 10
MRv2 supports a pluggable MapOutputCollector. Set mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator
in JobConf. The Pi example can then be run with the native MapOutputCollector as:
hadoop jar hadoop-mapreduce-examples.jar pi -D mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator 10 10
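The two property keys above differ only by MR version, so a launcher script or wrapper can choose between them. The following is a minimal sketch in plain Java, with no Hadoop dependency. The class name `NativeCollectorOption` and the `MrVersion` enum are hypothetical helpers introduced for illustration. Only the property keys and the delegator class name come from the text above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the -D option shown in the commands above.
final class NativeCollectorOption {
    enum MrVersion { MRV1, MRV2 }

    // Delegator class name, as given in this README.
    static final String DELEGATOR =
        "org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator";

    // Returns the property key for the given MR version, as given in this README.
    static String propertyKey(MrVersion version) {
        return version == MrVersion.MRV1
            ? "mapreduce.map.output.collector.delegator.class"
            : "mapreduce.job.map.output.collector.class";
    }

    // Returns the two command-line tokens: "-D" and "key=value".
    static List<String> toArgs(MrVersion version) {
        return Arrays.asList("-D", propertyKey(version) + "=" + DELEGATOR);
    }

    public static void main(String[] args) {
        // Print the option for each version so it can be pasted into a command.
        for (MrVersion v : MrVersion.values()) {
            System.out.println(v + ": " + String.join(" ", toArgs(v)));
        }
    }
}
```

The same key and value can also be set on the job configuration in code instead of on the command line. The sketch only assembles strings, so that the key for each version is kept in one place.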
In both MRv1 and MRv2, check the task log. If it contains the line
INFO org.apache.hadoop.mapred.nativetask.NativeMapOutputCollectorDelegator: Native output collector can be successfully enabled!
then NativeTask has been successfully enabled.
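The log check can be automated. The following is a minimal sketch that scans a task log for the message quoted above. The class name `NativeTaskLogCheck` is hypothetical. The log file path is passed as an argument, because where task logs are stored depends on the cluster setup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: looks for the NativeTask success message in a task log.
final class NativeTaskLogCheck {
    // Message text copied from the log line quoted in this README.
    static final String SUCCESS_MARKER =
        "NativeMapOutputCollectorDelegator: Native output collector can be successfully enabled!";

    // Returns true if any line contains the success message.
    static boolean isNativeCollectorEnabled(Iterable<String> logLines) {
        for (String line : logLines) {
            if (line.contains(SUCCESS_MARKER)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java NativeTaskLogCheck <task-log-file>");
            System.exit(2);
        }
        // Reads the whole log into memory; fine for a sketch, not for very large logs.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        System.out.println(isNativeCollectorEnabled(lines)
            ? "NativeTask enabled"
            : "NativeTask success message not found");
    }
}
```

If the message is missing, that only shows this log does not contain the line. It does not explain why the native collector was not enabled.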
MAPREDUCE-2841 discusses some initial experiments in "task level native optimization". Our implementation comes with far more advanced features (e.g. support for more key types, Java combiner support) and has been used and verified in production environments.