Rewrites the mapreduce section from the ground up.
Rationale
Over the years the code got pretty garbled and hard to maintain, especially when we started supporting Hadoop 2 and the (then) new hadoop.mapreduce, and when we added Python 3 support.
Several features we added saw practically zero use, despite requiring maintenance just like everything else. Even worse, sometimes the code had to get more complicated just to enable these features, at the expense of core functionality.
Specifically, the sercore extension got rather bloated and was plagued by hard-to-squish bugs like #319.
Actions
Highlights only, see the code for further details :)
The sercore extension has been completely rewritten and is now much simpler. It consists of two main objects, a reader and a writer, that know how to serialize and deserialize the Java types used within the pipes protocol (plus a top-level function to deserialize FileSplits); everything else is handled in Python code. The new sercore handles the bytes vs. string dichotomy explicitly, so upper levels don't need to figure out whether encoding/decoding is needed. We are now using C++11 smart pointers, so it should be much harder to run into issues like #319.
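To make the above concrete: the pipes wire format is based on Hadoop's writable serialization, where, for instance, a byte string travels as a variable-length integer (VInt) length followed by the raw bytes. Here is a minimal pure-Python sketch of the VInt encoding (the actual reader and writer implement this and the other protocol types in C++):

```python
def write_vint(i, out):
    # Hadoop's variable-length integer encoding (WritableUtils.writeVLong):
    # small values fit in a single byte, larger ones get a prefix byte
    # that encodes sign and byte count.
    if -112 <= i <= 127:
        out.append(i & 0xFF)
        return
    length = -112
    if i < 0:
        i = ~i  # one's complement; the sign goes into the prefix byte
        length = -120
    tmp = i
    while tmp:
        tmp >>= 8
        length -= 1
    out.append(length & 0xFF)
    nbytes = -(length + 120) if length < -120 else -(length + 112)
    for idx in range(nbytes, 0, -1):
        out.append((i >> ((idx - 1) * 8)) & 0xFF)


def write_bytes(b, out):
    # pipes serializes a byte string as VInt length + raw bytes
    write_vint(len(b), out)
    out.extend(b)


buf = bytearray()
write_bytes(b"hello", buf)
assert bytes(buf) == b"\x05hello"
```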
The simulator has been dropped. This is the most notable of the aforementioned seldom-used features. Due to significant changes in how/where things are handled in mapreduce.pipes, if we want to resurrect this in the future, it's probably best to rewrite it from scratch anyway.
The text-based protocol has been dropped, also due to lack of usage.
The context_class arg to run_task has been dropped. Its only realistic use case was enabling Avro I/O, but now that's automatic (triggered by the relevant keys in the job conf), i.e., the user only needs to request Avro I/O via the submitter.
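For illustration, here's roughly what a program consuming Avro input looks like now, assuming Avro value input was requested at submission time; the record fields ("name", "count") are made up for this sketch:

```python
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes


class Mapper(api.Mapper):

    def map(self, context):
        # with Avro value input, context.value arrives as the record
        # already deserialized into a Python dict; no context_class needed
        record = context.value
        context.emit(record["name"], record["count"])


def __main__():
    pipes.run_task(pipes.Factory(Mapper))
```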
The new int_test/mapred_submitter section provides a wide array of integration tests that check many of the possible I/O scenarios while using the official mapred.pipes submitter. We should probably expand the examples/pydoop_submit checks (which use our submitter instead) in the same way (and move them to int_test, merging with int_test/progress).
OpaqueSplit has been cleaned up and integrated into the pipes framework, meaning it's now auto-deserialized (if the relevant key is found in the job conf) and made accessible via context.input_split.payload.
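As a sketch of the reader side, assuming the job stored a (start, stop) pair as the payload (the range semantics here are invented for the example):

```python
import pydoop.mapreduce.api as api


class Reader(api.RecordReader):

    def __init__(self, context):
        super(Reader, self).__init__(context)
        # the payload comes back as whatever Python object was stored
        # in the OpaqueSplit; a (start, stop) pair is assumed here
        self.start, self.stop = context.input_split.payload
        self.current = self.start

    def next(self):
        # emit one (key, value) record per integer in [start, stop)
        if self.current >= self.stop:
            raise StopIteration
        key = self.current
        self.current += 1
        return key, ""

    def get_progress(self):
        return float(self.current - self.start) / (self.stop - self.start)
```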
Performance
I haven't performed a proper performance comparison between the old and the new implementation, but I have the following numbers from a couple of quick runs on (Docker on) my laptop. They show the time, in seconds, it takes to run word count on 100 MB of data (two mappers, two reducers, "mapreduce.task.io.sort.mb" set to 10) in several scenarios (see int_test/mapred_submitter/run_perf.sh). Figures are rounded to the nearest ten seconds, since finer resolution is not meaningful here.
scenario             old     new
combiner (*)         70/110  30
java_rw              100     90
python_partitioner   120     110
python_reader        100     90
python_writer        100     90
raw_io               70      70
(*) The two old figures are with and without fast_combiner, respectively; the distinction does not apply to the new version.
Everything looks either in line with the old implementation or slightly better (much better when using the combiner).
Fixes #253. Fixes #268. Fixes #269. Fixes #319.