Rewrites the mapreduce section from the ground up.
Rationale
Over the years the code got pretty garbled and hard to maintain, especially when we started supporting Hadoop 2 and the (then) new hadoop.mapreduce, and when we added Python 3 support.
Several features we added saw practically zero use, despite requiring maintenance just like everything else. Even worse, sometimes the code had to get more complicated just to enable these features, at the expense of core functionality.
Specifically, the sercore extension got rather bloated and was plagued by hard-to-squish bugs like #319.
Actions
Highlights only, see the code for further details :)
The sercore extension has been completely rewritten and is now much simpler. It consists of two main objects, a reader and a writer, that know how to serialize and deserialize the Java types used within the pipes protocol (plus a top-level function to deserialize FileSplits); everything else is handled in Python code. The new sercore handles the bytes vs. string dichotomy explicitly, so upper levels don't need to figure out whether encoding/decoding is needed. We are now using C++11 smart pointers, so it should be much harder to run into issues like #319.
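To make the above concrete: the pipes wire format is based on Hadoop's writable serialization, where, for instance, a byte string travels as a variable-length integer (VInt) length followed by the raw bytes. Here is a minimal pure-Python sketch of the VInt encoding (the actual reader and writer implement this and the other protocol types in C++):

```python
def write_vint(i, out):
    # Hadoop's variable-length integer encoding (WritableUtils.writeVLong):
    # small values fit in a single byte, larger ones get a prefix byte
    # that encodes sign and byte count.
    if -112 <= i <= 127:
        out.append(i & 0xFF)
        return
    length = -112
    if i < 0:
        i = ~i  # one's complement; the sign goes into the prefix byte
        length = -120
    tmp = i
    while tmp:
        tmp >>= 8
        length -= 1
    out.append(length & 0xFF)
    nbytes = -(length + 120) if length < -120 else -(length + 112)
    for idx in range(nbytes, 0, -1):
        out.append((i >> ((idx - 1) * 8)) & 0xFF)


def write_bytes(b, out):
    # pipes serializes a byte string as VInt length + raw bytes
    write_vint(len(b), out)
    out.extend(b)


buf = bytearray()
write_bytes(b"hello", buf)
assert bytes(buf) == b"\x05hello"
```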
The simulator has been dropped. This is the most notable of the aforementioned seldom-used features. Due to significant changes in how/where things are handled in mapreduce.pipes, if we want to resurrect this in the future, it's probably best to rewrite it from scratch anyway.
The text-based protocol has been dropped, also due to lack of usage.
The context_class arg to run_task has been dropped. Its only realistic use case was enabling Avro I/O, but now that's automatic (triggered by the relevant keys in the job conf), i.e., the user only needs to request Avro I/O via the submitter.
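For illustration, here's roughly what a program consuming Avro input looks like now, assuming Avro value input was requested at submission time; the record fields ("name", "count") are made up for this sketch:

```python
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes


class Mapper(api.Mapper):

    def map(self, context):
        # with Avro value input, context.value arrives as the record
        # already deserialized into a Python dict; no context_class needed
        record = context.value
        context.emit(record["name"], record["count"])


def __main__():
    pipes.run_task(pipes.Factory(Mapper))
```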
The new int_test/mapred_submitter section provides a wide array of integration tests that check many of the possible I/O scenarios while using the official mapred.pipes submitter. We should probably expand the examples/pydoop_submit checks (which use our submitter instead) in the same way (and move them to int_test, merging with int_test/progress).
OpaqueSplit has been cleaned up and integrated into the pipes framework, meaning it's now auto-deserialized (if the relevant key is found in the job conf) and made accessible via context.input_split.payload.
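As a sketch of the reader side, assuming the job stored a (start, stop) pair as the payload (the range semantics here are invented for the example):

```python
import pydoop.mapreduce.api as api


class Reader(api.RecordReader):

    def __init__(self, context):
        super(Reader, self).__init__(context)
        # the payload comes back as whatever Python object was stored
        # in the OpaqueSplit; a (start, stop) pair is assumed here
        self.start, self.stop = context.input_split.payload
        self.current = self.start

    def next(self):
        # emit one (key, value) record per integer in [start, stop)
        if self.current >= self.stop:
            raise StopIteration
        key = self.current
        self.current += 1
        return key, ""

    def get_progress(self):
        return float(self.current - self.start) / (self.stop - self.start)
```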
Performance
I haven't performed a proper performance comparison between the old and the new implementation, but I have the following numbers from a couple of quick runs on (Docker on) my laptop. They show the time, in seconds, it takes to run word count on 100 MB of data (two mappers, two reducers, "mapreduce.task.io.sort.mb" set to 10) in several scenarios (see int_test/mapred_submitter/run_perf.sh). Figures are rounded to the nearest ten seconds, since finer resolution is not meaningful here.
scenario             old     new
combiner (*)         70/110  30
java_rw              100     90
python_partitioner   120     110
python_reader        100     90
python_writer        100     90
raw_io               70      70
(*) The two old figures are with and without fast_combiner, respectively; the distinction does not apply to the new version.
Everything looks either in line with the old implementation or slightly better (much better when using the combiner).
Fixes #253. Fixes #268. Fixes #269. Fixes #319.