crs4 / pydoop

A Python MapReduce and HDFS API for Hadoop
Apache License 2.0

Pipes rewrite #328

Closed. simleo closed this 5 years ago.

simleo commented 5 years ago

Fixes #253. Fixes #268. Fixes #269. Fixes #319.

Rewrites the mapreduce code from the ground up.

Rationale

Actions

Highlights only, see the code for further details :)

Performance

I haven't performed a proper performance comparison between the old and the new implementation, but I do have some numbers from a couple of quick runs on (Docker on) my laptop. They show the time, in seconds, it takes to run word count on 100 MB of data (two mappers, two reducers, "mapreduce.task.io.sort.mb" set to 10) in several scenarios (see int_test/mapred_submitter/run_perf.sh). The tenths digit is always zero because it's not meaningful.

| Scenario | Old (s) | New (s) |
|---|---|---|
| combiner (*) | 70/110 | 30 |
| java_rw | 100 | 90 |
| python_partitioner | 120 | 110 |
| python_reader | 100 | 90 |
| python_writer | 100 | 90 |
| raw_io | 70 | 70 |

(*) The two old figures are with and without fast_combiner, respectively; the distinction does not apply to the new version.

Everything looks either in line with the old implementation or slightly better (much better when using the combiner).
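For reference, this is roughly the kind of word count application the benchmark scenarios exercise. It is only a minimal sketch against the Pydoop MapReduce API (pydoop.mapreduce.api / pydoop.mapreduce.pipes); the actual benchmark code lives in int_test/mapred_submitter and may differ.

```python
"""Minimal word count sketch (illustration only, not the benchmark code)."""
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pipes


class Mapper(api.Mapper):

    def map(self, context):
        # context.value is a line of input text
        for word in context.value.split():
            context.emit(word, 1)


class Reducer(api.Reducer):

    def reduce(self, context):
        # sum all counts emitted for this key
        context.emit(context.key, sum(context.values))


def __main__():
    # The reducer can double as the combiner here because it emits
    # (word, count) pairs, i.e., the same types the mapper emits.
    pipes.run_task(pipes.Factory(
        Mapper, reducer_class=Reducer, combiner_class=Reducer
    ))
```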

simleo commented 5 years ago

@elzaggo yes. More generally, using mutable types with a combiner is asking for trouble. Documented in 149aa06.
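To illustrate the pitfall (a hypothetical sketch, not code from this PR): when a combiner is configured, emitted pairs may be buffered in memory until the combine step runs, so a mapper that emits an object and then keeps mutating it will corrupt every buffered pair that still references that object.

```python
import pydoop.mapreduce.api as api


class BadMapper(api.Mapper):
    """Hypothetical illustration: do NOT emit a shared mutable value."""

    def __init__(self, context):
        super(BadMapper, self).__init__(context)
        self.acc = [0]  # a single list object, reused across records

    def map(self, context):
        self.acc[0] += 1
        # With a combiner, emitted pairs can sit in a buffer until the
        # combine step runs; since every pair references the same list,
        # later mutations overwrite the values that were "emitted" earlier.
        context.emit(context.key, self.acc)
        # Emitting an immutable value (e.g. an int) or a fresh copy avoids this.
```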

elzaggo commented 5 years ago

Approved.