crs4 / pydoop

A Python MapReduce and HDFS API for Hadoop
Apache License 2.0
237 stars 59 forks

Allow input split definition from Python #286

Closed: elzaggo closed this 6 years ago

elzaggo commented 6 years ago

This PR is expected to provide a solution to #265. It modifies it.crs4.pydoop.mapreduce.pipes.PipesNonJavaInputFormat so that PipesNonJavaInputFormat().getSplits(..) obtains InputSplits in one of two ways: by invoking the getSplits method of the 'actual' InputFormat specified by the user in mapreduce.pipes.inputformat; or, if mapreduce.pipes.external-splits.enabled has been set to true, by reading the contents of the HDFS file specified by the URI in mapreduce.pipes.external-splits.uri. The file at that URI should have the structure

WritableInt n OpaqueObject1 OpaqueObject2 .. OpaqueObjectn

simleo commented 6 years ago

Looks good. Now we need a Python-side example that uses this.
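
As a starting point for such an example, here is a minimal sketch of how a file with the structure described above (an integer count n followed by n opaque objects) could be produced from Python. The leading count uses the 4-byte big-endian encoding that Hadoop's IntWritable writes. Everything else is an assumption made for illustration and is not taken from this PR: the helper names `write_splits` and `read_splits`, the choice of pickle for the payload, and the length-prefixed framing of each opaque object. The sketch writes to an in-memory buffer; in a real job the stream would be a file opened on HDFS at the configured URI (for instance with `pydoop.hdfs.open`).

```python
import io
import pickle
import struct


def write_splits(stream, splits):
    """Write a split count followed by one opaque record per split."""
    # Hadoop's IntWritable is serialized as a 4-byte big-endian signed int.
    stream.write(struct.pack(">i", len(splits)))
    for split in splits:
        payload = pickle.dumps(split)
        # ASSUMPTION: each opaque object is a length-prefixed byte string.
        # The actual on-disk layout expected by the Java side is not
        # specified in this thread.
        stream.write(struct.pack(">i", len(payload)))
        stream.write(payload)


def read_splits(stream):
    """Read back what write_splits produced (same assumed framing)."""
    (n,) = struct.unpack(">i", stream.read(4))
    splits = []
    for _ in range(n):
        (size,) = struct.unpack(">i", stream.read(4))
        splits.append(pickle.loads(stream.read(size)))
    return splits


# Hypothetical split descriptions; any picklable object would do here.
example_splits = [
    {"path": "hdfs:///data/part-0", "offset": 0, "length": 128},
    {"path": "hdfs:///data/part-1", "offset": 0, "length": 256},
]

buf = io.BytesIO()
write_splits(buf, example_splits)
buf.seek(0)
recovered = read_splits(buf)
```

Keeping the payload opaque means the Java side only needs to count and forward the records, while the Python side decides what a split actually contains.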