Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0

Add partition-by, initially only to distinct #25

Closed: mbossenbroek closed this 10 years ago

mbossenbroek commented 10 years ago

@daveray @pathaks @johnmidgley

Added a :partition-by option to distinct. It should be relatively easy to add partitioners to other operators that could take them, but I only needed distinct for now.

Unfortunately, Hadoop doesn't allow parameters to be passed to partition functions, so the workaround is to generate a fixed pool of partitioner classes. Right now it generates 32 of them, but that number can easily be changed if needed. This also introduces state into the script generation process (which partitioner to use); that state is now passed explicitly through the script generation command.

The generated partitioners all extend the same base class and determine which config to load based on their name, which includes a numerical index.
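
To make the naming trick concrete, here is a minimal sketch in plain Java of how a numbered subclass could find its own configuration. It is not the code from this PR: `PartitionConfig`, `IndexedPartitioner`, `Partitioner0`, and `Partitioner1` are hypothetical names, the static map stands in for whatever the real job configuration holds, and the base class deliberately does not extend Hadoop's `Partitioner` so the example compiles without Hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical stand-in for the job configuration: maps a partitioner
// index to the partition function that index should use.
final class PartitionConfig {
    private static final Map<Integer, ToIntFunction<Object>> FUNCTIONS = new HashMap<>();

    static void register(int index, ToIntFunction<Object> fn) {
        FUNCTIONS.put(index, fn);
    }

    static ToIntFunction<Object> lookup(int index) {
        ToIntFunction<Object> fn = FUNCTIONS.get(index);
        if (fn == null) {
            throw new IllegalStateException("No partition function registered for index " + index);
        }
        return fn;
    }
}

// Shared base class (hypothetical). Subclasses carry no code of their own;
// the only thing that differs between them is the number in the class name.
abstract class IndexedPartitioner {
    // Parse the trailing digits of the concrete class name, e.g. "Partitioner1" -> 1.
    final int index() {
        String name = getClass().getSimpleName();
        int start = name.length();
        while (start > 0 && Character.isDigit(name.charAt(start - 1))) {
            start--;
        }
        if (start == name.length()) {
            throw new IllegalStateException("Class name has no numeric suffix: " + name);
        }
        return Integer.parseInt(name.substring(start));
    }

    // Look up this class's function by index, then clamp the result into range.
    final int getPartition(Object key, int numPartitions) {
        int raw = PartitionConfig.lookup(index()).applyAsInt(key);
        return Math.floorMod(raw, numPartitions);
    }
}

// Two members of the pool; the PR describes generating 32 of these.
final class Partitioner0 extends IndexedPartitioner {}
final class Partitioner1 extends IndexedPartitioner {}

final class PartitionerDemo {
    public static void main(String[] args) {
        PartitionConfig.register(0, key -> key.hashCode());
        PartitionConfig.register(1, key -> ((String) key).length());

        // Partitioner1 resolves index 1 from its own name: length 4 mod 3 partitions.
        System.out.println(new Partitioner1().getPartition("abcd", 3)); // prints 1
    }
}
```

Because each class needs no constructor arguments, the framework can instantiate it by name alone, which is the constraint described above. The cost is that the pool size caps how many distinct partition functions one script can use.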