mizitch closed 8 years ago
Note: Everything is pretty much a direct backport except for pom.xml. The Beam and Dataflow poms are different enough that it was easier to take the original pom.xml Marian had written for the earlier version of the sort library and manually make the necessary changes to dependencies and shading. So aside from the code review done on Marian's version of the file months ago, that file hasn't been code reviewed.
Ran mvn clean verify successfully on the sorter contrib module and on the parent (it looks like the parent pom doesn't build the contrib modules in the Dataflow SDK).
LGTM, thanks.
A contrib module that provides a PTransform which performs local (non-distributed) sorting. It sorts in memory until the buffer is full, then flushes to disk and uses external sorting.
Consumes a PCollection of KVs from primary key to an iterable of (secondary key, value) KVs and sorts each iterable by secondary key. Would probably be called after a GroupByKey. Uses coders to convert secondary keys and values into byte arrays and does a lexicographical comparison on the encoded secondary keys.
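To make the ordering concrete, here is a minimal sketch of sorting by the encoded bytes of the secondary key. It does not use the module's classes: `SecondaryKeySortSketch`, `encode`, and `sortBySecondaryKey` are hypothetical names, UTF-8 stands in for whatever coder the pipeline would use, and treating bytes as unsigned is an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SecondaryKeySortSketch {

    // Lexicographic comparison of two byte arrays, byte by byte.
    // Assumption: bytes are compared as unsigned values (0..255).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    }

    // Hypothetical stand-in for a coder: encodes a secondary key as UTF-8 bytes.
    static byte[] encode(String secondaryKey) {
        return secondaryKey.getBytes(StandardCharsets.UTF_8);
    }

    // Sorts one iterable of (secondary key, value) pairs by encoded secondary key.
    static List<Map.Entry<String, Integer>> sortBySecondaryKey(
            Iterable<Map.Entry<String, Integer>> values) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>();
        for (Map.Entry<String, Integer> e : values) {
            sorted.add(e);
        }
        sorted.sort((x, y) -> compareUnsigned(encode(x.getKey()), encode(y.getKey())));
        return sorted;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> values = new ArrayList<>();
        values.add(new AbstractMap.SimpleEntry<>("b", 2));
        values.add(new AbstractMap.SimpleEntry<>("a", 1));
        values.add(new AbstractMap.SimpleEntry<>("B", 3));
        for (Map.Entry<String, Integer> e : sortBySecondaryKey(values)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Because the comparison runs on encoded bytes, the resulting order depends on the coder rather than on the natural order of the key type. In this sketch "B" (0x42) sorts before "a" (0x61).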
Uses Hadoop as an external sorting library.
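The buffer-then-spill behaviour described above can be illustrated with a small standalone sketch. It is not the module's implementation and does not use Hadoop: `SpillingSorterSketch`, `maxBuffered`, and the temp-file run format are all hypothetical. It assumes records are strings without newlines, counts records rather than bytes, and compares with `String.compareTo` rather than on encoded bytes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class SpillingSorterSketch {
    private final int maxBuffered;                       // hypothetical limit, in records
    private final List<String> buffer = new ArrayList<>();
    private final List<Path> runs = new ArrayList<>();   // sorted runs already on disk

    SpillingSorterSketch(int maxBuffered) {
        this.maxBuffered = maxBuffered;
    }

    // Buffers a record in memory; spills a sorted run once the buffer is full.
    void add(String record) throws IOException {
        buffer.add(record);
        if (buffer.size() >= maxBuffered) {
            spill();
        }
    }

    int spilledRuns() {
        return runs.size();
    }

    // Sorts the current buffer and writes it to a temp file, one record per line.
    private void spill() throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile("sort-run", ".txt");
        run.toFile().deleteOnExit();
        Files.write(run, buffer, StandardCharsets.UTF_8);
        runs.add(run);
        buffer.clear();
    }

    // Returns all records in sorted order. A real sorter would stream the
    // result instead of materializing a list; a list keeps the sketch short.
    List<String> sorted() throws IOException {
        if (runs.isEmpty()) {
            Collections.sort(buffer);   // everything fit in memory
            return new ArrayList<>(buffer);
        }
        if (!buffer.isEmpty()) {
            spill();
        }
        // k-way merge: the heap always holds the smallest unread line of each run.
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (Path run : runs) {
            Cursor c = new Cursor(Files.newBufferedReader(run, StandardCharsets.UTF_8));
            if (c.head != null) {
                heap.add(c);
            } else {
                c.reader.close();
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);
            if (c.advance()) {
                heap.add(c);
            }
        }
        return out;
    }

    // Reader over one sorted run, positioned at its next unread line.
    private static final class Cursor {
        final BufferedReader reader;
        String head;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }

        // Moves to the next line; closes the reader and returns false at end of run.
        boolean advance() throws IOException {
            head = reader.readLine();
            if (head == null) {
                reader.close();
                return false;
            }
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        SpillingSorterSketch sorter = new SpillingSorterSketch(2);
        for (String s : new String[] {"d", "b", "a", "c", "e"}) {
            sorter.add(s);
        }
        System.out.println(sorter.sorted());
    }
}
```

With a limit of two records, the five inputs in `main` are written as three sorted runs and then merged. An input that never fills the buffer is sorted purely in memory and touches no files.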
Backport of https://github.com/apache/incubator-beam/pull/1199