NICTA / scoobi

A Scala productivity framework for Hadoop.
http://nicta.github.com/scoobi/
482 stars 97 forks source link

DList#groupBy and DList#groupWith cannot co-exist #221

Open tonymorris opened 11 years ago

tonymorris commented 11 years ago

Both DList#groupBy and DList#groupWith differ only in that one of them abuses implicit resolution. One of them must be deleted.

A quick-fix is to delete DList#groupWith where the abuse is manifested, but there may be a more significant improvement available by resolving the underlying tension that compels such an abuse.

espringe commented 11 years ago

Imho Grouping has been, and is, really useful, but perhaps it should be retired completely. It seems like it tries to do too much (e.g. (secondary) sorting), while not being able to do enough (it's can't use any information from the values, unlike hadoop's Partitioner) ..and its a little awkward to use (you just hope that keys/values in the subsequent parallel do obey your constraints.

So my proposal would be to just jack hadoops interface verbatim. (And use implicits to provide one for all types)

trait Partioner[K, V] {
     def partition(key: K, value: V, totalPartitions: Int): Int = mod(key.hashCode(), totalPartitions)
} 

And DList[(K,V)].groupByKey require a Ordering[K] along with a Partioner[K, V]. We could provide an implicit that provides a Partioner[K,V] for all K's and V's. Which would quite elegantly cover the basic uses.

Then to support the crazy-ass stuff, I think we could just have a single function like:

DList[(K,V)] { 
     def crazyDo[B](  // none of these implicit, since you really want to manually specify them
             partioner: Partioner[K,V],
             keyOrder:  Ordering[K], // order the keys are presented to the reducer
             grouping: KeyCollapser[K], // more on this below
     ): GroupedDList[K, V]
}

Where KeyCollapser might be defined as: trait KeyCollapser[K]{ def shouldCollapse[K](a: K, b: K): Boolean }

which conceptually is the function that slides over the list of Keys, after they have gone to a partition, and after hadoop has sorted them -- decided if they should be compressed into one or not.

(Although, it might be safer to not go with a KeyCollapser, and just require another Ordering).

And GroupedList[K, V] would be a special kind of a DList that only accepts a doFn (or something) where it guarantees that it'll be in the sorted/grouped fasion

Thoughts?

espringe commented 11 years ago

And I'd assume the way I propose could result in performance advantages -- as we would only have to tell hadoop to sort the keys if it is actually needed

blever commented 11 years ago

@tonymorris - can you add a comment discussing your similar experience with scalaz, e.g. 'add' vs 'multiply' monoids.