damballa / parkour

Hadoop MapReduce in idiomatic Clojure.
Apache License 2.0

Consider using Pangool? #1

Closed pereferrera closed 10 years ago

pereferrera commented 10 years ago

Would it be interesting for Parkour to use Pangool (http://pangool.net/) rather than Hadoop Java MapRed?

Pangool is a thin Java layer on top of Hadoop MapRed that makes most common tasks easier (e.g. joins, secondary sort) and enhances it (using instances rather than classes, making multiple outputs / inputs cleaner, providing proper text I/O formats, etc.) while keeping about the same performance (5% variation). By using a simple Tuple model, the limitations of key/value disappear, so one can essentially group by any combination of fields. It has no flow management and remains at the MapReduce level, making it a suitable tool for writing raw MapReduce jobs.
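To make the "group by any combination of fields" idea concrete, here is a minimal plain-Python sketch. It does not use Pangool or Hadoop at all; the record layout, field positions, and the `group_by_fields` helper are hypothetical and chosen only for illustration. It groups flat tuples by an arbitrary subset of fields and orders each group by further fields, which is roughly what a secondary sort provides.

```python
from itertools import groupby

# Hypothetical records: flat tuples of (user, country, timestamp, amount).
records = [
    ("ann", "ES", 3, 10.0),
    ("bob", "US", 1, 5.0),
    ("ann", "ES", 1, 7.5),
    ("ann", "US", 2, 1.0),
    ("bob", "US", 0, 2.5),
]


def group_by_fields(rows, group_idx, sort_idx=()):
    """Group rows by the fields at positions group_idx.

    Within each group, rows are ordered by the fields at positions
    sort_idx (an in-memory stand-in for a secondary sort).
    """
    def group_key(row):
        return tuple(row[i] for i in group_idx)

    def full_key(row):
        return tuple(row[i] for i in (*group_idx, *sort_idx))

    # Sorting by (group fields + sort fields) makes equal group keys
    # adjacent, which is what itertools.groupby requires.
    ordered = sorted(rows, key=full_key)
    return [(key, list(group)) for key, group in groupby(ordered, key=group_key)]


# Group by (user, country), order each group by timestamp.
groups = group_by_fields(records, group_idx=(0, 1), sort_idx=(2,))
for key, rows in groups:
    print(key, rows)
```

In a real MapReduce job the sort happens in the shuffle phase rather than in memory, but the relationship between the grouping fields and the sorting fields is the same.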

We had the idea of creating a Clojure API on top of Pangool (actually, we were first working on another abstraction for adding flow capabilities to Pangool, and planning to add Clojure on top of it), but have not finished it so far. We have been running Pangool for almost 2 years now and will be releasing a 1.0 version in the near future. We believe the tool is pretty stable and strong; we have used it for many of our clients and have heard of other use cases through the mailing list and so on.

If this is interesting at all we are keen to help.

llasram commented 10 years ago

Pangool and the Tuple MapReduce paper look interesting, and I will dig a bit further, but I don't believe moving Parkour to use Pangool fits my goals for Parkour. I'm trying to keep Parkour as close to Hadoop as possible, directly working with and exposing Hadoop interfaces wherever possible. Pangool looks far more lightweight than e.g. Cascading, but still appears to add more abstraction than I'd like. Thanks though, and I'll re-open this ticket if I change my mind after taking a deeper look.