Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0
566 stars 55 forks source link

CUBE/ROLLUP in PigPen #134

Open ljank opened 9 years ago

ljank commented 9 years ago

This project just blew my mind! I'm super excited to find it, but.. Are there any plans to introduce CUBE/ROLLUP functions? Are these on a roadmap?

I'd be happy to contribute, just that I'm not sure yet whether it won't contradict any future plans (i.e. migration to other backend; or this will be separate project?)

Thank you!

mbossenbroek commented 9 years ago

Glad you're enjoying it! I just read about cube/rollup this morning, but it seems pretty straightforward. From how I read it, it looks like it's just a mapcat operation to expand a group key into a sentinel version of the key:

[:a 42 "foo"] > ([:a 42 "foo"] [:a 42 nil] [:a nil "foo"] [nil 42 "foo"]) or [:a 42 "foo"] > ([:a 42 "foo"] [:a 42 nil] [:a nil nil] [nil nil nil])

Then you'd just group-by that set of keys instead of the original one.

Here's a plain clojure example I was playing with:

(defn cube []
  (let [sentinel '*
        cube-key (fn [& fns]
                   (let [key-fn (apply juxt fns)]
                     (fn [value]
                       (let [key (key-fn value)]
                         (cons
                           [key value]
                           (for [i (range (count key))]
                             [(assoc key i sentinel) value]))))))]
  (->> [{:a 1, :b 2, :c 3}
        {:a 1, :b 2, :c 4}
        {:a 2, :b 3, :c 5}]
    (mapcat (cube-key :a :b))
    (group-by first)
    (map (fn [[k vs]] [k (map second vs)])))))

To make this run in pigpen today, just add pig/ to the final mapcat, group-by, and map commands. I also preferred to use a different sentinel value than nil like pig does - just in case there are nils in the user data. Ideally we'd want a good default and allow the user to override it.

And then the rollup version seems to be just a different function of the grouping key:

(defn rollup []
  (let [sentinel '*
        rollup-key (fn [& fns]
                     (let [key-fn (apply juxt fns)]
                       (fn [value]
                         (let [key (key-fn value)]
                           (reductions
                             (fn [[k v] i]
                               [(assoc k (dec i) '*) v])
                             [key value]
                             (range (count key) 0 -1))))))]
    (->> [{:a 1, :b 2, :c 3}
          {:a 1, :b 2, :c 4}
          {:a 2, :b 3, :c 5}]
      (mapcat (rollup-key :a :b))
      (group-by first)
      (map (fn [[k vs]] [k (map second vs)])))))

So, that's what we want to do, now how & where do we implement this?

We could implement these to use Pig's CUBE/ROLLUP operators, but I doubt that they're doing anything special and it would block us from using these in other platforms. Since we can trivially implement these using the existing operators (and make them more flexible), let's stick with that for now. I can follow up & make sure that pig isn't doing any additional optimizations here that we would miss out on.

We could add cube-by and rollup-by operators as described above, which are just compositions of existing operators, but I generally try to stay away from adding a lot of new, special purpose operators that aren't in clojure.core.

I think a better generalization would be a group-by where the key-fn returns multiple values instead of one. Let's call it group-by-many for now, but I'm open to different names. Similar to how mapcat expects a function that returns a seq, so would group-by-many. We would then supply cube-key and rollup-key as described above, or the user could specify their own.

Then we'd end up with a commands like this:

(group-by-many (cube-key :a :b) relation) or (group-by-many (rollup-key :a :b) relation)

That said, you were right when you alluded to a different platform - we are about to release a cascading version of pigpen. This is probably the kick in the butt I need to finish that work. :) Of course, the remaining changes I have to make to that branch would most certainly break whatever work you would do to add these new operators - go figure.

Apologies for the long-winded ness of this response, but if you're still interested, take a look at join.clj where the existing group-by is defined. The change would be to create group-by-many and modify select->generate to accommodate a key-fn that returns multiple values.

Until then, the first example described above would certainly work as-is today. That would get you started without having to change anything.

Thoughts?

Thanks, Matt

ljank commented 9 years ago

Wow, thank you for so detailed and fast response!

You're right that it's simple to imitate that same behavior, just wanted to make sure I wasn't going to reinvent the wrong wheel. I'll try to implement cube/rollup solution sometime this week and will get back to you with the results.

Thanks again!