Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0

Does PigPen support subqueries (FOREACH within query)? #8

Closed: cmcarthur closed this issue 10 years ago

cmcarthur commented 10 years ago

I want to run a query exactly like the one described here:

http://mail-archives.apache.org/mod_mbox/pig-user/201109.mbox/%3CEF686A3D2941844BB7D33062F3E57C894C11A5C8E4@DEN-MEXMS-001.corp.ebay.com%3E

In my working example, I'm using clojure.core's sort-by and take as a workaround, but I'd love to be able to build this into my query.

Thanks!

mbossenbroek commented 10 years ago

PigPen does support subqueries, but it doesn't use Pig's nested inner bag syntax to execute them. It sounds like what you're doing is close to the recommended way in PigPen.

I would use max-key instead of the sort/take approach. It makes a single pass over each group, so it runs in linear time, O(n), and constant space, O(1), whereas sorting takes O(n log n) time.

    (ns pigpen-demo.core
      (:use clojure.test)
      (:require [pigpen.core :as pig]
                [clj-time.core :as time]
                [clj-time.coerce :as coerce]))

    (deftest test-subquery
      (let [command (->> (pig/return [["A" 10 (time/date-time 2011 01 01 23 59 00)]
                                      ["A" 11 (time/date-time 2011 01 01 23 59 59)]
                                      ["A" 12 (time/date-time 2011 01 01 23 00 59)]
                                      ["B" 20 (time/date-time 2011 02 01 01 00 00)]
                                      ["B" 21 (time/date-time 2011 02 02 01 00 00)]
                                      ["C" 30 (time/date-time 2011 03 01 03 00 00)]])
                         ;; group rows by their first field
                         (pig/group-by first)
                         ;; within each group, keep the row with the latest timestamp
                         (pig/map (fn [[_ values]]
                                    (apply max-key (fn [[_ _ dt]] (coerce/to-long dt)) values))))]
        (is (= (set (pig/dump command))
               #{["A" 11 (time/date-time 2011 01 01 23 59 59)]
                 ["B" 21 (time/date-time 2011 02 02 01 00 00)]
                 ["C" 30 (time/date-time 2011 03 01 03 00 00)]}))))
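To make the max-key versus sort/take comparison concrete, here is a minimal sketch in plain Clojure (no PigPen needed) that runs both approaches on a single group. The sample rows and the function names latest-by-sort and latest-by-max-key are hypothetical, and the timestamps are plain integers standing in for clj-time values.

```clojure
;; Hypothetical sample rows for one group: [key value timestamp].
;; The timestamps are plain integers chosen for illustration only.
(def rows
  [["A" 10 100]
   ["A" 11 300]
   ["A" 12 200]])

;; The sort/take workaround: order every row by timestamp, descending,
;; then keep the first one. Sorting costs O(n log n).
(defn latest-by-sort [rows]
  (first (take 1 (sort-by #(nth % 2) > rows))))

;; The max-key alternative: one pass over the rows, holding on to only
;; the best row seen so far.
(defn latest-by-max-key [rows]
  (apply max-key #(nth % 2) rows))

(latest-by-sort rows)    ;; => ["A" 11 300]
(latest-by-max-key rows) ;; => ["A" 11 300]
```

Both functions return the same row when the timestamps are distinct, as they are here. If two rows tie on the key, the two approaches may pick different rows.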

As to Pig's nested inner bag syntax, it really isn't buying you anything - it's just syntactic sugar in Pig. There's no performance advantage over simply writing a Clojure function that PigPen consumes via a Pig UDF. What you can do in a nested block is very limited, and for interesting queries, you often have to fall back to using UDFs anyway. Also, the Pig scripts generated by PigPen aren't intended to be edited or maintained by humans, so using a slightly different syntax to accomplish the same task didn't seem advantageous.

By contrast, since PigPen does everything in a UDF, you can do anything in your function. Any Clojure function is fair game, which makes it much more flexible than Pig's inner bag syntax. What you came up with (using sort-by and take) is exactly what makes PigPen powerful: you can use the full power of Clojure anywhere in your script.

Let me know if that helps or if it isn't what you were looking for.

Thanks, Matt

On Saturday, February 8, 2014 at 6:10 PM, Connor McArthur wrote:

I want to run a query exactly like the one described here: http://mail-archives.apache.org/mod_mbox/pig-user/201109.mbox/%3CEF686A3D2941844BB7D33062F3E57C894C11A5C8E4@DEN-MEXMS-001.corp.ebay.com%3E In my working example, I'm using clojure.core's sort-by and take as a workaround, but I'd love to be able to build this into my query. Thanks!


cmcarthur commented 10 years ago

Hey Matt,

This is incredibly informative and your max-key usage is enlightening.

Thanks so much for your help.

Best, Connor