Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0
567 stars 55 forks source link

pig/fold inconsistent behaviour #11

Closed chrisavl closed 10 years ago

chrisavl commented 10 years ago

I've noticed that pig/map in conjunction with a pig/fold behaves unexpectedly. The behaviour is also different from what we've seen running it on hadoop (latest AMI on EMR).

Here's a snippet explaining the problem; broken-fold-test returns [5] instead of [2]. This is because all the inputs to the count-true reduce function is wrapped in a vector, so all x evaluates to true. If this is expected behaviour what is wrong in this example?

(defn count-true []
  (let [combinef (fn
                   ([] 0)
                   ([a b] (+ a b)))
        reducef (fn [acc x] (if x (inc acc) acc))]
  (fold/fold-fn identity combinef reducef identity)))

(deftest working-fold-test
  (is (= (pig/dump (->> (pig/return [false false false true true])
                        (pig/fold (count-true))
                        ))
         [2])))

(deftest working-map-test
  (is (= (pig/dump (->> (pig/return [-2 -1 0 1 2])
                        (pig/map #(> % 0))
                        ))
         [false false false true true])))

(deftest broken-fold-test
  (is (= (pig/dump (->> (pig/return [-2 -1 0 1 2])
                        (pig/map #(> % 0))
                        (pig/fold (count-true))
                        ))
         [2]))) ; fails
mbossenbroek commented 10 years ago

That does sound like a bug. When you say the behavior is different on hadoop, what do you mean? Does the same code work there?

Thanks, Matt

On Monday, March 24, 2014 at 7:17 AM, Christian Lundgren wrote:

I've noticed that pig/map in conjunction with a pig/fold behaves unexpectedly. The behaviour is also different from what we've seen running it on hadoop (latest AMI on EMR). Here's a snippet explaining the problem; broken-fold-test returns [5] instead of [2]. This is because all the inputs to the count-true reduce function is wrapped in a vector, so all x evaluates to true. If this is expected behaviour what is wrong in this example? (defn count-true [](let [combinef %28fn %28[] 0%29 %28[a b] %28+ a b%29%29%29 reducef %28fn [acc x] %28if x %28inc acc%29 acc%29%29] %28fold/fold-fn identity combinef reducef identity%29)) (deftest working-fold-test (is (= (pig/dump (->> (pig/return [false false false true true]) (pig/fold (count-true)) )) [2]))) (deftest working-map-test (is (= (pig/dump (->> (pig/return [-2 -1 0 1 2]) (pig/map #(> % 0)) )) [false false false true true]))) (deftest broken-fold-test (is (= (pig/dump (->> (pig/return [-2 -1 0 1 2]) (pig/map #(> % 0)) (pig/fold (count-true)) )) [2]))) ; fails

— Reply to this email directly or view it on GitHub (https://github.com/Netflix/PigPen/issues/11).

chrisavl commented 10 years ago

Yep, the same code works on hadoop and gives the correct output.

mbossenbroek commented 10 years ago

Found the problem. Pig has some annoying implicit logic with how it dereferences fields within a tuple. It looks like I missed the local conversion for group-all, meaning reduce would also have this issue.

I just released PigPen 0.2.1 to maven - it should show up in an hour or two. Once you can get it, could you confirm the fix and let me know?

Nice find and thanks for the very detailed repro!

Thanks, Matt

On Monday, March 24, 2014 at 8:51 AM, Christian Lundgren wrote:

Yep, the same code works on hadoop and gives the correct output.

— Reply to this email directly or view it on GitHub (https://github.com/Netflix/PigPen/issues/11#issuecomment-38461114).

chrisavl commented 10 years ago

Sure, 0.2.1 does indeed fix it. Thanks for the quick fix!

mbossenbroek commented 10 years ago

FWIW, you don't actually need to define a fold-fn separately as in your example. You can compose fold functions like you would other seq operators:

(pig/dump
(->> (pig/return [-2 -1 0 1 2]) (pig/map #(> % 0)) (pig/fold (->> (fold/filter identity) (fold/count)))))

Or more simply:

(pig/dump
(->> (pig/return [-2 -1 0 1 2]) (pig/fold (->> (fold/filter #(> % 0)) (fold/count)))))

See more here: https://github.com/Netflix/PigPen/wiki/Folding-Data

Not sure if you wrote it that way just for an example, but thought this might be helpful.

Enjoy!

-Matt

On Tuesday, March 25, 2014 at 6:16 AM, Christian Lundgren wrote:

Sure, 0.2.1 does indeed fix it. Thanks for the quick fix!

— Reply to this email directly or view it on GitHub (https://github.com/Netflix/PigPen/issues/11#issuecomment-38562312).