Closed chrisavl closed 10 years ago
That does sound like a bug. When you say the behavior is different on hadoop, what do you mean? Does the same code work there?
Thanks, Matt
On Monday, March 24, 2014 at 7:17 AM, Christian Lundgren wrote:
I've noticed that pig/map in conjunction with a pig/fold behaves unexpectedly. The behaviour is also different from what we've seen running it on hadoop (latest AMI on EMR). Here's a snippet explaining the problem; broken-fold-test returns [5] instead of [2]. This is because all the inputs to the count-true reduce function is wrapped in a vector, so all x evaluates to true. If this is expected behaviour what is wrong in this example? (defn count-true [](let [combinef %28fn %28[] 0%29 %28[a b] %28+ a b%29%29%29 reducef %28fn [acc x] %28if x %28inc acc%29 acc%29%29] %28fold/fold-fn identity combinef reducef identity%29)) (deftest working-fold-test (is (= (pig/dump (->> (pig/return [false false false true true]) (pig/fold (count-true)) )) [2]))) (deftest working-map-test (is (= (pig/dump (->> (pig/return [-2 -1 0 1 2]) (pig/map #(> % 0)) )) [false false false true true]))) (deftest broken-fold-test (is (= (pig/dump (->> (pig/return [-2 -1 0 1 2]) (pig/map #(> % 0)) (pig/fold (count-true)) )) [2]))) ; fails
— Reply to this email directly or view it on GitHub (https://github.com/Netflix/PigPen/issues/11).
Yep, the same code works on hadoop and gives the correct output.
Found the problem. Pig has some annoying implicit logic with how it dereferences fields within a tuple. It looks like I missed the local conversion for group-all, meaning reduce would also have this issue.
I just released PigPen 0.2.1 to maven - it should show up in an hour or two. Once you can get it, could you confirm the fix and let me know?
Nice find and thanks for the very detailed repro!
Thanks, Matt
On Monday, March 24, 2014 at 8:51 AM, Christian Lundgren wrote:
Yep, the same code works on hadoop and gives the correct output.
— Reply to this email directly or view it on GitHub (https://github.com/Netflix/PigPen/issues/11#issuecomment-38461114).
Sure, 0.2.1 does indeed fix it. Thanks for the quick fix!
FWIW, you don't actually need to define a fold-fn separately as in your example. You can compose fold functions like you would other seq operators:
(pig/dump
(->> (pig/return [-2 -1 0 1 2])
(pig/map #(> % 0))
(pig/fold (->> (fold/filter identity) (fold/count)))))
Or more simply:
(pig/dump
(->> (pig/return [-2 -1 0 1 2])
(pig/fold (->> (fold/filter #(> % 0)) (fold/count)))))
See more here: https://github.com/Netflix/PigPen/wiki/Folding-Data
Not sure if you wrote it that way just for an example, but thought this might be helpful.
Enjoy!
-Matt
On Tuesday, March 25, 2014 at 6:16 AM, Christian Lundgren wrote:
Sure, 0.2.1 does indeed fix it. Thanks for the quick fix!
— Reply to this email directly or view it on GitHub (https://github.com/Netflix/PigPen/issues/11#issuecomment-38562312).
I've noticed that pig/map in conjunction with a pig/fold behaves unexpectedly. The behaviour is also different from what we've seen running it on hadoop (latest AMI on EMR).
Here's a snippet explaining the problem;
broken-fold-test
returns[5]
instead of[2]
. This is because all the inputs to thecount-true
reduce function is wrapped in a vector, so allx
evaluates to true. If this is expected behaviour what is wrong in this example?