Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0
565 stars 55 forks source link

NullPointer exception when folding over empty set locally #35

Closed chrisavl closed 10 years ago

chrisavl commented 10 years ago

You can trigger the bug with:

(pig/dump (pig/fold (fold/count) (pig/filter string? (pig/return [:foo, 42]))))

You'd expect it to dump [0] instead of throwing a NullPointer exception, maybe related to #34?

mbossenbroek commented 10 years ago

I found the problem, but the fix is a little weird. The NPE was from some Pig java code that doesn't handle nils well - that was easy to fix.

It did, however, expose an oddity in Pig. If you do a group-all on an empty sequence, Pig returns an empty sequence. Not a sequence with a single empty bag (as it should), but an empty sequence.

For now, I'm going to update PigPen to match Pig, but that means you'll see some weird stuff like this:

=> (->>
     (pig/return [])
     (pig/fold (fold/count))
     (pig/dump))
[]
=> (->>
     (pig/return [])
     (pig/reduce +)
     (pig/dump))
[]
=> (->>
     (pig/return [])
     (pig/into {})
     (pig/dump))
[]

I'll follow up with the Pig guys, but I have a gut feeling they'll say this behavior is intended (I'm guessing there's code somewhere that does this explicitly).

Off the top of my head, the only way to fix this in PigPen would be to add a placeholder value to every sequence & ignore it later. I use a similar strategy to corral joining nils into something sane, but I don't think that same approach would be as easy here.

I'll get the above fix out later today so at least it doesn't blow up when reducing an empty seq & open another issue to track fixing the problem correctly.

Thanks for the bug report!

mbossenbroek commented 10 years ago

It was as I expected. When Pig sees no data returned from one stage, it bails out & doesn't run the rest, which is actually really nice most of the time.

The more I think about this, I can't think of a compelling use case for correcting this. It really bothers me that it's not correct, but it would be a kludgy fix without support from Pig & I'm struggling to see a real world use case where the difference between a 0 and an empty output file would be meaningful.

If you can think of a valid use case, please let me know & I'll look into it further. Obviously I'll still fix it to not blow up, but for the time being I might just go with the behavior above.

mbossenbroek commented 10 years ago

The new build has been released to maven. Should be available in ~1hr

chrisavl commented 10 years ago

Thanks for looking into it so quickly. Agreed, matching pig's behaviour makes total sense.