Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0

Don't require join function to be anonymous #19

Closed kyptin closed 10 years ago

kyptin commented 10 years ago

With PigPen 0.2.3, I was using join, but instead of passing an anonymous function inline in the join call, I defined a function with defn and passed it by name. In other words, instead of:

(join [(xs :on first)
       (ys :on first)]
      (fn [x y] ...))

...I was doing:

(defn foo [x y] ...)
(join [(xs :on first)
       (ys :on first)]
      foo)

The second version produced output with the same structure as the first version, except that there were nils in most places. My guess is that the macros aren't quite evaluating things properly, but I don't know this for sure.

I can't share more details, unfortunately, as they are proprietary. Although, if you're having difficulty reproducing this issue, I can try to reproduce it in a way that I can share.

An issue which may be related is that a print statement in the function in version 1 works, but in version 2 it does not print anything.

Thanks very much! -Jeff T.

mbossenbroek commented 10 years ago

Unfortunately I can't repro that one.

Does the problem happen locally, on the cluster, or both? What's the type of the key you're trying to join on?

This works for me:

(defn foo [x y]
  (prn x y)
  {:x x, :y y})

(deftest test-join
  (let [xs (pig/return [[1 "a"] [1 "b"] [2 "a"]])
        ys (pig/return [[1 "a"] [2 "b"] [2 "a"]])
        command (pig/join [(xs :on first)
                           (ys :on first)]
                          foo)]
    (is (= (pig/dump command)
           [{:x [2 "a"], :y [2 "b"]}
            {:x [2 "a"], :y [2 "a"]}
            {:x [1 "a"], :y [1 "a"]}
            {:x [1 "b"], :y [1 "a"]}]))))

I can also print from the function:

=> (test-join)
[2 "a"] [2 "b"]
[2 "a"] [2 "a"]
[1 "a"] [1 "a"]
[1 "b"] [1 "a"]
nil

Sometimes when running locally, code will execute on other threads. At least for CCW, this causes it to appear in the console instead of the REPL, which is kind of annoying. If you're using CCW, could you check the console output? If not, what editor are you using?
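The thread behavior described here is general Clojure rather than anything specific to PigPen: `*out*` is a per-thread binding, so a function run on a plain new thread prints to the process's root output stream instead of the stream the REPL has bound. A minimal sketch of that effect follows. The helper names `print-on-plain-thread` and `print-on-bound-thread` are hypothetical and are not part of PigPen.

```clojure
;; Hypothetical helpers for illustration only; PigPen does not define these.

(defn print-on-plain-thread
  "Runs println on a new thread that does not inherit the caller's bindings."
  [msg]
  (doto (Thread. (fn [] (println msg)))
    (.start)
    (.join)))

(defn print-on-bound-thread
  "Same as above, but bound-fn conveys the caller's bindings (including *out*)."
  [msg]
  (doto (Thread. (bound-fn [] (println msg)))
    (.start)
    (.join)))

;; with-out-str rebinds *out* on the calling thread only.
(def plain-captured (with-out-str (print-on-plain-thread "hello")))
(def bound-captured (with-out-str (print-on-bound-thread "hello")))
```

In this sketch `plain-captured` ends up empty, because the plain thread wrote to the root `*out*`, while `bound-captured` contains the message. A REPL session binds `*out*` in a similar way, which is one plausible reason output from another thread lands in the console.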

To repro, what commands are you using before the join? Are you loading data from a file, doing any transformations, etc?

Thanks, Matt


kyptin commented 10 years ago

I'm running locally, in a lein repl session. I'm using vim to edit the code.

I'm trying to join vectors. The key function for each vector is simply first.

I am doing a variety of transformations before the join, but I am not loading from a file.

I'll try to create a reproducible failure case tonight or tomorrow.

mbossenbroek commented 10 years ago

Thanks. What's the data type of the join key?

Are you joining large maps or data structures? Or is it joining numbers, strings, keywords, or some other primitive?

The thread-switching happens when you're locally reading from a file, so that's the only reason I can think of for the printing not working.

The example I listed before prints when I run from a lein repl too.

Let me know what you can come up with for a repro case!

-Matt


kyptin commented 10 years ago

I'm joining on strings, so yeah, it's a primitive.

Heh, I guess it's on me to reproduce this, then—you've certainly done your due diligence. Thanks!

mbossenbroek commented 10 years ago

I followed up with Jeff on another thread & we found that the problem was a stale fn in the REPL. Restarting the REPL fixed the issue.

Right now I'm memoizing user functions based on what you pass to the pigpen operator. This has the unfortunate side effect of using stale versions of named functions. In your case this means that if you load foo, load the join, and then modify foo, it'll use the first version.

The reason for this is partly historical and partly performance. I never want to re-eval the same code on the cluster, and on the cluster the code never changes, hence the memoization. In the past, defining the function outside the operator call (not inline) wasn't supported, so this wasn't a problem.
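The stale-function behavior can be reproduced outside PigPen. The following is a rough sketch under stated assumptions, not PigPen's actual code: `eval-user-fn` is a hypothetical stand-in for a cache that evaluates a quoted function expression once and reuses the result for the same expression.

```clojure
;; Hypothetical stand-in for a memoized evaluator; not PigPen's implementation.
(def eval-user-fn
  (memoize (fn [code] (eval code))))

(defn foo [x y] {:x x, :y y})

;; First use: the symbol is evaluated and the resulting fn is cached.
(def before ((eval-user-fn `foo) 1 2))

;; Redefine foo, as you would while iterating in a REPL.
(defn foo [x y] {:x x, :y y, :version 2})

;; The cache key (the symbol) is unchanged, so the old fn is returned.
(def after ((eval-user-fn `foo) 1 2))

;; Calling foo directly sees the new definition.
(def direct (foo 1 2))
```

Here `before` and `after` are both `{:x 1, :y 2}`, while `direct` includes `:version 2`. An inline `(fn [x y] ...)` avoids the problem in this sketch because editing its body changes the expression used as the cache key.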

Fix coming soon...

-Matt
