Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0
565 stars 55 forks source link

Custom loaders can't use functions defined elsewhere #46

Closed pkozikow closed 10 years ago

pkozikow commented 10 years ago

The example on the wiki shows this:

(defn square [x]
  (* x x))

(defn baz [data]
  (->> data
    (pig/map square)))

I'm trying to use a function like square above from inside a custom loader (raw/load$), but that results in an "Unable to resolve symbol" exception. Defining the function in a separate namespace and requiring it doesn't work either.

I'd guess this could be solved the same way as for operations like pig/map in the example.

pkozikow commented 10 years ago

Update: using functions in a separate ns using the full name works (e.g. my.full.ns/my-func), but one can't use an ns alias (my.full.ns :as x, and then x/my-func)

mbossenbroek commented 10 years ago

What you need is pigpen.code/trap. This function does the job of trapping the lexical scope and current namespace. It re-writes the user function & wraps it as such:

(defn foo [x]
  (let [y 42]
    (pigpen.code/trap (fn [z] (+ x y z)))))

If you call this, it will produce this:

=> (foo 123)

(pigpen.pig/with-ns pigpen-demo.core
  (clojure.core/let [y (quote 42)
                     x (quote 123)]
    (fn [z] (+ x y z))))

Here's how it's used in pigpen.core/map: https://github.com/Netflix/PigPen/blob/master/pigpen-core/src/main/clojure/pigpen/map.clj#L57

This function is generally used within a macro to prevent clojure from evaluating the function. The map example above is a good pattern to use - have the macro only add the code/trap and then call map* with the trapped function.

Here's some skeleton code for a loader that would trap a function:

(defn square [x]
  (* x x))

(defn load-custom* [location f]
  (->
    (raw/load$ location '[value] raw/default-storage {})
    (raw/bind$ `(pigpen.pig/map->bind ~f) {})))

(defmacro load-custom [location f]
  `(load-custom* ~location (pigpen.code/trap ~f)))

Let me know if that helps.

Thanks, Matt

pkozikow commented 10 years ago

Thanks for the prompt response. I'm just getting my head around this, but wouldn't it be possible to make map->bind be a macro that does this so that the user doesn't need to care about it? (just like map).

mbossenbroek commented 10 years ago

It might be possible to clean up the way that custom loaders are implemented (I certainly didn't expect it to be one of the most common things users would do), but you'd still be stuck with using macros & pigpen.code/trap if you need to take a user function as an argument. If you don't, clojure evaluates the function and we get a compiled function, which can't easily be serialized into a pigpen script.

Generally, trap is separate from map->bind because it's used in many other places & it's usually easier to pick which of the bind operations is appropriate for the situation. There's also pigpen.pig/filter->bind and pigpen.pig/mapcat->bind, which convert those types of functions into mapcat-like functions that can be squished into one big mapcat operation. This way it keeps these distinct operations de-complected from one another.

That said, if you've got any ideas for making it cleaner, I'm totally open to making changes to this part of pigpen. I've looked at a number of different loader implementations & haven't seen a good generalization yet.

pkozikow commented 10 years ago

Update: I spoke too soon. Using functions from a separate ns as described above only works locally. When I tried it on the cluster, I got a ClassNotFoundException. After some experimenting, the only way I could get it to work is to add :gen-class to that ns with the method I'm using defined with :methods. At that point I gave up and implemented the function as a Java static method. If (when) this comes up again, I'll try to use the trap approach you suggest above.

By the way, this was the last hurdle before being able to run my first PigPen job on our real cluster. Hooray!

mbossenbroek commented 10 years ago

Oh no! You definitely shouldn't have to go to that extent. I thought I had it set up such that if it worked in the repl it would work in the cluster, but it looks like I missed a case. I might have to start using something that will produce a clean, isolated ns when running locally.

pigpen.code/trap is definitely what you're looking for though. Feel free to send me your loader code & I can help you get trap working with it.