Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0

Libraries/Functions in closures #136

Closed ljank closed 9 years ago

ljank commented 9 years ago

As stated in the docs:

PigPen supports a number of different types of closures. (...) Compiled functions and mutable structures like atoms won't work.

We have time in milliseconds in our data and would like to format it as YYYY-MM-DD, but that seems impossible for the reasons above :( Is there any workaround to make functions work? Otherwise this statement looks far-fetched:

There are no special user defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program.

Thank you!

mbossenbroek commented 9 years ago

This just means that you can't close over a compiled function. For example:

(require '[pigpen.core :as pig]
         '[simple-time.core :as st])

;; works
(defn format-ts [data]
  (pig/map (fn [x] (st/format x :date)) data))

(format-ts my-data)

;; works
(defn format-ts [data format]
  (pig/map (fn [x] (st/format x format)) data))

(format-ts my-data :date)

;; won't work because f is compiled
(defn format-ts [data f]
  (pig/map f data))

(format-ts my-data (fn [x] (st/format x :date)))

There is a way around this, but it's not officially supported yet:

(defn format-ts [data f]
  (pigpen.map/map* f data))

(format-ts my-data (pigpen.code/trap (fn [x] (st/format x :date))))

Let me know if that's not clear or if you have a specific example of what you're trying to do.

Also, check out pigpen-support@googlegroups.com or https://groups.google.com/forum/#!forum/pigpen-support for future questions.

ljank commented 9 years ago

I still get CompilerException java.lang.RuntimeException: No such namespace: st in the cases that are meant to work :\

mbossenbroek commented 9 years ago

Could you send a code sample and stack trace that you get?

mbossenbroek commented 9 years ago

Might be worth mentioning: any code that you close over needs to be in a file that will end up in the uberjar that goes to Hadoop. If you're just in the user ns in a REPL, the code I listed won't work.

If that's the case, let me know if that's not clear from the docs & I can update them.
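To illustrate how an error like "No such namespace: st" can arise, here is a minimal sketch in plain Clojure. It is not PigPen's implementation; it only assumes that a quoted form using an alias ends up being evaluated in a namespace where that alias was never set up. The names `eval-in`, `root-message`, and `hypothetical.remote-ns` are made up for the example, and `clojure.string` stands in for `simple-time.core`.

```clojure
(require '[clojure.string :as s])   ; alias `s` exists only in the current namespace

;; A quoted form keeps the alias symbol `s/upper-case` unresolved.
(def form '(s/upper-case "pigpen"))

(defn eval-in
  "Evaluates form with *ns* bound to the namespace named ns-sym,
  creating that namespace if it does not exist yet."
  [ns-sym form]
  (binding [*ns* (create-ns ns-sym)]
    (eval form)))

(defn root-message
  "Returns the message of the innermost cause of t."
  [^Throwable t]
  (if-let [cause (.getCause t)]
    (recur cause)
    (.getMessage t)))

;; In the namespace that owns the alias, the form evaluates fine.
(eval-in (ns-name *ns*) form)        ; => "PIGPEN"

;; In a namespace without the alias (hypothetical name), compilation fails.
(try
  (eval-in 'hypothetical.remote-ns form)
  (catch Throwable t
    (root-message t)))               ; message mentions: No such namespace: s
```

Keeping the code in a real namespace that is packaged with the job means the `:require` aliases can be re-established wherever the code is loaded, which is consistent with the advice above.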

ljank commented 9 years ago

I've noticed that it behaves differently when using pig/return versus loading data from files (same for JSON and Avro). This works just fine:

(require '[simple-time.core :as st])

(defn time->ymd
  [data]
  (pig/map (fn [entry]
             (assoc entry
               :ymd (st/format (st/datetime (:time entry)) :date)))
           data))

(->> (pig/return [{:time 1425254400010} {:time 1425254400019} {:time 1425254400090}])
     (time->ymd)
     (pig/dump))
; [{:ymd "2015-03-02", :time 1425254400010}
;  {:ymd "2015-03-02", :time 1425254400019}
;  {:ymd "2015-03-02", :time 1425254400090}]

For JSON:

(spit "/tmp/events.json" "{\"time\": 1425254400010}\n{\"time\": 1425254400019}\n{\"time\": 1425254400090}")
(->> (pig/load-json "/tmp/events.json")
     (time->ymd)
     (pig/dump))

CompilerException java.lang.RuntimeException: No such namespace: st

Same error while using Avro.

mbossenbroek commented 9 years ago

Yeah, it sounds like you're in a user ns. This complete example works for me:

(ns pigpen-demo.core
  (:require [pigpen.core :as pig]
            [simple-time.core :as st]))

(defn time->ymd
  [data]
  (pig/map (fn [entry]
             (assoc entry
               :ymd (st/format (st/datetime (:time entry)) :date)))
           data))

(clojure.pprint/pprint
  (->> (pig/return [{:time 1425254400010} {:time 1425254400019} {:time 1425254400090}])
       (time->ymd)
       (pig/dump)))

(spit "/tmp/events.json" "{\"time\": 1425254400010}\n{\"time\": 1425254400019}\n{\"time\": 1425254400090}")

(clojure.pprint/pprint
  (->> (pig/load-json "/tmp/events.json")
       (time->ymd)
       (pig/dump)))

and produces this output:

[{:ymd "2015-03-01", :time 1425254400010}
 {:ymd "2015-03-01", :time 1425254400019}
 {:ymd "2015-03-01", :time 1425254400090}]
Start reading from  /tmp/events.json
Stop reading from  /tmp/events.json
[{:ymd "2015-03-01", :time 1425254400010}
 {:ymd "2015-03-01", :time 1425254400019}
 {:ymd "2015-03-01", :time 1425254400090}]
nil
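
The dates here read 2015-03-01, while the earlier run of the same function printed 2015-03-02. This is most likely a time-zone effect rather than a bug: 1425254400010 ms is 2015-03-02T00:00:00.010 in UTC, which is still March 1 anywhere west of UTC. A minimal sketch with java.time (not simple-time; `millis->ymd` and the zone ids are assumptions chosen for illustration, not the zones either machine actually used):

```clojure
(import '(java.time Instant ZoneId)
        '(java.time.format DateTimeFormatter))

(defn millis->ymd
  "Formats epoch milliseconds as yyyy-MM-dd in the zone named by zone-id."
  [millis zone-id]
  (-> (Instant/ofEpochMilli millis)
      (.atZone (ZoneId/of zone-id))
      (.toLocalDate)
      (.format DateTimeFormatter/ISO_LOCAL_DATE)))

(millis->ymd 1425254400010 "UTC")                 ; => "2015-03-02"
(millis->ymd 1425254400010 "America/Los_Angeles") ; => "2015-03-01"
```

Passing the zone explicitly, as this sketch does, keeps the output the same regardless of the default zone of the machine running the job.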

If you're in a file & still getting that exception, could you run these commands in the REPL and let me know what you get?

(pigpen.code/trap identity)
(ns-name *ns*)
(pigpen.code/ns-exists *1)

ljank commented 9 years ago

You're right: everything works fine when running from a file and not in the user namespace. Thank you for the lightning-fast and correct diagnosis!

Next time I'll use the mailing list. Sorry!