Netflix / PigPen

Map-Reduce for Clojure

Mapping over many input files #42

Closed anthgur closed 10 years ago

anthgur commented 10 years ago

Given partial input file names that are part of pairs with something like "x_Start.csv", "x_End.csv":

(def inputs ["abc" "def" "ghi" "jkl"])

Each file in a pair has an ID column, and some ids exist in only one of the two files. What I want to do is join each pair on its ids, then concat the results of all the joins to use as input for further processing.

My approach thus far has been something similar to this:

;; load xyz_Start.csv and xyz_End.csv and return a [start end] pair
;; of loaded data from each file
(defn load-by-xyz [xyz] ...) 
;; manipulate the [start end] pair and perform the join
(defn do-stuff [...]) 
(pig/mapcat #(-> (load-by-xyz %) do-stuff) inputs)

I've tried this multiple ways, but I always get:

java.lang.AssertionError: Assert failed: (map? relation)

If I run the load -> do-stuff portion by itself on a single pair it works, so my guess is that it has something to do with inputs not being data that pig can recognize.

Any suggestions as to how I can get something like this to work?

Thanks

mbossenbroek commented 10 years ago

That error indicates that what you're trying to use as an input relation isn't an input relation. In this case, it looks like you're using strings.
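To make the failure concrete, here is a minimal sketch in plain Clojure. It assumes only what the assertion text itself suggests, namely that PigPen checks its relation argument with `map?`; the example map below is a hypothetical stand-in, not PigPen's actual relation structure.

```clojure
;; The value passed where pig/mapcat expects a relation: a vector of strings.
(def inputs ["abc" "def" "ghi" "jkl"])

;; A vector is not a map, so a (map? relation) check rejects it.
(def inputs-is-map? (map? inputs))

;; Hypothetical stand-in for a relation-like value; the real keys are not shown in this thread.
(def relation-like {:hypothetical-key :load})
(def relation-like-is-map? (map? relation-like))
```

In other words, the strings in `inputs` are ordinary Clojure data that still has to be turned into relations (by loading files) before any PigPen operator can consume it.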

I think what you want is just a regular clojure.core/map to create a bunch of inputs:

(->> inputs
     (map load-by-xyz)
     (map do-stuff)
     (apply pig/concat))

What this does is create a sequence of relations, one for each value in inputs. We then apply the do-stuff function to each of them and apply pig/concat at the end. This is the equivalent of taking a bunch of relations & unioning them together.
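The shape of that pipeline can be tried out without a cluster. The following is a sketch in plain Clojure only: ordinary sequences of maps stand in for relations, `clojure.core/concat` stands in for `pig/concat`, and the bodies of `load-by-xyz` and `do-stuff` as well as the `fake-files` data are hypothetical placeholders, since the original functions are elided in this thread.

```clojure
;; Hypothetical in-memory "files"; real code would load CSVs instead.
(def fake-files
  {"abc_Start.csv" [{:id 1 :start 10} {:id 2 :start 20}]
   "abc_End.csv"   [{:id 1 :end 15}]
   "def_Start.csv" [{:id 3 :start 30}]
   "def_End.csv"   [{:id 3 :end 35} {:id 4 :end 45}]})

(defn load-by-xyz
  "Returns a [start end] pair of row seqs for the given file-name prefix."
  [xyz]
  [(get fake-files (str xyz "_Start.csv") [])
   (get fake-files (str xyz "_End.csv") [])])

(defn do-stuff
  "Joins a [start end] pair on :id, keeping ids that appear in only one file."
  [[start end]]
  (->> (concat start end)
       (group-by :id)
       (sort-by key)
       (map (fn [[_ rows]] (apply merge rows)))))

;; Build one joined result per prefix, then union them all together.
(def joined
  (->> ["abc" "def"]
       (map load-by-xyz)
       (map do-stuff)
       (apply concat)))
```

The point of the sketch is that the outer `map` calls run over plain Clojure data (the file-name prefixes), while the PigPen operators would only appear inside the loading and joining functions and in the final concat.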

Let me know if that's what you're looking for. I wasn't quite sure what you meant by the [start end] pairs or the joins.

Hope that helps.

-Matt

anthgur commented 10 years ago

That seems to be exactly what I was going for. I confused myself by thinking that I would need to use the pigpen versions of the clojure.core functions for all operations.

Thanks