Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0

problem with load-tsv function #57

Closed micrub closed 10 years ago

micrub commented 10 years ago

Hi,

I am trying to run the following function on a TSV file with more than 100k lines, running it locally on my laptop. The function looks like this:

(defn- hashed-data  [file-name]
  (->>
    (pig/load-tsv file-name)
    (pig/map  (fn  [[ & args]]
                [args]))))

Instead of getting the same number of lines as in the input file, I always get only 1000 items. Am I missing something obvious?

mapstrchakra commented 10 years ago

You need to change the binding that controls how many results are returned. By default the REPL returns 1000 items:

(def ^:dynamic *max-load-records* 1000)

You can wrap your function with a *max-load-records* binding as follows:

(defn- hashed-data [file-name]
  (binding [pigpen.local/*max-load-records* 100000]
    (->>
      (pig/load-tsv file-name)
      (pig/map (fn [[& args]]
                 [args])))))
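For context, binding temporarily rebinds a dynamic var for the dynamic extent of the binding form only. A minimal sketch with a stand-in var (plain Clojure, not pigpen.local itself):

```clojure
;; A dynamic var, analogous to pigpen.local/*max-load-records*.
(def ^:dynamic *max-load-records* 1000)

(defn effective-limit []
  ;; Reads whatever value is bound at call time.
  *max-load-records*)

(effective-limit)                        ;; => 1000

;; Inside the binding form, the thread-local value applies.
(binding [*max-load-records* 100000]
  (effective-limit))                     ;; => 100000

;; Outside it, the root value is back in effect.
(effective-limit)                        ;; => 1000
```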


mbossenbroek commented 10 years ago

Yeah, I put that cap in there because the version of rx I'm using doesn't unsubscribe properly from the observable. It's kind of a hacky fix, but this prevents you from processing potentially large files just to throw the result away. In general, the REPL should only be used for vetting your code & then you'd run at scale on the cluster, but 100k should be well within the limits of what it can handle locally.

At Netflix, we sample large GB files over the network directly into pigpen - without this limit it was just continuing to download the file on a background thread & slowing down the REPL. This was painful when I just wanted the first 10 records.

The longer term fix is to upgrade the version of rx that I'm using, but they tend to break their API frequently so I've been waiting for v1.0 to be released.

-Matt


micrub commented 10 years ago

Thanks for the clarification, though the following wrapping didn't solve the issue in the REPL; I'm still getting 1000 items back:

(binding [pigpen.local/*max-load-records* 100000])

mbossenbroek commented 10 years ago

Your example puts the closing paren right after the binding expression… To use binding, you need to enclose the code that requires the rebinding inside the binding form.

(binding [pigpen.local/*max-load-records* 100000]
  (->> (pig/load-tsv …)
       (pig/dump)))

In this case, you'd want to make sure that the code calling pig/dump is what gets wrapped, not the load command. The load command just builds an expression tree.

(def x (pig/load …))

(binding [pigpen.local/*max-load-records* 100000]
  (pig/dump x))
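In other words, only bindings in effect when pig/dump actually runs matter. The distinction can be sketched in plain Clojure with a delay standing in for the query tree (hypothetical names, an analogy rather than pigpen internals):

```clojure
(def ^:dynamic *limit* 1000)

(defn build-query []
  ;; Like pig/load-tsv: returns a deferred computation; reads nothing yet.
  (delay *limit*))

(defn run-query [q]
  ;; Like pig/dump: this is where *limit* is actually consulted.
  @q)

;; Binding around only the build step has no effect:
(run-query (binding [*limit* 100000] (build-query)))   ;; => 1000

;; Binding around the run step is what counts:
(binding [*limit* 100000] (run-query (build-query)))   ;; => 100000
```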

Let me know if that works for you.

-Matt


mbossenbroek commented 10 years ago

After thinking about this some more, I'm going to change the default to be unlimited and add this as an option to limit it only if you need it.

mbossenbroek commented 9 years ago

Fixed entirely in #61. This binding is no longer necessary.