Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0
566 stars 55 forks source link

Incorrect script generation with large number of fields (parquet) #174

Closed BrunoBonacci closed 8 years ago

BrunoBonacci commented 8 years ago

Hi,

one of my datasets has close to 700 fields, however the write-script function only generates a smaller number. This is due to the Clojure behaviour of print and str function which depends on *print-length*.

When working with the REPL you typically have a (set! *print-length* 100) in your .lein/profiles.clj This causes long sequences (and infinite lazy sequences) to only print the first 100 items.

Clojure print functions (and str) obey to this setting. So for example the following code will generate a incorrect output:

  (require '[pigpen.pig])
  (require '[pigpen.core :as pig])
  (require '[pigpen.parquet :as pqt])
  (require '[clojure.string :as str])

  (spit "/tmp/test.tsv"
        (->> (range 1 701) (map str) (str/join "\t") (repeat 10) (str/join "\n") (#(str % "\n"))))

  (defn schema700 []
    (apply pqt/message "big-schema"
           (doall
            (for [i (range 700)]
              (pqt/binary (str "value" (inc i)))))))

  (defn copy-to-parquet
    [input output]
    (->> (pig/load-tsv input)
         (pqt/store-parquet
          output
          (schema700))))

  (copy-to-parquet "/tmp/test.tsv" "/tmp/test.pq")

  (set! *print-length* 100)

  (pigpen.pig/write-script
   "my-script.pig"
   (copy-to-parquet "/tmp/test.tsv" "/tmp/test.pq"))

now the output script will look like:

DEFINE udf34778 pigpen.PigPenFn('(clojure.core/require (quote [pigpen.runtime]) (quote [pigpen.extensions.core]))','(clojure.core/comp (pigpen.runtime/process->bind (pigpen.runtime/pre-process :pig :native)) (pigpen.runtime/map->bind (clojure.core/fn [s] (if s (pigpen.extensions.core/structured-split s "\\t")))) (pigpen.runtime/keyword-field-selector->bind [:value1 :value2 :value3 :value4 :value5 :value6 :value7 :value8 :value9 :value10 :value11 :value12 :value13 :value14 :value15 :value16 :value17 :value18 :value19 :value20 :value21 :value22 :value23 :value24 :value25 :value26 :value27 :value28 :value29 :value30 :value31 :value32 :value33 :value34 :value35 :value36 :value37 :value38 :value39 :value40 :value41 :value42 :value43 :value44 :value45 :value46 :value47 :value48 :value49 :value50 :value51 :value52 :value53 :value54 :value55 :value56 :value57 :value58 :value59 :value60 :value61 :value62 :value63 :value64 :value65 :value66 :value67 :value68 :value69 :value70 :value71 :value72 :value73 :value74 :value75 :value76 :value77 :value78 :value79 :value80 :value81 :value82 :value83 :value84 :value85 :value86 :value87 :value88 :value89 :value90 :value91 :value92 :value93 :value94 :value95 :value96 :value97 :value98 :value99 :value100 ...]) (pigpen.runtime/process->bind (pigpen.runtime/post-process :pig :native)))');

Go at the end of the above line and notice the ... after :value100.

So only the first 100 values were printed.

Now a workaround for this issue is to wrap the write-script in a binding like:

(binding [*print-length* false]
    (pigpen.pig/write-script
     "my-script.pig"
     (copy-to-parquet "/tmp/test.tsv" "/tmp/test.pq")))

But I feel this should be handled by the generate-script directly. Additionally you might want to check the *print-level* as well.

[1] - https://clojuredocs.org/clojure.core/*print-length [2] - https://clojuredocs.org/clojure.core/print-level*

mbossenbroek commented 8 years ago

That's an interesting one - I haven't used those settings before. It sounds like the default behavior is ok (no limits), but if you have them set then it's not explicitly unsetting them before script generation.

It should be relatively straightforward to explicitly bind them to false for script generation. Let me take a quick stab at that.

BrunoBonacci commented 8 years ago

yes, you are right. By default are set nil but anyone who is using the REPL do develop on Clojure has these set in the profiles.clj to avoid to hang the REPL on a infinite sequence.

As mentioned as well the workaround of the binding form works well

(binding [*print-length* false]
    (write-script .....))

But it would be better that the generate-script does it internally, so that people who have large number of fields won't stumble across the same issue.

mbossenbroek commented 8 years ago

Released in v0.3.3

BrunoBonacci commented 8 years ago

Tested, it works. many thanks for the quick response.