Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0

Add Parquet support #44

Closed mbossenbroek closed 10 years ago

mbossenbroek commented 10 years ago

Add Parquet support to PigPen. See https://github.com/apache/incubator-parquet-mr

Thanks @mping for the idea and a lot of the initial work on this one.

Usage:

(require '[pigpen.core :as pig]
         '[pigpen.parquet.core :as pig-pq])

(->>
  (pig/return [{:x 1 :y "a"}
               {:x 2 :y "b"}
               {:x 3 :y "c"}])
  (pig-pq/store-parquet "test.pq" {:x :int, :y :chararray})
  (pig/dump))

(->>
  (pig-pq/load-parquet "test.pq/part-m-00001.parquet" {:x :int, :y :chararray})
  (pig/dump))

=> [{:y "a", :x 1} {:y "b", :x 2} {:y "c", :x 3}]

Currently a schema must be passed to both the load and store commands. Hopefully this requirement can be removed in the future, but for now the Pig Parquet loader needs it.

This also changes how local mode works for load and store commands. The new version uses the PigPenLocalLoader and PigPenLocalStorage protocols in pigpen-core/src/main/clojure/pigpen/local.clj to implement local versions of the load/store commands.
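To illustrate the general idea of protocol-based local load/store, here is a minimal sketch. It is not the code in local.clj: the protocol names (LocalLoader, LocalStorage), their methods (locations, read-records, write-records), and the EDN-lines format are all hypothetical and chosen only for illustration. The actual PigPenLocalLoader and PigPenLocalStorage protocols may have different methods and signatures.

```clojure
(ns example.local-io
  (:require [clojure.edn :as edn]
            [clojure.java.io :as io]))

;; Hypothetical protocols: names and signatures are assumptions for
;; illustration, not the ones defined in pigpen.local.
(defprotocol LocalLoader
  (locations [this] "Returns the files this loader should read.")
  (read-records [this file] "Returns the records found in one file."))

(defprotocol LocalStorage
  (write-records [this records] "Writes all records to the target."))

;; A toy format: one EDN value per line.
(defrecord EdnLinesLoader [path]
  LocalLoader
  (locations [_] [path])
  (read-records [_ file]
    (with-open [r (io/reader file)]
      ;; mapv forces the read before the reader is closed
      (mapv edn/read-string (line-seq r)))))

(defrecord EdnLinesStorage [path]
  LocalStorage
  (write-records [_ records]
    (with-open [w (io/writer path)]
      (doseq [rec records]
        ;; prn-str appends a newline, giving one record per line
        (.write w (prn-str rec))))))

;; Usage: store some records locally, then load them back.
(comment
  (write-records (->EdnLinesStorage "out.edn") [{:x 1 :y "a"} {:x 2 :y "b"}])
  (let [loader (->EdnLinesLoader "out.edn")]
    (mapcat #(read-records loader %) (locations loader))))
```

With this shape, adding a new format in local mode means supplying one loader and one storage implementation, without touching the code that drives them.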

I added a bunch of code to assist in running Hadoop and Pig classes directly. Hopefully it will be useful when adding new formats.

cc @daveray

mping commented 10 years ago

Great work @mbossenbroek, thanks!