Netflix / PigPen

Map-Reduce for Clojure
Apache License 2.0

raw/load$ should omit the 'AS' clause if fields are empty #26

Status: Closed (mping closed this issue 10 years ago)

mping commented 10 years ago

The use case is loaders that carry their own schema, such as Parquet. In Pig, any of these works:

RAW_DATA = LOAD 'parquet.gz/' USING parquet.pig.ParquetLoader();
--or
RAW_DATA = LOAD 'parquet.gz/' USING parquet.pig.ParquetLoader('contentHost:chararray');
--or
RAW_DATA = LOAD 'parquet.gz/' USING parquet.pig.ParquetLoader('contentHost:chararray') AS (contentHost:chararray);
--
DESCRIBE RAW_DATA; -- will work properly with any since we have metadata

But passing an empty field vector to raw/load$:

(raw/load$
  location
  '[] ;; these are the fields this loader returns
  storage
  opts)
(raw/bind$
...

generates this:

load20 = LOAD '/path/to/data/'
    USING MyComplexStorage('name', 'address', 'phone')
    AS ();

Since the loader handles the schema, there's no need for the AS clause.

mbossenbroek commented 10 years ago

Add {:implicit-schema true} to the opts. That should prevent it from using a schema in the script.

-Matt

In reply to Miguel Ping's message of Thursday, April 3, 2014 at 11:20 AM.

mbossenbroek commented 10 years ago

You'll still need to tell it which fields the ParquetLoader will return, so that the next command can reference them. PigPen doesn't inspect the loader's code, so with an empty field list it will assume the loader returns no usable fields, and the next operation won't have anything to work with.
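
Taken together, the two replies suggest keeping the field list (so later commands can reference the fields) and setting :implicit-schema in the opts (so the script carries no AS clause). The sketch below is a rough model of that reading, not PigPen's actual code: `load->pig` and the namespace name are hypothetical, made up only to show how a field list and the flag might interact when a LOAD statement is rendered. The path and ParquetLoader arguments are copied from the original post.

```clojure
(ns example.load-sketch
  (:require [clojure.string :as str]))

;; Hypothetical model of script generation; NOT PigPen's implementation.
(defn load->pig
  "Render a Pig LOAD statement. The AS clause is left out when the
  loader supplies its own schema (:implicit-schema is truthy) or when
  there are no fields to declare."
  [id location fields storage {:keys [implicit-schema]}]
  (str id " = LOAD '" location "' USING " storage
       (when (and (seq fields) (not implicit-schema))
         (str " AS (" (str/join ", " fields) ")"))
       ";"))

;; The field is still declared to the caller, but the flag keeps it
;; out of the generated script.
(println
  (load->pig "load20" "parquet.gz/" '[contentHost]
             "parquet.pig.ParquetLoader('contentHost:chararray')"
             {:implicit-schema true}))
```

In this model an empty field list also drops the AS clause, which is the behavior the issue title asks for; whether PigPen itself does that is not confirmed anywhere in this thread.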