Closed hannesmiller closed 6 years ago
I've merged your docs into the main readme, and removed from this issue the bits that have been done.
EEL supports Parquet MAPS of any primitive type including structs. The following example extends the previous examples by making PHONE_NUMBERS a Map:
val parquetFilePath = new Path("hdfs://nameservice1/client/eel_map/person.parquet")
implicit val hadoopConfiguration = new Configuration()
implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration)
// Create the schema with a STRUCT and an ARRAY
val personDetailsStruct = Field.createStructField("PERSON_DETAILS",
Seq(
Field("NAME", StringType),
Field("AGE", IntType.Signed),
Field("SALARY", DecimalType(Precision(38), Scale(5))),
Field("CREATION_TIME", TimestampMillisType)
)
)
val schema = StructType(personDetailsStruct, Field("PHONE_NUMBERS", MapType(StringType, StringType)))
// Step 3: Create 3 rows
val rows = Vector(
Vector(Vector("Fred", 50, BigDecimal("50000.99000"), new Timestamp(System.currentTimeMillis())), Map("home" -> "322", "mobile" -> "987")),
Vector(Vector("Gary", 50, BigDecimal("20000.34000"), new Timestamp(System.currentTimeMillis())), Map("home" -> "145", "mobile" -> "082")),
Vector(Vector("Alice", 50, BigDecimal("99999.98000"), new Timestamp(System.currentTimeMillis())), Map("home" -> "534", "mobile" -> "129"))
)
// Write the rows
Frame.fromValues(schema, rows)
.to(ParquetSink(parquetFilePath))
$ parquet-tools schema person.parquet
message eel_schema {
optional group PERSON_DETAILS {
optional binary NAME (UTF8);
optional int32 AGE;
optional fixed_len_byte_array(16) SALARY (DECIMAL(38,5));
optional int96 CREATION_TIME;
}
optional group PHONE_NUMBERS (MAP) {
repeated group key_value {
required binary key (UTF8);
optional binary value (UTF8);
}
}
}
ParquetSource(parquetFilePath)
.toFrame()
.collect()
.foreach(row => println(row))
[PERSON_DETAILS = WrappedArray(Fred, 50, 50000.99000, 2017-02-03 11:29:52.56),PHONE_NUMBERS = Map(home -> 322, mobile -> 987)] [PERSON_DETAILS = WrappedArray(Gary, 50, 20000.34000, 2017-02-03 11:29:52.56),PHONE_NUMBERS = Map(home -> 145, mobile -> 082)] [PERSON_DETAILS = WrappedArray(Alice, 50, 99999.98000, 2017-02-03 11:29:52.56),PHONE_NUMBERS = Map(home -> 534, mobile -> 129)]
On the Parquet file just written we can create a Hive External table pointing at the HDFS location of the file.
CREATE EXTERNAL TABLE IF NOT EXISTS `eel_test.struct_map_person_phone`(
PERSON_DETAILS STRUCT<NAME:String, AGE:Int, SALARY:decimal(38,5), CREATION_TIME:TIMESTAMP>,
PHONE_NUMBERS Map<String,String>
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/client/eel_map';
hive> select * from eel_test.struct_map_person_phone;
OK
{"NAME":"Fred","AGE":50,"SALARY":50000.99,"CREATION_TIME":"2017-02-03 12:29:52.56"} {"home":"322","mobile":"987"}
{"NAME":"Gary","AGE":50,"SALARY":20000.34,"CREATION_TIME":"2017-02-03 12:29:52.56"} {"home":"145","mobile":"082"}
{"NAME":"Alice","AGE":50,"SALARY":99999.98,"CREATION_TIME":"2017-02-03 12:29:52.56"} {"home":"534","mobile":"129"}
Time taken: 1.27 seconds, Fetched: 3 row(s)
hive>
hive> select person_details.name, person_details.age, phone_numbers['home']
> from eel_test.struct_map_person_phone
> where person_details.name in ('Alice', 'Gary' );
OK
Gary 50 145
Alice 50 534
Time taken: 0.183 seconds, Fetched: 2 row(s)
hive>
hive> select person_details.name, person_details.age, phone_numbers['mobile']
> from eel_test.struct_map_person_phone;
OK
Fred 50 987
Gary 50 082
Alice 50 129
Time taken: 0.079 seconds, Fetched: 3 row(s)
hive>
hive> select person_details.name, person_details.age, phone_numbers['home'], phone_numbers['mobile']
> from eel_test.struct_map_person_phone;
OK
Fred 50 322 987
Gary 50 145 082
Alice 50 534 129
Time taken: 0.076 seconds, Fetched: 3 row(s)
hive>
Sam could you merge in the last comment about MAPS into the MD - should come straight after ARRAY section.
Overview
The following EEL usage patterns describe reading and writing data to and from various file formats and storage systems.
A typical use case is sourcing data from a RDBMS system like Oracle to Hive via the JDBCSource and HiveSink respectively - a trivial exercise (no more than a dozen lines of code) using the EEL Scala API.
Note the below source -> sink patterns examples are NOT completely exhaustive.
EEL