51zero / eel-sdk

Big Data Toolkit for the JVM
Apache License 2.0
145 stars 35 forks source link

EEL Documentation Draft #221

Closed hannesmiller closed 6 years ago

hannesmiller commented 7 years ago

Overview

The following EEL usage patterns describe reading and writing data to and from various file formats and storage systems.

A typical use case is sourcing data from a RDBMS system like Oracle to Hive via the JDBCSource and HiveSink respectively - a trivial exercise (no more than a dozen lines of code) using the EEL Scala API.

Note the below source -> sink patterns examples are NOT completely exhaustive.

EEL

sksamuel commented 7 years ago

I've merged your docs into the main readme, and removed from this issue the bits that have been done.

hannesmiller commented 7 years ago

Maps in Parquet

EEL supports Parquet MAPS of any primitive type including structs. The following example extends the previous examples by making PHONE_NUMBERS a Map:

Writing with a MAP

    val parquetFilePath = new Path("hdfs://nameservice1/client/eel_map/person.parquet")
    implicit val hadoopConfiguration = new Configuration()
    implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration) 
   // Create the schema with a STRUCT and an ARRAY
    val personDetailsStruct = Field.createStructField("PERSON_DETAILS",
      Seq(
        Field("NAME", StringType),
        Field("AGE", IntType.Signed),
        Field("SALARY", DecimalType(Precision(38), Scale(5))),
        Field("CREATION_TIME", TimestampMillisType)
      )
    )
    val schema = StructType(personDetailsStruct, Field("PHONE_NUMBERS", MapType(StringType, StringType)))

    // Step 3: Create 3 rows
    val rows = Vector(
      Vector(Vector("Fred", 50, BigDecimal("50000.99000"), new Timestamp(System.currentTimeMillis())), Map("home" -> "322", "mobile" -> "987")),
      Vector(Vector("Gary", 50, BigDecimal("20000.34000"), new Timestamp(System.currentTimeMillis())), Map("home" -> "145", "mobile" -> "082")),
      Vector(Vector("Alice", 50, BigDecimal("99999.98000"), new Timestamp(System.currentTimeMillis())), Map("home" -> "534", "mobile" -> "129"))
    )

    // Write the rows
    Frame.fromValues(schema, rows)
      .to(ParquetSink(parquetFilePath))

Read back the rows via ParquetSource

    ParquetSource(parquetFilePath)
      .toFrame()
      .collect()
      .foreach(row => println(row))

The results

[PERSON_DETAILS = WrappedArray(Fred, 50, 50000.99000, 2017-02-03 11:29:52.56),PHONE_NUMBERS = Map(home -> 322, mobile -> 987)] [PERSON_DETAILS = WrappedArray(Gary, 50, 20000.34000, 2017-02-03 11:29:52.56),PHONE_NUMBERS = Map(home -> 145, mobile -> 082)] [PERSON_DETAILS = WrappedArray(Alice, 50, 99999.98000, 2017-02-03 11:29:52.56),PHONE_NUMBERS = Map(home -> 534, mobile -> 129)]

Looking at the Parquet file through Hive

On the Parquet file just written we can create a Hive External table pointing at the HDFS location of the file.

CREATE EXTERNAL TABLE IF NOT EXISTS `eel_test.struct_map_person_phone`(
   PERSON_DETAILS STRUCT<NAME:String, AGE:Int, SALARY:decimal(38,5), CREATION_TIME:TIMESTAMP>,
   PHONE_NUMBERS Map<String,String>
)
ROW FORMAT SERDE
   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/client/eel_map';

Here's a Hive session show the select:

hive> select * from eel_test.struct_map_person_phone;
OK
{"NAME":"Fred","AGE":50,"SALARY":50000.99,"CREATION_TIME":"2017-02-03 12:29:52.56"}     {"home":"322","mobile":"987"}
{"NAME":"Gary","AGE":50,"SALARY":20000.34,"CREATION_TIME":"2017-02-03 12:29:52.56"}     {"home":"145","mobile":"082"}
{"NAME":"Alice","AGE":50,"SALARY":99999.98,"CREATION_TIME":"2017-02-03 12:29:52.56"}    {"home":"534","mobile":"129"}
Time taken: 1.27 seconds, Fetched: 3 row(s)
hive>

Here's another Hive query asking for Alice and Gary's age and HOME phone number:

hive> select person_details.name, person_details.age, phone_numbers['home']
    > from eel_test.struct_map_person_phone
    > where person_details.name in ('Alice', 'Gary' );
OK
Gary    50      145
Alice   50      534
Time taken: 0.183 seconds, Fetched: 2 row(s)
hive>

What if I want to look at all MOBILE phone number:

hive> select person_details.name, person_details.age, phone_numbers['mobile']
    > from eel_test.struct_map_person_phone;
OK
Fred    50      987
Gary    50      082
Alice   50      129
Time taken: 0.079 seconds, Fetched: 3 row(s)
hive>

Query to show name, age, home number and mobile number from the phone_numbers map

hive> select person_details.name, person_details.age, phone_numbers['home'], phone_numbers['mobile']
    > from eel_test.struct_map_person_phone;
OK
Fred    50      322     987
Gary    50      145     082
Alice   50      534     129
Time taken: 0.076 seconds, Fetched: 3 row(s)
hive>
hannesmiller commented 7 years ago

Sam could you merge in the last comment about MAPS into the MD - should come straight after ARRAY section.