hannesmiller commented 7 years ago

Overview

The following EEL usage patterns describe reading and writing data to and from various file formats and storage systems.

A typical use case is moving data from an RDBMS such as Oracle into Hive via the JDBCSource and HiveSink respectively. With the EEL Scala API this is a trivial exercise of no more than a dozen lines of code.

Note that the source -> sink pattern examples below are not exhaustive.

EEL

The API supports most of the native types of a given storage file format, for both sources and sinks.
The API is a thin abstraction layer over the native storage file format APIs such as Parquet, Orc and Hive.
APIs such as Kite are bound to the primitive types supported by the underlying AVRO schema, which at the time of writing does not directly support Decimals and Timestamps. EEL, by contrast, supports a type whenever the underlying native API (Parquet, Orc, etc.) supports it.
Parallelism is at the sink level and controlled by configuring a number of threads - for each assigned thread the sink writes out a part file (a portion of the data).
Only a single database connection is required for a JDBCSource, economizing on scarce database connection resources. By comparison, if Sqoop is configured with 20 mappers then 20 database connections are allocated on the server.
Hive Partition keys are inferred by mapping the source (e.g. JDBCSource) column names to partition keys defined in the Hive Metastore.
The Hive sink automatically performs partition splitting of the data into their respective file buckets - no additional logic required.
Reduced startup time due to no dependency on the YARN resource manager.
Out-of-the-box support for several Source -> Sink combinations, without requiring an intermediate transformation layer.
Transformation is also supported via the EEL map function on the source’s underlying frame.
Unlike Kite, EEL requires no external metadata for its datasets - Kite holds additional metadata describing its dataset on disk (HDFS).
The upcoming EEL 1.2 release will have an EEL CLI supporting common tasks such as importing/exporting data, Hive DDL generation, Data Compaction, etc...
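
As a rough illustration of the partition splitting described in the list above, the following plain-Scala sketch groups rows by the values of their partition-key columns, so that each group would land in its own partition directory. It does not use the EEL API and is not how the Hive sink is implemented; `Record`, `partitionPath`, `split` and the column names are hypothetical and exist only to show the idea.

```scala
object PartitionSplitSketch {
  // Hypothetical row type: column name -> value (not an EEL class)
  type Record = Map[String, String]

  // Build a Hive-style partition path such as "country=UK/year=2017"
  def partitionPath(record: Record, partitionKeys: Seq[String]): String =
    partitionKeys.map(k => s"$k=${record(k)}").mkString("/")

  // Group records by the partition path they belong to
  def split(records: Seq[Record], partitionKeys: Seq[String]): Map[String, Seq[Record]] =
    records.groupBy(r => partitionPath(r, partitionKeys))

  def main(args: Array[String]): Unit = {
    val records = Seq(
      Map("name" -> "Fred", "country" -> "UK", "year" -> "2017"),
      Map("name" -> "Gary", "country" -> "UK", "year" -> "2017"),
      Map("name" -> "Alice", "country" -> "US", "year" -> "2016")
    )
    // Print each partition path with the names of the rows assigned to it
    split(records, Seq("country", "year")).foreach { case (path, rs) =>
      println(s"$path -> ${rs.map(_("name")).mkString(", ")}")
    }
  }
}
```

In this sketch the partition keys are passed in explicitly; in EEL they would come from the partition keys defined in the Hive Metastore.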

sksamuel commented 7 years ago

I've merged your docs into the main readme, and removed from this issue the bits that have been done.

hannesmiller commented 7 years ago

Maps in Parquet

EEL supports Parquet MAPS of any primitive type including structs. The following example extends the previous examples by making PHONE_NUMBERS a Map:

The key of the Map is a string; in the following examples the keys are home and mobile.
The value is simply the phone number.

Writing with a MAP

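    import java.sql.Timestamp
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    // EEL imports; package names assumed for EEL 1.x
    import io.eels.Frame
    import io.eels.component.parquet.{ParquetSink, ParquetSource}
    import io.eels.schema._

    val parquetFilePath = new Path("hdfs://nameservice1/client/eel_map/person.parquet")
    implicit val hadoopConfiguration = new Configuration()
    implicit val hadoopFileSystem = FileSystem.get(hadoopConfiguration)
    // Create the schema with a STRUCT and a MAP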
    val personDetailsStruct = Field.createStructField("PERSON_DETAILS",
      Seq(
        Field("NAME", StringType),
        Field("AGE", IntType.Signed),
        Field("SALARY", DecimalType(Precision(38), Scale(5))),
        Field("CREATION_TIME", TimestampMillisType)
      )
    )
    val schema = StructType(personDetailsStruct, Field("PHONE_NUMBERS", MapType(StringType, StringType)))

    // Create 3 rows
    val rows = Vector(
      Vector(Vector("Fred", 50, BigDecimal("50000.99000"), new Timestamp(System.currentTimeMillis())), Map("home" -> "322", "mobile" -> "987")),
      Vector(Vector("Gary", 50, BigDecimal("20000.34000"), new Timestamp(System.currentTimeMillis())), Map("home" -> "145", "mobile" -> "082")),
      Vector(Vector("Alice", 50, BigDecimal("99999.98000"), new Timestamp(System.currentTimeMillis())), Map("home" -> "534", "mobile" -> "129"))
    )

    // Write the rows
    Frame.fromValues(schema, rows)
      .to(ParquetSink(parquetFilePath))

If you have parquet-tools installed on your system, you can inspect the file's native schema like so:

$ parquet-tools schema person.parquet
message eel_schema {
  optional group PERSON_DETAILS {
    optional binary NAME (UTF8);
    optional int32 AGE;
    optional fixed_len_byte_array(16) SALARY (DECIMAL(38,5));
    optional int96 CREATION_TIME;
  }
  optional group PHONE_NUMBERS (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional binary value (UTF8);
    }
  }
}

Notice that PHONE_NUMBERS is represented in Parquet as a MAP (an optional group) consisting of a repeated group whose fields, key and value, are both of type UTF8 (string).
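
To make that layout more concrete, here is a small plain-Scala sketch that models one entry of the repeated key_value group and converts a Scala Map to and from a sequence of such entries. It is conceptual only: it is not how EEL or Parquet serialize maps, and `KeyValue`, `toKeyValues` and `fromKeyValues` are hypothetical names. Dropping entries without a value on the way back is a choice made for this sketch.

```scala
object ParquetMapLayoutSketch {
  // Mirrors one entry of the repeated key_value group:
  // the key is required, the value is optional (it may be absent)
  final case class KeyValue(key: String, value: Option[String])

  // Flatten a Scala Map into key_value style entries
  def toKeyValues(m: Map[String, String]): Seq[KeyValue] =
    m.toSeq.map { case (k, v) => KeyValue(k, Option(v)) }

  // Rebuild the Map from its entries, dropping entries that have no value
  def fromKeyValues(kvs: Seq[KeyValue]): Map[String, String] =
    kvs.collect { case KeyValue(k, Some(v)) => k -> v }.toMap

  def main(args: Array[String]): Unit = {
    val phoneNumbers = Map("home" -> "322", "mobile" -> "987")
    val entries = toKeyValues(phoneNumbers)
    entries.foreach(println)
    // The round trip gives back the original Map
    assert(fromKeyValues(entries) == phoneNumbers)
  }
}
```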

Read back the rows via ParquetSource

    ParquetSource(parquetFilePath)
      .toFrame()
      .collect()
      .foreach(row => println(row))

The results

[PERSON_DETAILS = WrappedArray(Fred, 50, 50000.99000, 2017-02-03 11:29:52.56),PHONE_NUMBERS = Map(home -> 322, mobile -> 987)]
[PERSON_DETAILS = WrappedArray(Gary, 50, 20000.34000, 2017-02-03 11:29:52.56),PHONE_NUMBERS = Map(home -> 145, mobile -> 082)]
[PERSON_DETAILS = WrappedArray(Alice, 50, 99999.98000, 2017-02-03 11:29:52.56),PHONE_NUMBERS = Map(home -> 534, mobile -> 129)]

Looking at the Parquet file through Hive

On the Parquet file just written we can create a Hive External table pointing at the HDFS location of the file.

CREATE EXTERNAL TABLE IF NOT EXISTS `eel_test.struct_map_person_phone`(
   PERSON_DETAILS STRUCT<NAME:String, AGE:Int, SALARY:decimal(38,5), CREATION_TIME:TIMESTAMP>,
   PHONE_NUMBERS Map<String,String>
)
ROW FORMAT SERDE
   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/client/eel_map';

The location /client/eel_map is the root directory where all the files live - in this case it's the root folder of the Parquet write.

Here's a Hive session showing the select:

hive> select * from eel_test.struct_map_person_phone;
OK
{"NAME":"Fred","AGE":50,"SALARY":50000.99,"CREATION_TIME":"2017-02-03 12:29:52.56"}     {"home":"322","mobile":"987"}
{"NAME":"Gary","AGE":50,"SALARY":20000.34,"CREATION_TIME":"2017-02-03 12:29:52.56"}     {"home":"145","mobile":"082"}
{"NAME":"Alice","AGE":50,"SALARY":99999.98,"CREATION_TIME":"2017-02-03 12:29:52.56"}    {"home":"534","mobile":"129"}
Time taken: 1.27 seconds, Fetched: 3 row(s)
hive>

Here's another Hive query asking for Alice and Gary's age and HOME phone number:

hive> select person_details.name, person_details.age, phone_numbers['home']
    > from eel_test.struct_map_person_phone
    > where person_details.name in ('Alice', 'Gary' );
OK
Gary    50      145
Alice   50      534
Time taken: 0.183 seconds, Fetched: 2 row(s)
hive>

HiveQL has some nice features for cracking nested types - the query returns scalar values for name and age in the person_details structure and HOME phone numbers from the phone_numbers MAP.
The same query is supported in Spark via HiveContext, or via SparkSession in versions >= 2.x.
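
A minimal sketch of running that query from Spark 2.x might look like the following. It assumes a Spark installation configured against the same Hive Metastore and that the table eel_test.struct_map_person_phone already exists; the object name and application name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SparkMapQuerySketch {
  // Same HiveQL as in the Hive session above
  val query: String =
    """select person_details.name, person_details.age, phone_numbers['home']
      |from eel_test.struct_map_person_phone
      |where person_details.name in ('Alice', 'Gary')""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("eel-map-query") // hypothetical application name
      .enableHiveSupport()      // lets Spark resolve tables via the Hive Metastore
      .getOrCreate()

    // Run the query and print the resulting rows
    spark.sql(query).show()
    spark.stop()
  }
}
```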

What if I want to look at all MOBILE phone numbers:

hive> select person_details.name, person_details.age, phone_numbers['mobile']
    > from eel_test.struct_map_person_phone;
OK
Fred    50      987
Gary    50      082
Alice   50      129
Time taken: 0.079 seconds, Fetched: 3 row(s)
hive>

To retrieve a specific map element, HiveQL requires the map key as a string, e.g. phone_numbers['mobile'].

Query to show name, age, home number and mobile number from the phone_numbers map:

hive> select person_details.name, person_details.age, phone_numbers['home'], phone_numbers['mobile']
    > from eel_test.struct_map_person_phone;
OK
Fred    50      322     987
Gary    50      145     082
Alice   50      534     129
Time taken: 0.076 seconds, Fetched: 3 row(s)
hive>
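
For comparison, the map lookups used in the Hive queries above can be mimicked in plain Scala. The sketch below uses an in-memory stand-in for the table, not the EEL or Hive APIs; `Person`, `people` and `phone` are hypothetical names. A lookup returns an Option, so a missing key yields None where HiveQL would return NULL.

```scala
object MapLookupSketch {
  // Hypothetical in-memory stand-in for one row of the table
  final case class Person(name: String, age: Int, phoneNumbers: Map[String, String])

  val people: Seq[Person] = Seq(
    Person("Fred", 50, Map("home" -> "322", "mobile" -> "987")),
    Person("Gary", 50, Map("home" -> "145", "mobile" -> "082")),
    Person("Alice", 50, Map("home" -> "534", "mobile" -> "129"))
  )

  // Equivalent of phone_numbers['key']: a missing key yields None
  def phone(p: Person, key: String): Option[String] = p.phoneNumbers.get(key)

  def main(args: Array[String]): Unit =
    people.foreach { p =>
      val home = phone(p, "home").getOrElse("NULL")
      val mobile = phone(p, "mobile").getOrElse("NULL")
      // Tab-separated: name, age, home number, mobile number
      println(s"${p.name}\t${p.age}\t$home\t$mobile")
    }
}
```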

hannesmiller commented 7 years ago

Sam could you merge in the last comment about MAPS into the MD - should come straight after ARRAY section.

51zero / eel-sdk

EEL Documentation Draft #221
