aws-samples / emr-serverless-samples

Example code for running Spark and Hive jobs on EMR Serverless.
https://aws.amazon.com/emr/serverless/
MIT No Attribution
150 stars 74 forks

Configure input format and split size in extreme_weather.sql #30

Closed ganeshashree closed 2 years ago

ganeshashree commented 2 years ago

Issue #, if available:

Description of changes:

The extreme_weather.sql file in the Hive examples contains a query that reads 10843 small text data files, each with a header. EMR versions >= 6.6.0 support splitting text files that have a header/footer (Ref: HIVE-21924). With this support and the default input format (org.apache.hadoop.hive.ql.io.HiveInputFormat), a single thread in the Tez AM reads all the data files during split computation. As a result, split computation takes ~1.5 hrs for this query on EMR versions >= 6.6.0. Using CombineHiveInputFormat and configuring the split size solves this problem.
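For illustration, a change of this kind could look like the session-level settings below, placed at the top of the query file. This is a minimal sketch, not the actual diff from this pull request: it assumes the query runs on Tez (so the input format is controlled by `hive.tez.input.format`), and the split size shown (128 MB) is a placeholder value chosen for the example, not the value used in the change.

```sql
-- Sketch only: property values are illustrative assumptions, not taken from the PR diff.

-- Combine many small files into fewer splits instead of handling each file individually.
SET hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Upper bound on the size of a combined split, in bytes (128 MB here, an assumed value).
SET mapreduce.input.fileinputformat.split.maxsize=134217728;
```

A larger maximum split size packs more small files into each split, which means fewer tasks; a smaller one gives more parallelism. The right value depends on the total data size and the capacity available to the application.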

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

ganeshashree commented 2 years ago

@dacort Please review this pull request.