The extreme_weather.sql file in the Hive examples contains a Hive query that reads 10843 small text data files, each with a header. EMR versions >= 6.6.0 support splitting text files that have a header/footer (Ref: HIVE-21924). With this support and the default input format (org.apache.hadoop.hive.ql.io.HiveInputFormat), a single thread in the Tez AM reads all the data files during split computation. As a result, split computation takes ~1.5 hrs for this query on EMR versions >= 6.6.0. Using CombineHiveInputFormat and configuring the split size solves this problem.
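For reviewers, a minimal sketch of the kind of session settings this change relies on, placed at the top of the query file. The property names are standard Hive/Hadoop settings, but the split size value below is an illustrative assumption, not necessarily the value used in this PR, and which input-format property takes effect depends on the execution engine.

```sql
-- Sketch only: the values below are assumptions for illustration.

-- Combine many small files into fewer splits instead of one split per file.
-- On Tez, hive.tez.input.format is the property that is consulted;
-- hive.input.format is included for completeness.
SET hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Upper bound on the size of a combined split, in bytes (128 MB here, an assumed value).
SET mapreduce.input.fileinputformat.split.maxsize=134217728;
```

A larger maximum split size yields fewer, larger splits; the right value depends on the total data volume and the cluster size.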
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.