Update the glue__create_tmp_table_as macro so that the file format uses parquet instead of flat file.
Describe alternatives you've considered
I have researched methods to decrease the amount of files created as the result of an insert overwrite incremental operation but none have been successful, except with the change suggested above. Use of the AQE to coalesce files with a configuration option such as spark.sql.adaptive.coalescePartitions.minPartitionSize or spark.sql.adaptive.coalescePartitions.minPartitionNum will not work as the final insert operation does not use shuffle partitions.
Additional context
Internal testing decreased the number of files created in a new partition by about 4x when the temporary table was created with parquet. Examining execution plans in SparkUI showed a lesser number of tasks generated on the scan from the temporary table during the insert into the base table. The tasks also packed in more rows versus when the temporary table was flatfile based.
Flat file based: 896 tasks w/ ~125k records each
Parquet based: 224 tasks w/ ~468k records each
Who will this benefit?
This should benefit anyone that utilizes insert overwrite incrementals.
Describe the feature
Update the
glue__create_tmp_table_as
macro so that the file format uses parquet instead of flat file.Describe alternatives you've considered
I have researched methods to decrease the amount of files created as the result of an
insert overwrite
incremental operation but none have been successful, except with the change suggested above. Use of the AQE to coalesce files with a configuration option such asspark.sql.adaptive.coalescePartitions.minPartitionSize
orspark.sql.adaptive.coalescePartitions.minPartitionNum
will not work as the final insert operation does not use shuffle partitions.Additional context
Internal testing decreased the number of files created in a new partition by about 4x when the temporary table was created with parquet. Examining execution plans in SparkUI showed a lesser number of tasks generated on the scan from the temporary table during the insert into the base table. The tasks also packed in more rows versus when the temporary table was flatfile based. Flat file based: 896 tasks w/ ~125k records each Parquet based: 224 tasks w/ ~468k records each
Who will this benefit?
This should benefit anyone that utilizes
insert overwrite
incrementals.Are you interested in contributing this feature?
I need to determine if I can submit a PR.