aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0
101 stars 69 forks source link

Temporary table file format to parquet #409

Open johnnyC9000 opened 4 months ago

johnnyC9000 commented 4 months ago

Describe the feature

Update the glue__create_tmp_table_as macro so that the file format uses parquet instead of flat file.

Describe alternatives you've considered

I have researched methods to decrease the amount of files created as the result of an insert overwrite incremental operation but none have been successful, except with the change suggested above. Use of the AQE to coalesce files with a configuration option such as spark.sql.adaptive.coalescePartitions.minPartitionSize or spark.sql.adaptive.coalescePartitions.minPartitionNum will not work as the final insert operation does not use shuffle partitions.

Additional context

Internal testing decreased the number of files created in a new partition by about 4x when the temporary table was created with parquet. Examining execution plans in SparkUI showed a lesser number of tasks generated on the scan from the temporary table during the insert into the base table. The tasks also packed in more rows versus when the temporary table was flatfile based. Flat file based: 896 tasks w/ ~125k records each Parquet based: 224 tasks w/ ~468k records each

Who will this benefit?

This should benefit anyone that utilizes insert overwrite incrementals.

Are you interested in contributing this feature?

I need to determine if I can submit a PR.