Tomme / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
Apache License 2.0

S3 Throttling for Big tables #91

Open prabavnc opened 2 years ago

prabavnc commented 2 years ago

Hey,

Recently we have been facing S3 throttling issues while creating external tables that basically select data from other big tables, like:

create table new_table
with (
    external_location='s3://{{ bucket_name }}/{{ invocation_id }}',
    format='parquet'
)
as
select *
from bigger_table
join other_big_table on some_join_condition
where some_logic

Error committing write to Hive com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown)

Has anyone faced this issue in dbt, and how did you solve it?

Thanks in advance

brianamaral commented 2 years ago

Hi, prabavnc. I faced a similar issue. Because Athena executes writes in a distributed fashion, it tends to produce a lot of small files by default, which means a lot of S3 I/O and network traffic; since S3 throttles requests per prefix (on the order of 3,500 write requests per second), enough small writes will trigger the SlowDown error. One way to work around this is to use the bucketed_by and bucket_count configs.

Example:

{# assumes bucket_name is passed in via --vars; note you cannot nest {{ }} inside config() #}
{{ config(
    materialized='table',
    format='parquet',
    external_location='s3://' ~ var('bucket_name') ~ '/' ~ invocation_id,
    bucketed_by=['some_column'],
    bucket_count=1
) }}

select *
from bigger_table
join other_big_table on some_join_condition
where some_logic

Instead of Athena producing a lot of small files, it will produce only as many files as you defined in bucket_count.
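
To confirm the effect, you can count the files backing the table directly: Athena exposes a "$path" pseudo-column holding the S3 object each row was read from. A quick check, assuming the model above was built as new_table:

-- each distinct "$path" value is one S3 object backing the table;
-- for an unpartitioned table this should match bucket_count
select count(distinct "$path") as file_count
from new_table

Keep in mind that bucket_count=1 collapses everything into a single file (per partition, if any), which can make the write itself slow for very large results, so a moderate bucket_count may balance file count against write parallelism.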

Hope that helps. It took me a long time to realize this detail.