Tomme / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
Apache License 2.0

S3 Throttling for Big tables #91

Open prabavnc opened 2 years ago

prabavnc commented 2 years ago

Hey,

Recently we have been facing S3 throttling issues while creating external tables that basically select data from other big tables, like:

create table new_table
with (
    external_location='s3://{{ bucket_name }}/{{ invocation_id }}',
    format='parquet'
)
as
select *
from bigger_table
join other_big_table on some_join_condition
where some_logic

Error committing write to Hive com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown)

Has anyone faced this issue in dbt, and how did you solve it?

Thanks in advance

brianamaral commented 2 years ago

Hi, prabavnc. I faced a similar issue. Because Athena executes writes in a distributed fashion, it tends to produce a lot of small files by default, which means a lot of S3 I/O and network traffic; since S3 throttles requests per prefix (on the order of 3,500 write requests per second), enough small writes will trigger the SlowDown error. One way to work around this is to use the bucketed_by and bucket_count configs.

Example:

{# assumes bucket_name is passed in via --vars; note you cannot nest {{ }} inside config() #}
{{ config(
    materialized='table',
    format='parquet',
    external_location='s3://' ~ var('bucket_name') ~ '/' ~ invocation_id,
    bucketed_by=['some_column'],
    bucket_count=1
) }}

select *
from bigger_table
join other_big_table on some_join_condition
where some_logic

Instead of Athena producing a lot of small files, it will produce only as many files as you defined in bucket_count.
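
To confirm the effect, you can count the files backing the table directly: Athena exposes a "$path" pseudo-column holding the S3 object each row was read from. A quick check, assuming the model above was built as new_table:

-- each distinct "$path" value is one S3 object backing the table;
-- for an unpartitioned table this should match bucket_count
select count(distinct "$path") as file_count
from new_table

Keep in mind that bucket_count=1 collapses everything into a single file (per partition, if any), which can make the write itself slow for very large results, so a moderate bucket_count may balance file count against write parallelism.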

Hope that helps. It took me a long time to realize this detail.