Open prabavnc opened 2 years ago
Hi, prabavnc. I faced a similar issue. Because of Athena's distributed execution, writing data into a bucket with Athena tends to produce a lot of small files by default, which causes a lot of I/O and network overhead. One way to work around this is to use the bucketed_by and bucket_count configs.
Example:
{{ config(materialized='table', format='parquet', external_location='s3://{{ bucket_name }}/{{ invocation_id }}', bucketed_by=['some_column'], bucket_count=1) }}
select *
from bigger_table
join other_big_table
where some_logic
Instead of Athena producing a lot of small files, it will produce as many files as you define in bucket_count.
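If you want to confirm how many files a build actually produced, one quick check (assuming the Athena/Presto engine, which exposes each row's backing S3 object through the "$path" pseudo-column; my_bucketed_table is a placeholder for your model's table name) is:

```sql
-- Count the distinct S3 objects backing the table.
-- "$path" is Athena's pseudo-column for a row's source file.
select count(distinct "$path") as file_count
from my_bucketed_table;
```

With bucket_count=1 as in the config above, this should come back close to 1 per partition.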
Hope that helps. It took me a long time to realize this detail.
Hey,
Recently we have been facing S3 throttling issues while creating external tables that basically select data from another big table, like:
Error committing write to Hive com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown)
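For context, the models that fail are essentially plain CTAS statements over a large source, along these lines (a minimal sketch; table names and the join condition are placeholders):

```sql
{{ config(materialized='table', format='parquet') }}

-- dbt-athena builds this as a CTAS; the parallel writers issue many
-- S3 PUTs against the same prefix, which appears to trigger the
-- 503 SlowDown throttling above.
select *
from big_source_table
join other_big_table
    on some_join_condition
```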
Has anyone faced this issue in dbt, and how did you solve it?
Thanks in advance