Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Iceberg table maintenance/compaction within AWS #5997

Closed vshel closed 1 year ago

vshel commented 2 years ago

Query engine

Spark3

Question

Hello, I have a ~6 TB Iceberg table with ~10,000 partitions in S3, and I am using the Glue catalog. What is the correct way of running compaction on such a table?

From the documentation (https://iceberg.apache.org/docs/latest/maintenance/) I can run:

SparkActions
    .get()
    .rewriteDataFiles(table)
    .filter(Expressions.equal("date", "2020-08-18"))
    .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB
    .execute();

This is going to execute on a single AWS instance. How do I scale this to many instances so that compaction runs in parallel on many partitions at once? Is there out-of-the-box support for this within AWS, or do I need to create my own Spark cluster? I am not familiar with Spark at this point.

My biggest concern: the table is constantly updated. Am I supposed to pause all updates until compaction finishes?

Thank you.

ismailsimsek commented 2 years ago

@vshel any reason you are not using Athena to do compaction?

vshel commented 2 years ago

@ismailsimsek I tried running OPTIMIZE with Athena on a partition with ~25,000 files totalling 2.6 GB (so a pretty small dataset). It failed with an internal error after 8 minutes. I created a support ticket for AWS to investigate, but it's not looking promising, considering the whole table is 6 TB.

Additionally, after experimenting, Athena read performance is horrible unless I run compaction first. I tested a small 25 MB dataset: it takes Athena 50 seconds to retrieve 100,000 records from this 25 MB Iceberg table or to do a COUNT(*), and after compaction the same retrieval and count operations take 8 seconds. Every file in the dataset has a corresponding delete, because I am doing upserts of streaming data. So it looks like upserting (delete + write) slows down Athena read performance, and compaction fixes it by removing the deletes. I also tested without deletes, doing only writes while streaming the same 25 MB dataset, and read performance was 8 seconds even without running compaction.

So, Iceberg read performance in Athena is looking very slow, considering non-Iceberg Athena tables that span 60 GB of data can run a COUNT(*) in just 4 seconds, compared to Iceberg's 8 seconds for 25 MB.

Samrose-Ahmed commented 1 year ago

I would recommend running a Spark job. An AWS Glue job is the easiest way to get started, but since you're running this once, it'll likely be cheaper to run on EMR (serverless or provisioned). Also, Spark on EMR doesn't run on a single instance; it parallelizes the work across nodes.

In the future, since you're doing streaming appends/inserts, I would recommend doing regular table maintenance so you don't end up in this situation. You can check this blog post, Automated Iceberg table maintenance on AWS, for how we do it in Matano, but it's fairly simple: you just need to run compaction regularly.
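The regular maintenance described above can be sketched as a small Spark job submitted to Glue or EMR. This is a hedged sketch, not Matano's actual implementation: it assumes a Spark session already configured with the Glue catalog, and `glue_catalog.db.events` is a placeholder table name.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.Spark3Util;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class IcebergMaintenance {
    public static void main(String[] args) throws Exception {
        // Assumes the session is already configured with the Glue catalog
        // (spark.sql.catalog.glue_catalog = org.apache.iceberg.spark.SparkCatalog, etc.).
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-maintenance")
            .getOrCreate();

        // "glue_catalog.db.events" is a hypothetical table identifier.
        Table table = Spark3Util.loadIcebergTable(spark, "glue_catalog.db.events");

        // Compact small files into ~500 MB targets; the work is distributed
        // across the cluster's executors, not a single instance.
        SparkActions.get(spark)
            .rewriteDataFiles(table)
            .option("target-file-size-bytes", Long.toString(500L * 1024 * 1024))
            .execute();

        // Expire old snapshots so the files replaced by compaction can
        // eventually be removed from S3.
        SparkActions.get(spark)
            .expireSnapshots(table)
            .expireOlderThan(System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000)
            .execute();
    }
}
```

Scheduling this on a cadence (e.g. an hourly or daily Glue trigger) keeps file counts low so reads never degrade the way described above.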

maswin commented 1 year ago

You have to set this option to a higher value (the default is 1) to rewrite multiple file groups in parallel:

max-concurrent-file-group-rewrites

https://iceberg.apache.org/javadoc/1.1.0/org/apache/iceberg/actions/RewriteDataFiles.html#MAX_CONCURRENT_FILE_GROUP_REWRITES

Not sure why this is not documented.
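Building on the option above, here is a hedged sketch of how it might be set. The option names come from the linked RewriteDataFiles javadoc; the chosen values and the `spark`/`table` variables are assumptions.

```java
// Rewrite up to 20 file groups concurrently instead of the default 1.
// partial-progress.enabled commits finished groups incrementally rather than
// in one big commit at the end, which also shrinks the window for commit
// conflicts with concurrent writers on a constantly updated table.
SparkActions.get(spark)
    .rewriteDataFiles(table)
    .option("max-concurrent-file-group-rewrites", "20")
    .option("partial-progress.enabled", "true")
    .option("target-file-size-bytes", Long.toString(500L * 1024 * 1024))
    .execute();
```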

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 1 year ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.