Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Table maintenance procedure (expire_snapshots) does not work as expected #10907

Open toien opened 1 month ago

toien commented 1 month ago

Query engine

Spark SQL on AWS EMR (7.1.0)

Versions:

Question

First, I created an Iceberg table like:

spark-sql (test_db)> show create table my_catalog.test_db.dws_table;
CREATE TABLE my_catalog.test_db.dws_table (
  dt STRING NOT NULL,
  brand_code STRING NOT NULL,
  event_type STRING NOT NULL,
  sub_event_type STRING NOT NULL,
  success_count INT,
  failed_count INT)
USING iceberg
LOCATION 's3://xxx/test/test_db.db/dws_table'
TBLPROPERTIES (
  'current-snapshot-id' = '3745013875610091505',
  'format' = 'iceberg/parquet',
  'format-format' = '2',
  'format-version' = '2',
  'identifier-fields' = '[dt,brand_code,sub_event_type,event_type]',
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '5',
  'write.parquet.compression-codec' = 'zstd',
  'write.upsert.enabled' = 'true')

Flink streaming jobs compute results and upsert into this table, so Flink checkpoints create many snapshots:

spark-sql (test_db)> select COUNT(*) from my_catalog.test_db.dws_table.snapshots;
2130

Here is the problem: when I run expire_snapshots from Spark SQL, the job does take time to execute

spark-sql (test_db)> CALL my_catalog.system.expire_snapshots(
                   >   table => 'test_db.dws_table',
                   >   retain_last => 5
                   > );
deleted_data_files_count        deleted_position_delete_files_count     deleted_equality_delete_files_count     deleted_manifest_files_count    deleted_manifest_lists_count    deleted_statistics_files_count
0       0       0       0       0       0
Time taken: 45.336 seconds, Fetched 1 row(s)

But nothing was deleted!

spark-sql (test_db)> select COUNT(*) from my_catalog.test_db.dws_table.snapshots;
2164

And the data files are still on S3.

The Spark job (application name: iceberg-maintenance-failed) finished successfully.

The same problem occurs when calling rewrite_data_files too: small data files are not compacted (merged).

RussellSpitzer commented 1 month ago

Expire snapshots only removes data files which are no longer needed by any remaining snapshots. The output of your command shows no files needed to be removed. Based on the fact that the snapshot count seems to have increased in your second query, I would be suspicious that you have no snapshots older than the default age limit for that command. retain_last is a "minimum", not a maximum: if your expire_snapshots call doesn't specifically state an age limit, it will only expire snapshots older than 5 days (I think, check the docs to be sure).
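In line with that, passing an explicit older_than cutoff makes the behavior deterministic. A sketch against the table from this thread (the timestamp is illustrative, not from the original report):

```sql
CALL my_catalog.system.expire_snapshots(
  table => 'test_db.dws_table',
  older_than => TIMESTAMP '2024-08-09 00:00:00',  -- illustrative cutoff
  retain_last => 5                                -- still keeps at least 5 ancestors
);
```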

The output of your bin-pack command shows it only found 2 files to compact.

Most of the time when folks have this issue it is because they don't have enough small files in a given partition to trigger compaction. By default, the command will only compact files within a partition if there are 5 or more files in that partition that need compaction. See the docs for more info
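The minimum-file threshold mentioned above can be lowered through the procedure's options map; a sketch, reusing the table from this thread (option values are illustrative):

```sql
CALL my_catalog.system.rewrite_data_files(
  table => 'test_db.dws_table',
  strategy => 'binpack',
  -- compact a partition even when it holds as few as 2 eligible files
  options => map('min-input-files', '2')
);
```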

toien commented 1 month ago

Snapshots number increased because Flink job still writing data to table.

In my opinion, it would be better for the docs to clarify the retain_last parameter's "minimum" semantics:

Number of ancestor snapshots to preserve regardless of older_than.

Summary

After doing some tests, I finally started to understand Iceberg's maintenance procedures. I hope this helps people who are new to Iceberg like me.

rewrite_data_files

rewrite_data_files is a procedure that reads the source small files, compacts them, and writes new ones. It does not delete the old small files.

Data files, at the leaf level of the Iceberg table layout, belong to manifest files. Deleting the source small files directly would break their manifest files.

This procedure optimizes data files (usually by merging) and creates a new version (snapshot) of the table.

rewrite_manifests

Unlike rewrite_data_files, rewrite_manifests replaces the old manifest files.

This procedure optimizes manifest files (usually by merging) and creates a new version (snapshot) of the table.
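A minimal invocation, reusing the table from this thread:

```sql
CALL my_catalog.system.rewrite_manifests(table => 'test_db.dws_table');
```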

expire_snapshots

Always use the older_than parameter.

If data files you expected to be deleted still remain in S3 or HDFS, recheck the metadata tables after executing the procedure; the files may still be referenced by manifests or entries.
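Those references can be checked by querying the metadata tables directly; a sketch against the table from this thread:

```sql
-- Manifests that still reference data files
SELECT path, added_data_files_count, existing_data_files_count
FROM my_catalog.test_db.dws_table.all_manifests;

-- Per-file entry status (0 = EXISTING, 1 = ADDED, 2 = DELETED)
SELECT status, snapshot_id, data_file.file_path
FROM my_catalog.test_db.dws_table.entries;
```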

Maintenance tips

Say we have a table upserted by Flink jobs, which creates a lot of data files and metadata. Executing these procedures hourly keeps the Iceberg table optimized:
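The list of commands appears to have been trimmed here; based on the procedures discussed above, an hourly maintenance run might look like:

```sql
-- 1) Compact small data files (creates a new snapshot)
CALL my_catalog.system.rewrite_data_files(table => 'test_db.dws_table');
-- 2) Merge small manifest files (also creates a new snapshot)
CALL my_catalog.system.rewrite_manifests(table => 'test_db.dws_table');
-- 3) Expire old snapshots so unreferenced files are actually deleted
CALL my_catalog.system.expire_snapshots(
  table => 'test_db.dws_table',
  older_than => TIMESTAMP '2024-08-09 00:00:00',  -- illustrative cutoff
  retain_last => 5
);
```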

For a partitioned table, say partitioned by day:
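The example here also seems to have been trimmed. Assuming a day-partitioned variant of the table from this thread, compaction can be limited to a single partition with the where parameter (the dt value is illustrative):

```sql
CALL my_catalog.system.rewrite_data_files(
  table => 'test_db.dws_table',
  where => 'dt = ''2024-08-09'''  -- compact only one day's partition
);
```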

pvary commented 1 month ago

rewrite_manifests

Unlike rewrite_data_files, rewrite_manifests replaces the old manifest files.

Actually, this procedure also just creates a new snapshot and keeps the old metadata files for the original snapshot. If you want to remove the old metadata files, you have to run the ExpireSnapshots procedure.

As you are using Flink to write data to an Iceberg table, you might want to follow https://github.com/orgs/apache/projects/358. This ongoing project aims to provide a Flink-specific solution for the problems mentioned above.