apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Empty partition folders remain after deleting the data or dropping the table #9956

Open Tonylin1998 opened 6 months ago

Tonylin1998 commented 6 months ago

Query engine

spark

Question

I am using Iceberg with PySpark and a JDBC catalog, with the warehouse set to a GCS location.

I created a table using date as the partition key. I wrote some data into the table and then decided to delete date=20240220, so I ran:

spark.sql(f"DELETE FROM {iceberg_table} WHERE date = '20240220'")
spark.sql(f"CALL {catalog_name}.system.expire_snapshots('{iceberg_table}')")

I found that the Parquet file under date=20240220 was deleted, but the folder date=20240220 still remains.

The same happens when I drop the table using

spark.sql(f"DROP TABLE {iceberg_table} PURGE")

The data is deleted, but all the partition folders remain. This behavior leaves many empty folders in my GCS bucket. Is there anything I can do in Iceberg to prevent this from happening?

christianb93 commented 3 months ago

I can confirm the same behaviour with MinIO instead of GCS: when dropping an Iceberg table in SparkSQL using DROP TABLE ... PURGE, the data files are removed but the directory structure is not cleaned up.

gaborkaszab commented 1 month ago

I believe Iceberg does this on purpose. The reason is that Iceberg allows multiple tables to share the same location. So when you drop a table (or a partition), it is not safe to delete the entire folder, because another table might have files in it (or might want to write files into it later on).
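
If the leftover folders are a nuisance, one option is to clean them up outside Iceberg. The following is only a minimal sketch, not an Iceberg feature: it assumes the "folders" show up in an object listing as placeholder objects whose names end in "/" (an assumption about how the storage connector represents directories), and the function name `empty_dir_markers` and the example paths are hypothetical. It takes a plain list of object names and returns only the placeholders that have no file beneath them, so a folder that still holds another table's files is never reported.

```python
def empty_dir_markers(object_names):
    """Return directory placeholders (names ending in '/') with no file beneath them.

    Assumption: directories appear in the listing as objects whose names end in '/'.
    Results are ordered deepest first so children can be removed before parents.
    """
    names = set(object_names)
    markers = [n for n in names if n.endswith("/")]
    files = [n for n in names if not n.endswith("/")]

    # A placeholder is empty when no real file name starts with its prefix.
    empty = [m for m in markers if not any(f.startswith(m) for f in files)]

    # Deepest paths first, then alphabetical for a stable order.
    return sorted(empty, key=lambda n: (-n.count("/"), n))


# Hypothetical listing after deleting date=20240220 and expiring snapshots.
listing = [
    "warehouse/db/t/data/date=20240220/",
    "warehouse/db/t/data/date=20240221/",
    "warehouse/db/t/data/date=20240221/part-0.parquet",
]
print(empty_dir_markers(listing))
```

Listing the bucket and deleting the returned placeholders is left to whatever storage client is already in use. Because the check looks at every file under the prefix, it respects the shared-location concern described above, but it cannot know whether another table intends to write into an empty folder later.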