Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Spark cannot delete table metadata and data when dropping a table #9990

Open tomfans opened 4 months ago

tomfans commented 4 months ago

Spark 3.3.2, Iceberg 1.4, metadata managed by HMS.

For a table created by Spark/Iceberg with its metadata managed by HMS, the data and metadata directories still exist after the table is dropped with PURGE.

If the metadata is managed by HDFS, it works.
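A minimal repro sketch of the two cases (the catalog name, table name, and metastore URI below are placeholders):

```java
import org.apache.spark.sql.SparkSession;

public class DropTablePurgeRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("drop-purge-repro")
        .config("spark.sql.catalog.iceberg_prod", "org.apache.iceberg.spark.SparkCatalog")
        // "hadoop" (with a .warehouse path instead of .uri) reproduces the working case
        .config("spark.sql.catalog.iceberg_prod.type", "hive")
        .config("spark.sql.catalog.iceberg_prod.uri", "thrift://metastore-host:9083")
        .getOrCreate();

    spark.sql("CREATE TABLE iceberg_prod.db.t (id BIGINT) USING iceberg");
    spark.sql("DROP TABLE iceberg_prod.db.t PURGE");
    // type=hive:   the table directory is still on HDFS after the purge
    // type=hadoop: the table directory is removed
  }
}
```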

How should this be handled?

manuzhang commented 4 months ago

Which catalogs do you use in those two cases? Can you share the configs?

tomfans commented 4 months ago

org.apache.iceberg.spark.SparkCatalog with HMS. If I use HMS as the catalog store, I can't delete the table directories when I drop a table, even when I drop with PURGE. Software versions: Spark 3.3.2, Hive 2.3.9, Iceberg 1.4.0.

The Iceberg config is as below:

```properties
#############iceberg####################
spark.sql.catalog.iceberg_prod      = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_prod.type = hive
spark.sql.catalog.iceberg_prod.uri  = thrift://hcshadoop04.dev.xxx.cn:9083,thrift://hcshadoop05.dev.xxx.cn:9083
```

If I use HDFS as the catalog, it works fine. The config is as below:

```properties
spark.sql.catalog.iceberg_prod           = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_prod.type      = hadoop
spark.sql.catalog.iceberg_prod.warehouse = hdfs://nameservice/tmp/spark-iceberg
```

tomfans commented 4 months ago

I have also checked other comments on this kind of issue. The reason the table directories can't be deleted on drop is that Spark creates the table as an external table. Some suggest tricks like "alter table from external table to managed table", but that still doesn't work.

manuzhang commented 4 months ago

> alter table from external table to managed table

This only works with Hive CLI.

tomfans commented 4 months ago

I just want to confirm: when I use Spark with the Iceberg catalog "org.apache.iceberg.spark.SparkCatalog" and HMS, is it normal that the table directories can't be deleted after a table is dropped?

If this is normal, a lot of table directories will be kept in the data warehouse '/user/hive/warehouse', leaving many small files and garbage.

wanghualei commented 4 months ago

I am hitting this problem too. When a user runs "drop table purge", it deletes the files in the data and metadata directories, but the table directory still exists. Normally the directory should be deleted as well.

tomfans commented 4 months ago

> I am hitting this problem too. When a user runs "drop table purge", it deletes the files in the data and metadata directories, but the table directory still exists. Normally the directory should be deleted as well.

That's with Spark, right? It seems somewhat different from my case: for me, dropping the table only deletes the metadata files, while the data directory and data files, the metadata directory, and the table directory itself all still exist.

manuzhang commented 4 months ago

@tomfans If you mean empty table directories are left over, I can confirm that's the behavior of HiveCatalog. It removes the table record from the metastore and deletes all referenced metadata and data. The rationale I can see is that Iceberg cannot assume all files under the directory belong to the table.

As for HadoopCatalog, deleting the directory is the only solution, since the directory is the catalog itself.
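To make that concrete, here is a minimal sketch against the Iceberg catalog API (the catalog name, metastore URI, and table identifier are made up):

```java
import java.util.Collections;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;

public class HiveCatalogDropSketch {
  public static void main(String[] args) {
    HiveCatalog catalog = new HiveCatalog();
    catalog.initialize("iceberg_prod",
        Collections.singletonMap("uri", "thrift://metastore-host:9083"));

    // purge=true removes the metastore record and deletes every metadata and
    // data file referenced by the table, but it does not recursively delete
    // the table directory itself, so an empty directory is left behind.
    catalog.dropTable(TableIdentifier.of("db", "events"), true /* purge */);
  }
}
```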

tomfans commented 4 months ago


Yes, it's HiveCatalog; HadoopCatalog works fine. Do you mean it's normal for HiveCatalog to keep these data/metadata/table directories when a table is dropped?

If so, dropping tables will leave behind a lot of table directories. How should we handle that?

manuzhang commented 4 months ago

You may create an external auto-purge process if you are sure these directories are safe to delete.
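A sketch of such a process, assuming the goal is only to remove the empty directory shells a purge drop leaves behind (the warehouse path is a placeholder):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PurgeEmptyTableDirs {
  public static void main(String[] args) throws Exception {
    Path warehouse = new Path("hdfs://nameservice/user/hive/warehouse");
    FileSystem fs = FileSystem.get(URI.create(warehouse.toString()), new Configuration());

    for (FileStatus db : fs.listStatus(warehouse)) {           // <db>.db directories
      if (!db.isDirectory()) {
        continue;
      }
      for (FileStatus table : fs.listStatus(db.getPath())) {   // table directories
        // delete only directories that are already empty; the non-recursive
        // delete fails rather than destroys anything still holding files
        if (table.isDirectory() && fs.listStatus(table.getPath()).length == 0) {
          fs.delete(table.getPath(), false);
        }
      }
    }
    fs.close();
  }
}
```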

wanghualei commented 3 months ago

@manuzhang

> @tomfans If you mean empty table directories are left over, I can confirm that's the behavior of HiveCatalog. It removes the table record from the metastore and deletes all referenced metadata and data. The rationale I can see is that Iceberg cannot assume all files under the directory belong to the table.
>
> As for HadoopCatalog, deleting the directory is the only solution, since the directory is the catalog itself.

Why not? Assuming all files under the directory belong to the table is the normal case.

wanghualei commented 3 months ago

Is it related to object storage, where there is no real directory hierarchy?

manuzhang commented 3 months ago

For example, I can create a table B with location under that of table A. I don't want to delete table B when dropping table A.
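A concrete sketch of that situation (paths and names are made up): recursively deleting table a's directory on drop would silently take table b's files with it.

```java
import org.apache.spark.sql.SparkSession;

public class NestedTableLocations {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    spark.sql("CREATE TABLE iceberg_prod.db.a (id BIGINT) USING iceberg "
        + "LOCATION 'hdfs://nameservice/warehouse/a'");
    // table b lives underneath table a's location
    spark.sql("CREATE TABLE iceberg_prod.db.b (id BIGINT) USING iceberg "
        + "LOCATION 'hdfs://nameservice/warehouse/a/nested/b'");

    // PURGE deletes only the files a's metadata references; wiping the whole
    // directory would also delete b's data and metadata
    spark.sql("DROP TABLE iceberg_prod.db.a PURGE");
  }
}
```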

wanghualei commented 3 months ago

> For example, I can create a table B with location under that of table A. I don't want to delete table B when dropping table A.

Generally speaking, this situation should not occur. Because it undermines table-level integrity, it can be constrained by coding conventions. People are more accustomed to one table, one directory.

wanghualei commented 3 months ago

> For example, I can create a table B with location under that of table A. I don't want to delete table B when dropping table A.

If we reason like this, doesn't HadoopCatalog also have the same cross-use of tables and directories?

manuzhang commented 3 months ago

> Because it undermines table-level integrity

Iceberg manages table integrity. What can be improved is to offer an option to delete the directory when users know it's safe to do so. HadoopCatalog is not recommended, as discussed in this mail thread.

wanghualei commented 3 months ago

> > Because it undermines table-level integrity
>
> Iceberg manages table integrity. What can be improved is to offer an option to delete the directory when users know it's safe to do so. HadoopCatalog is not recommended, as discussed in this mail thread.

We strongly recommend adding relevant options to clear the table directory. Thank you very much.

wForget commented 3 months ago

I also encountered the same problem when using SparkSessionCatalog to delete a non-Iceberg Hive table.

@manuzhang Iceberg HiveCatalog allows deletion of non-Iceberg tables; is this expected behavior?

tomfans commented 3 months ago

I don't think this is related to the session catalog or non-Iceberg tables. With HMS as the catalog, Spark can't delete the table's data files and directories; with HDFS as the catalog, there's no problem.

wForget commented 3 months ago

> I don't think this is related to the session catalog or non-Iceberg tables. With HMS as the catalog, Spark can't delete the table's data files and directories; with HDFS as the catalog, there's no problem.

If Iceberg HiveCatalog only handled Iceberg tables, non-Iceberg tables would fall back to the Spark session catalog. Wouldn't that be more reasonable? See the sketch after the link below.

https://github.com/apache/iceberg/blob/96793bf621524f57e99cd19d410cde734bf588eb/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java#L281
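
A rough sketch of that fallback idea, not the actual SparkSessionCatalog code (the class and field names here are illustrative): let the Iceberg catalog drop only the tables it actually owns, and delegate everything else to the built-in session catalog.

```java
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.TableCatalog;

class FallbackDropSketch {
  private final TableCatalog icebergCatalog;  // e.g. Iceberg's SparkCatalog
  private final TableCatalog sessionCatalog;  // Spark's built-in catalog

  FallbackDropSketch(TableCatalog icebergCatalog, TableCatalog sessionCatalog) {
    this.icebergCatalog = icebergCatalog;
    this.sessionCatalog = sessionCatalog;
  }

  boolean dropTable(Identifier ident, boolean purge) {
    // assumption: tableExists is true on the Iceberg catalog only for
    // tables it can load, i.e. real Iceberg tables
    TableCatalog owner =
        icebergCatalog.tableExists(ident) ? icebergCatalog : sessionCatalog;
    return purge ? owner.purgeTable(ident) : owner.dropTable(ident);
  }
}
```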