tomfans opened this issue 4 months ago
Which catalogs do you use in those two cases? Can you share the configs?
org.apache.iceberg.spark.SparkCatalog with HMS. If I use HMS as the catalog store, I can't delete table directories when I drop a table, even when I drop the table with PURGE. The software versions are: Spark 3.3.2, Hive 2.3.9, Iceberg jar 1.4.0.
The Iceberg config is as below:

#############iceberg####################
spark.sql.catalog.iceberg_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_prod.type = hive
spark.sql.catalog.iceberg_prod.uri = thrift://hcshadoop04.dev.xxx.cn:9083,thrift://hcshadoop05.dev.xxx.cn:9083
If I use HDFS as the catalog, it works fine. The config is as below:

spark.sql.catalog.iceberg_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_prod.type = hadoop
spark.sql.catalog.iceberg_prod.warehouse = hdfs://nameservice/tmp/spark-iceberg
I have checked other comments on this kind of issue asking why table directories can't be deleted on drop. The explanation given is that the table is created as an external table by Spark, and the suggested trick is to alter the table from an external table to a managed table, but that still doesn't work.
alter table from external table to managed table
This only works with Hive CLI.
I just want to confirm: when using Spark with the Iceberg catalog "org.apache.iceberg.spark.SparkCatalog" and HMS, is it normal that table directories cannot be deleted after the table is dropped?
If this is normal, a lot of table directories will be kept in the data warehouse '/user/hive/warehouse', leaving many, many small files and garbage.
I have met this problem too. When a user runs "drop table purge", it deletes the files in the data and metadata directories, but the table directory itself still exists. Normally the directory should be deleted as well.
That's on Spark, right? It seems there are some differences compared to my case. In my case, dropping a table only deletes the metadata files; the data directory and data files, the metadata directory, and the table directory all still exist.
@tomfans If you mean empty table directories are left over, I can confirm that's the behavior for HiveCatalog. It removes the table record from the metastore and deletes all referenced metadata and data. The rationale I can see is that Iceberg cannot assume all files under the directory belong to the table. As for HadoopCatalog, deleting the directory is the only option, since the directory is the catalog itself.
Yes, it's HiveCatalog; HadoopCatalog works fine. You mean it's normal behavior for HiveCatalog to keep these data/metadata/table directories when a table is dropped?
If so, when we drop tables, a lot of table directories will be left behind. How do we handle this kind of problem?
You may create an external auto-purge process if you are sure these directories are safe to delete.
@manuzhang

@tomfans If you mean empty table directories are left over, I can confirm that's the behavior for HiveCatalog. It removes the table record from the metastore and deletes all referenced metadata and data. The rationale I can see is that Iceberg cannot assume all files under the directory belong to the table. As for HadoopCatalog, deleting the directory is the only option, since the directory is the catalog itself.
Why not? Assuming all files under the directory belong to the table is the normal expectation.
Is it related to object storage? There is no real directory hierarchy there.
For example, I can create a table B with its location under that of table A. I don't want table B to be deleted when dropping table A.
For example, I can create a table B with its location under that of table A. I don't want table B to be deleted when dropping table A.
Generally speaking, this situation should not occur, because it undermines table-level integrity; it can be constrained by coding conventions. People are more accustomed to one table, one directory.
For example, I can create a table B with its location under that of table A. I don't want table B to be deleted when dropping table A.
If we reason like this, doesn't HadoopCatalog also have the same cross-use of tables and directories?
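The hazard being debated here can be shown with plain directories. The paths below are purely illustrative; the point is why blindly removing table A's directory is unsafe when another table's location is nested under it:

```python
import os
import shutil
import tempfile

warehouse = tempfile.mkdtemp()

# Table A at <warehouse>/a, and table B deliberately located under A's directory.
table_a = os.path.join(warehouse, "a")
table_b = os.path.join(table_a, "nested", "b")
os.makedirs(os.path.join(table_b, "data"))
b_file = os.path.join(table_b, "data", "part-0.parquet")
open(b_file, "w").close()

# "Dropping A by deleting its directory" also destroys B's files.
shutil.rmtree(table_a)
print(os.path.exists(b_file))  # False: table B's data is gone too
```

This is the case HiveCatalog's conservative behavior protects against, at the cost of leaving empty directories behind when no such nesting exists.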
Because it undermines table level integrity
Iceberg manages table integrity. What can be improved is to offer an option to delete the directory when users know it's safe to do so. HadoopCatalog is not recommended for use, as discussed in this mail thread.
We strongly recommend adding an option to clear the table directory. Thank you very much.
I also encountered the same problem when using SparkSessionCatalog to drop a non-Iceberg Hive table.
@manuzhang Iceberg HiveCatalog allows deletion of non-Iceberg tables; is this expected behavior?
I don't think this is related to the session catalog or non-Iceberg tables. If HMS is the catalog, Spark can't delete the table's data files and directories; if HDFS is the catalog, there is no problem.
If Iceberg HiveCatalog only handled Iceberg tables, non-Iceberg tables would fall back to SparkSessionCatalog. Would that be more reasonable?
Spark 3.3.2, Iceberg 1.4, metadata managed by HMS.
For a table created by Spark/Iceberg with metadata managed by HMS: when the table is dropped with PURGE, the data/metadata directories still exist.
If the metadata is managed by HDFS, it works.
How should this be handled?