apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Spark with HiveCatalog: DROP TABLE XX PURGE does not use the purge code in HiveCatalog.dropTable() #11484

Open zhangwl9 opened 2 weeks ago

zhangwl9 commented 2 weeks ago

Query engine

Iceberg 1.4.3 + Spark 3.3 + HiveCatalog

Question

1. Background

The HiveCatalog#dropTable method takes a boolean purge option that performs the purge operation. The SparkCatalog#purgeTable method, however, uses a Spark action to implement the purge separately, rather than directly executing the purge logic in HiveCatalog#dropTable.
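To make the two code paths concrete, here is a minimal, self-contained sketch. It is a toy model and not Iceberg's actual implementation: `ToyCatalog`, `enginePurgeTable`, and the in-memory `storage` set are hypothetical names invented for illustration, and a parallel stream merely stands in for a distributed Spark action. It only shows the structural difference between a purge done inside the catalog's drop call and a purge done by the engine followed by a non-purging drop.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PurgePathsSketch {

    /** Hypothetical toy catalog: maps a table name to the files it references. */
    static class ToyCatalog {
        final Map<String, List<String>> tables = new HashMap<>();
        // Stand-in for the file system; thread-safe so parallel deletes are safe.
        final Set<String> storage = ConcurrentHashMap.newKeySet();

        void createTable(String name, List<String> files) {
            tables.put(name, new ArrayList<>(files));
            storage.addAll(files);
        }

        /** Path A: the catalog removes the entry and, if purge is true, deletes the files itself. */
        boolean dropTable(String name, boolean purge) {
            List<String> files = tables.remove(name);
            if (files == null) {
                return false; // table did not exist
            }
            if (purge) {
                files.forEach(storage::remove);
            }
            return true;
        }
    }

    /** Path B: the engine deletes the files first, then asks the catalog to drop without purging. */
    static boolean enginePurgeTable(ToyCatalog catalog, String name) {
        List<String> files = catalog.tables.get(name);
        if (files == null) {
            return false; // table did not exist
        }
        // Assumption: a parallel stream stands in for a distributed delete action.
        files.parallelStream().forEach(catalog.storage::remove);
        return catalog.dropTable(name, false);
    }

    public static void main(String[] args) {
        ToyCatalog a = new ToyCatalog();
        a.createTable("db.t", List.of("data-1.parquet", "data-2.parquet", "metadata.json"));
        a.dropTable("db.t", true);

        ToyCatalog b = new ToyCatalog();
        b.createTable("db.t", List.of("data-1.parquet", "data-2.parquet", "metadata.json"));
        enginePurgeTable(b, "db.t");

        System.out.println("catalog-side purge leaves files: " + a.storage.size());
        System.out.println("engine-side purge leaves files: " + b.storage.size());
    }
}
```

In this toy model both paths end in the same state. The difference is where the file deletion runs: inside the catalog call, or in the engine before a plain drop.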

2. Question

Why doesn't Spark with HiveCatalog use the existing purge code from HiveCatalog#dropTable for its purge operation, but instead implements it separately?

amabilee commented 2 weeks ago

Hey there!

The reason Spark with HiveCatalog doesn't use the existing purge code from HiveCatalog#dropTable for its purge operation is primarily due to performance and storage considerations.

When you use the PURGE option in Hive, it immediately deletes the underlying data files without moving them to a temporary holding area like the HDFS trash. This can be crucial for performance, storage, and security reasons, especially when dealing with large datasets or sensitive information.

However, when Spark SQL performs a DROP TABLE operation with the PURGE clause, it doesn't pass this clause along to the Hive statement that performs the drop table operation behind the scenes. Therefore, the purge behavior isn't applied as expected.

To ensure the purge operation is performed correctly, it's recommended to execute the DROP TABLE operation directly in Hive, for example, through the Hive CLI (command-line interface), rather than through Spark SQL.

Here is the reference: https://docs.cloudera.com/runtime/latest/developing-spark-applications/topics/spark-sql-drop-table-purge-considerations.html