Open zhangwl9 opened 2 weeks ago
Hey there!
The reason Spark with HiveCatalog doesn't use the existing purge code from HiveCatalog#dropTable for its purge operation is primarily performance and storage considerations.
When you use the PURGE option in Hive, it immediately deletes the underlying data files without moving them to a temporary holding area such as the HDFS trash. This can be important for performance, storage, and security reasons, especially when dealing with large datasets or sensitive information.
However, when Spark SQL performs a DROP TABLE operation with the PURGE clause, it doesn't pass this clause along to the Hive statement that performs the drop behind the scenes. Therefore, the purge behavior isn't applied as expected.
To ensure the purge operation is performed correctly, it's recommended to execute the DROP TABLE statement directly in Hive, for example through the Hive CLI (command-line interface), rather than through Spark SQL.
Here is the reference: https://docs.cloudera.com/runtime/latest/developing-spark-applications/topics/spark-sql-drop-table-purge-considerations.html
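To make the PURGE semantics above concrete, here is a minimal toy sketch in plain Python (not Hive's actual implementation — the file map and `.Trash` prefix are illustrative assumptions): without PURGE the dropped table's files are moved to a trash location and remain recoverable; with PURGE they are deleted outright.

```python
# Toy model (NOT Hive's real code) of DROP TABLE vs DROP TABLE PURGE.
# `files` is a dict mapping path -> bytes, standing in for a filesystem.

def drop_table(files, table_location, purge=False):
    """Drop every file under table_location; return the resulting filesystem."""
    dropped = {p: d for p, d in files.items() if p.startswith(table_location)}
    kept = {p: d for p, d in files.items() if not p.startswith(table_location)}
    if purge:
        # PURGE: data files are deleted immediately -- nothing is retained.
        return kept
    # No PURGE: files are "moved" to a trash prefix, so they can be restored.
    for p, d in dropped.items():
        kept["/.Trash" + p] = d
    return kept

fs = {
    "/warehouse/db/t/part-0": b"a",
    "/warehouse/db/t/part-1": b"b",
    "/warehouse/db/other/part-0": b"c",
}

after_soft = drop_table(fs, "/warehouse/db/t/")            # files land in /.Trash
after_purge = drop_table(fs, "/warehouse/db/t/", purge=True)  # files are gone
```

The point of the sketch is only the contrast: the non-purge path keeps a recoverable copy, the purge path does not, which is why PURGE matters for storage and security.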
Query engine
Iceberg 1.4.3 + Spark 3.3 + HiveCatalog
Question
1. Background
The HiveCatalog#dropTable method takes a boolean purge option to perform the purge operation. However, the SparkCatalog#purgeTable method uses a Spark action to implement the purge separately, rather than directly executing the purge logic from HiveCatalog#dropTable.
2. Question
Why doesn't Spark with HiveCatalog use the existing purge code from HiveCatalog#dropTable for its purge operation, but instead implements it separately?
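The two code paths the question contrasts can be sketched as a self-contained toy model in plain Python (not the real Iceberg or Spark classes — names like `reachable_files` and the dict-based "filesystem" are illustrative assumptions). A catalog-side purge deletes everything under the table *location* in one metastore-side operation; an engine-side purge walks the table metadata for every *reachable* file, deletes those (in the real system, in parallel via a Spark action), then drops the table from the catalog without purging again.

```python
# Toy contrast (NOT the real Iceberg code) between the two purge paths.
import copy

def hive_catalog_drop(fs, table, purge):
    """HiveCatalog-style drop: when purge=True, the catalog deletes
    everything under the table's location wholesale."""
    fs["metadata"].pop(table["name"])
    if purge:
        return {p: d for p, d in fs["files"].items()
                if not p.startswith(table["location"])}
    return dict(fs["files"])

def spark_purge_table(fs, table):
    """SparkCatalog-style purge: delete only the files the table metadata
    actually references (data, manifests, metadata files), then drop the
    table from the catalog WITHOUT asking the catalog to purge again."""
    reachable = set(table["reachable_files"])  # derived from snapshots/manifests
    remaining = {p: d for p, d in fs["files"].items() if p not in reachable}
    fs["metadata"].pop(table["name"])
    return remaining

table = {"name": "t", "location": "/wh/t/",
         "reachable_files": ["/wh/t/data-0.parquet", "/wh/t/metadata.json"]}
fs1 = {"metadata": {"t": table},
       "files": {"/wh/t/data-0.parquet": b"", "/wh/t/metadata.json": b"",
                 "/wh/t/orphan.tmp": b""}}
fs2 = copy.deepcopy(fs1)

left_hive = hive_catalog_drop(fs1, table, purge=True)  # whole location removed
left_spark = spark_purge_table(fs2, table)             # untracked orphan.tmp survives
```

The sketch also hints at why the behaviors differ in practice: the location-based delete removes files the table never referenced (like `orphan.tmp` here), while the metadata-driven delete touches only files the table tracks — and in the real implementation that deletion can be distributed across Spark executors, which matters for very large tables.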