Open dramaticlly opened 2 days ago
FYI @szehon-ho and @haizhou-zhao if you are interested
Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.
How do we handle corrupt metadata in other catalog implementations?
Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.
How do we handle corrupt metadata in other catalog implementations?
Thank you @pvary I think this indeed introduce a behavioral change. Majority of existing catalogs (except ECSCatalog) rely on this default implementation in Catalog interface where we tried to load the table first and return true if load is successful. I believe table exists here imply 2 things where both table entry exist in catalog as well as latest table metadata.json is not corrupted.
Personally I think we can focus on former only and here's my thought process
There are roughly 3 places where catalog.tableExists
was used in iceberg code base
registerTable
API https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java#L82
Thanks for the check @dramaticlly! I agree that this behavioural change is small, but I would like to raise awareness around the community about this. Could you please write a letter to the dev list describing what is planned here? So if there is someone who is against the change, they could raise their voices.
Thanks, Peter
Do we want to do the same opimization for the viewExists
method too?
Another minor behavioral change is , earlier if the user had access to both HMS table and storage, the table exists would pass. With the change, tableExists would pass with only access to HMS table?
Skip creation of hive table operation when check existence of iceberg table in hive catalog.
Today the table existence rely on load the table first and return true if table can be loaded https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/api/src/main/java/org/apache/iceberg/catalog/Catalog.java#L279-L286 I found there's opportunity for improvement on hive catalog where we can skip the instantiate of HiveTableOperations, avoid reading the iceberg metadata.json file by only rely on record within hive catalog.
This is important for REST based catalog which delegate work to hiveCatalog as API call volume can be high and this optimization can reduce API overhead and latency.
Why this is safe?
HiveOperationsBase.validateTableIsIceberg
is also used in catalog listTables API to differentiate the iceberg table from hive table