apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.49k stars 2.24k forks source link

Hive: Optimize tableExists API in hive catalog #11597

Open dramaticlly opened 2 days ago

dramaticlly commented 2 days ago

Skip creation of hive table operation when check existence of iceberg table in hive catalog.

Today the table existence rely on load the table first and return true if table can be loaded https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/api/src/main/java/org/apache/iceberg/catalog/Catalog.java#L279-L286 I found there's opportunity for improvement on hive catalog where we can skip the instantiate of HiveTableOperations, avoid reading the iceberg metadata.json file by only rely on record within hive catalog.

This is important for REST based catalog which delegate work to hiveCatalog as API call volume can be high and this optimization can reduce API overhead and latency.

Why this is safe? HiveOperationsBase.validateTableIsIceberg is also used in catalog listTables API to differentiate the iceberg table from hive table

dramaticlly commented 2 days ago

FYI @szehon-ho and @haizhou-zhao if you are interested

pvary commented 1 day ago

Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.

How do we handle corrupt metadata in other catalog implementations?

dramaticlly commented 1 day ago

Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.

How do we handle corrupt metadata in other catalog implementations?

Thank you @pvary I think this indeed introduce a behavioral change. Majority of existing catalogs (except ECSCatalog) rely on this default implementation in Catalog interface where we tried to load the table first and return true if load is successful. I believe table exists here imply 2 things where both table entry exist in catalog as well as latest table metadata.json is not corrupted.

Personally I think we can focus on former only and here's my thought process

There are roughly 3 places where catalog.tableExists was used in iceberg code base

  1. Check before table can be registered in registerTable API https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java#L82
    • I believe behaviour change is allowed here as long as entry exist in catalog, register shall fail regardless files is corrupted
  2. Check before table stage creation, this is only used in REST catalog handler https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java#L228
    • I believe behaviour change is also allowed here as long as entry exist in catalog, stage creation shall fail regardless version files is corrupted
  3. REST API to check for table existence: https://github.com/apache/iceberg/blob/3badfe0c1fcf0c0adfc7aa4a10f0b50365c48cf9/open-api/rest-catalog-open-api.yaml#L1129-L1133
    • I think this is what I originally hoped for to optimize on, to speed up on the existence check without rely on reading metadata first. The reason is that sometimes existence check is all we need without subsequent load table call
pvary commented 16 hours ago

Thanks for the check @dramaticlly! I agree that this behavioural change is small, but I would like to raise awareness around the community about this. Could you please write a letter to the dev list describing what is planned here? So if there is someone who is against the change, they could raise their voices.

Thanks, Peter

pvary commented 16 hours ago

Do we want to do the same opimization for the viewExists method too?

karuppayya commented 6 hours ago

Another minor behavioral change is , earlier if the user had access to both HMS table and storage, the table exists would pass. With the change, tableExists would pass with only access to HMS table?