apache / amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.
https://amoro.apache.org/
Apache License 2.0
874 stars 290 forks

[Improvement]: reduce the impact of the listTables method in a unified catalog on the Hive Metastore (HMS) #2986

Open Aireed opened 4 months ago

Aireed commented 4 months ago

What would you like to be improved?

Problem:

If the Unified Catalog supports mixed-hive, iceberg, and paimon simultaneously, listTables calls getTables three times, getTableObjectsByName twice (a relatively heavy operation), and getTable multiple times.

Besides being called by the frontend to display the table list, listTables is also called by the logic that synchronizes with the external catalog (every 3 minutes by default).

How should we improve?

When the metastore is Hive, we can optimize by calling getAllTables and getTableObjectsByName once each to retrieve all tables and their types.
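A minimal sketch of that single-pass idea, for illustration only. `TableStub`, `MetastoreStub`, `SinglePassLister`, and the `example.*` parameter keys are hypothetical names, not Amoro or Hive classes; the stub only mirrors the two metastore calls named above. The `table_type=ICEBERG` check follows the usual Iceberg-on-Hive convention, while the markers for the other formats are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for an HMS table object: a name plus its table parameters. */
final class TableStub {
    final String name;
    final Map<String, String> parameters;

    TableStub(String name, Map<String, String> parameters) {
        this.name = name;
        this.parameters = parameters;
    }
}

/** Hypothetical stand-in exposing only the two metastore calls named in the issue. */
interface MetastoreStub {
    List<String> getAllTables(String database);

    List<TableStub> getTableObjectsByName(String database, List<String> tableNames);
}

final class SinglePassLister {
    /** Assumed classification rule; only the Iceberg marker is a known convention. */
    static String detectFormat(Map<String, String> params) {
        if ("ICEBERG".equalsIgnoreCase(params.get("table_type"))) {
            return "ICEBERG";
        }
        if ("true".equalsIgnoreCase(params.get("example.mixed-hive"))) { // hypothetical key
            return "MIXED_HIVE";
        }
        if ("true".equalsIgnoreCase(params.get("example.paimon"))) { // hypothetical key
            return "PAIMON";
        }
        return "UNKNOWN";
    }

    /** One getAllTables call plus one getTableObjectsByName call per database. */
    static Map<String, String> listTablesWithFormat(MetastoreStub client, String database) {
        List<String> names = client.getAllTables(database);
        Map<String, String> formats = new LinkedHashMap<>();
        for (TableStub table : client.getTableObjectsByName(database, names)) {
            formats.put(table.name, detectFormat(table.parameters));
        }
        return formats;
    }
}
```

With this shape, the number of metastore round trips per database stays at two regardless of how many formats the unified catalog supports.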

  1. Define an interface that supports listing all tables and their formats.
  2. MixedCatalog implements this interface.
  3. MixedHiveCatalog implements this interface.
  4. When UnifiedCatalog::listTables is called, we first check whether any of the supported FormatCatalog instances implements this interface. If so, we use the table list it returns instead of calling listTables on each type of FormatCatalog.
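The four steps above could be sketched roughly as follows. This is not the actual Amoro code: `FormatCatalogSketch`, `FormatAwareLister`, and `UnifiedListing` are hypothetical names, and the real FormatCatalog and UnifiedCatalog signatures may differ. The fallback branch, where the first catalog to report a table name wins, is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical minimal view of a per-format catalog. */
interface FormatCatalogSketch {
    String format();

    List<String> listTables(String database);
}

/** Hypothetical optional capability (step 1): list all tables with their formats at once. */
interface FormatAwareLister {
    Map<String, String> listTablesWithFormat(String database);
}

final class UnifiedListing {
    static Map<String, String> listTables(List<FormatCatalogSketch> catalogs, String database) {
        // Step 4: prefer a catalog that can return every table and its format in one pass.
        for (FormatCatalogSketch catalog : catalogs) {
            if (catalog instanceof FormatAwareLister) {
                return ((FormatAwareLister) catalog).listTablesWithFormat(database);
            }
        }
        // Fallback: one listTables call per format catalog, as described in the issue.
        Map<String, String> result = new LinkedHashMap<>();
        for (FormatCatalogSketch catalog : catalogs) {
            for (String name : catalog.listTables(database)) {
                // Assumption: the first catalog that reports a table name wins.
                result.putIfAbsent(name, catalog.format());
            }
        }
        return result;
    }
}
```

Keeping the capability in a separate interface means catalogs that cannot list formats cheaply need no changes, and the per-format path remains as a fallback.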

Aireed commented 4 months ago

@baiyangtx @zhoujinsong WDYT?

Aireed commented 4 months ago

The implementation is roughly like this (the code is quite old; ArcticCatalog has since been replaced with MixedHiveCatalog). [image]