GoogleCloudPlatform / datacatalog-connectors-hive

Sample code with integration between Data Catalog and Hive data source.
Apache License 2.0
25 stars 14 forks source link

ADD pagination logic and change the scrape query session strategy to be more optimized. #23

Closed mesmacosta closed 3 years ago

mesmacosta commented 3 years ago

- What I did

There is room for improving this by also adding pagination to the table level, the test execution reported that this worked well with 18896 tables, I suggest tracking this, and see if this next improvement is needed.

- How I did it

Added the pagination logic, this is managed internally by the MetadataDatabaseScraper class. In a subsequent PR we should enable users to change this config using a user provided config, like we do in the RDBMS connectors.

Used the subqueryload function.

- How to verify it

Run the Hive connector in a Hive metastore that has more than 15.000 tables.

- Description for the changelog For reference with this change, successfully ingested 18896 Hive tables running the connector inside GCP, using a e2-standard-2 VM in ~20 minutes:

Scrape phase log ==============Prepare metadata===============
INFO:root:
Preparing the metadata...
INFO:root:
--> Database: default
INFO:root:
0 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouse6c790c3d
INFO:root:
5 tables ready to be ingested...
INFO:root:
--> Database: school_warehouseecfd699d
INFO:root:
5 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse4daf28d6
INFO:root:
5 tables ready to be ingested...
INFO:root:
--> Database: factory_warehoused5aef86b
INFO:root:
5 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse896cb63d
INFO:root:
5 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse6222269a
INFO:root:
1 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse2d55e746
INFO:root:
370 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse8b40a01c
INFO:root:
600 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse20872b3c
INFO:root:
615 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse3bcafce1
INFO:root:
572 tables ready to be ingested...
INFO:root:
--> Database: school_warehousea99c6e05
INFO:root:
587 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse5a491dca
INFO:root:
644 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse633563c9
INFO:root:
667 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse171f1870
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse94bdf270
INFO:root:
140 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse2f261704
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouseed7a3008
INFO:root:
187 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse542b1539
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehouse449c2ff3
INFO:root:
95 tables ready to be ingested...
INFO:root:
--> Database: company_warehouseb53ab401
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouse9ed6aa1c
INFO:root:
92 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse3ab8b955
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehousea9d3e4ee
INFO:root:
250 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse8cb4aa80
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: company_warehouseafa98049
INFO:root:
155 tables ready to be ingested...
INFO:root:
--> Database: company_warehousea477c097
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehousecc1435f9
INFO:root:
96 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehouse47d6a8d7
INFO:root:
0 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse1e4e7214
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse562beea5
INFO:root:
178 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehouse58b20cef
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse5b78c778
INFO:root:
171 tables ready to be ingested...
INFO:root:
--> Database: factory_warehousede2e3277
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehouseb54b6e6b
INFO:root:
163 tables ready to be ingested...
INFO:root:
--> Database: factory_warehousefd1b9f30
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehousef6726aef
INFO:root:
95 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouse7ffff842
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: company_warehousedb5688a1 INFO:root:
86 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse83310b44
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouse03d0085d
INFO:root:
62 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehouse7548c922
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse773a7423
INFO:root:
67 tables ready to be ingested...
INFO:root:
--> Database: factory_warehousec24e6cd5
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse4f2d3b1c
INFO:root:
154 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse550ca545
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehoused8f25979
INFO:root:
163 tables ready to be ingested...
INFO:root:
--> Database: organization_warehoused1471bcc
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouseb3d1f90a
INFO:root:
183 tables ready to be ingested...
INFO:root:
--> Database: company_warehouseecbdf433
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouseaf0a0e1f
INFO:root:
75 tables ready to be ingested...
INFO:root:
--> Database: organization_warehousedf85e0c3
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse695f153e
INFO:root:
158 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse48d56317
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse9f9800e5
INFO:root:
242 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse347535e7
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouseae17763a
INFO:root:
161 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehousedfca5bcb
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse475be898
INFO:root:
154 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouse75b7f6a6
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse7192ecb2
INFO:root:
188 tables ready to be ingested...
INFO:root:
==============Ingest metadata===============

Before the same scenario would throw a session timeout error. Fixes #21