Added query pagination logic
Changed the metadata_database_scraper.py logic, to be more optimized and work with a large numbers tables within a Hive environment, this was tested with 18896 tables in a Hive Metastore backed by PostgreSQL and with less tables on a MySQL backed Metastore.
The pagination is done at the database level, the scraper class opens a session for every 5 Databases using the pagination logic created.
Ideally we should add pagination to the table level too, but this will add extra complexity to the sqlalchemy logic.
Changed the sqlalchemy query strategy to execute the query in the same session
Before it was using the defaultload strategy that lazily executes the queries, so if the number of assets was too large, this could lead to the query being executed after the session was closed.
Now it uses subqueryload which creates a subquery and returns all tables for each database in the same session and executes it early.
There is room for improving this by also adding pagination to the table level, the test execution reported that this worked well with 18896 tables, I suggest tracking this, and see if this next improvement is needed.
- How I did it
Added the pagination logic, this is managed internally by the MetadataDatabaseScraper class. In a subsequent PR we should enable users to change this config using a user provided config, like we do in the RDBMS connectors.
Used the subqueryload function.
- How to verify it
Run the Hive connector in a Hive metastore that has more than 15.000 tables.
- Description for the changelog
For reference with this change, successfully ingested 18896 Hive tables running the connector inside GCP, using a e2-standard-2 VM in ~20 minutes:
Scrape phase log
==============Prepare metadata===============
INFO:root:
Preparing the metadata...
INFO:root:
--> Database: default
INFO:root:
0 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouse6c790c3d
INFO:root:
5 tables ready to be ingested...
INFO:root:
--> Database: school_warehouseecfd699d
INFO:root:
5 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse4daf28d6
INFO:root:
5 tables ready to be ingested...
INFO:root:
--> Database: factory_warehoused5aef86b
INFO:root:
5 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse896cb63d
INFO:root:
5 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse6222269a
INFO:root:
1 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse2d55e746
INFO:root:
370 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse8b40a01c
INFO:root:
600 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse20872b3c
INFO:root:
615 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse3bcafce1
INFO:root:
572 tables ready to be ingested...
INFO:root:
--> Database: school_warehousea99c6e05
INFO:root:
587 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse5a491dca
INFO:root:
644 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse633563c9
INFO:root:
667 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse171f1870
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse94bdf270
INFO:root:
140 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse2f261704
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouseed7a3008
INFO:root:
187 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse542b1539
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehouse449c2ff3
INFO:root:
95 tables ready to be ingested...
INFO:root:
--> Database: company_warehouseb53ab401
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouse9ed6aa1c
INFO:root:
92 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse3ab8b955
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehousea9d3e4ee
INFO:root:
250 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse8cb4aa80
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: company_warehouseafa98049
INFO:root:
155 tables ready to be ingested...
INFO:root:
--> Database: company_warehousea477c097
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehousecc1435f9
INFO:root:
96 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehouse47d6a8d7
INFO:root:
0 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse1e4e7214
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse562beea5
INFO:root:
178 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehouse58b20cef
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse5b78c778
INFO:root:
171 tables ready to be ingested...
INFO:root:
--> Database: factory_warehousede2e3277
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehouseb54b6e6b
INFO:root:
163 tables ready to be ingested...
INFO:root:
--> Database: factory_warehousefd1b9f30
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehousef6726aef
INFO:root:
95 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouse7ffff842
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: company_warehousedb5688a1
INFO:root:
86 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse83310b44
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouse03d0085d
INFO:root:
62 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehouse7548c922
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse773a7423
INFO:root:
67 tables ready to be ingested...
INFO:root:
--> Database: factory_warehousec24e6cd5
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse4f2d3b1c
INFO:root:
154 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse550ca545
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehoused8f25979
INFO:root:
163 tables ready to be ingested...
INFO:root:
--> Database: organization_warehoused1471bcc
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouseb3d1f90a
INFO:root:
183 tables ready to be ingested...
INFO:root:
--> Database: company_warehouseecbdf433
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouseaf0a0e1f
INFO:root:
75 tables ready to be ingested...
INFO:root:
--> Database: organization_warehousedf85e0c3
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: school_warehouse695f153e
INFO:root:
158 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse48d56317
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse9f9800e5
INFO:root:
242 tables ready to be ingested...
INFO:root:
--> Database: company_warehouse347535e7
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouseae17763a
INFO:root:
161 tables ready to be ingested...
INFO:root:
--> Database: on_prem_warehousedfca5bcb
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse475be898
INFO:root:
154 tables ready to be ingested...
INFO:root:
--> Database: factory_warehouse75b7f6a6
INFO:root:
500 tables ready to be ingested...
INFO:root:
--> Database: organization_warehouse7192ecb2
INFO:root:
188 tables ready to be ingested...
INFO:root:
==============Ingest metadata===============
Before the same scenario would throw a session timeout error. Fixes #21
- What I did
Added query pagination logic
Changed the
metadata_database_scraper.py
logic, to be more optimized and work with a large numbers tables within a Hive environment, this was tested with 18896 tables in a Hive Metastore backed by PostgreSQL and with less tables on a MySQL backed Metastore. The pagination is done at the database level, the scraper class opens a session for every 5 Databases using the pagination logic created. Ideally we should add pagination to the table level too, but this will add extra complexity to the sqlalchemy logic.Changed the sqlalchemy query strategy to execute the query in the same session
Before it was using the
defaultload
strategy that lazily executes the queries, so if the number of assets was too large, this could lead to the query being executed after the session was closed. Now it usessubqueryload
which creates a subquery and returns all tables for each database in the same session and executes it early.There is room for improving this by also adding pagination to the table level, the test execution reported that this worked well with 18896 tables, I suggest tracking this, and see if this next improvement is needed.
- How I did it
Added the pagination logic, this is managed internally by the
MetadataDatabaseScraper
class. In a subsequent PR we should enable users to change this config using a user provided config, like we do in the RDBMS connectors.Used the
subqueryload
function.- How to verify it
Run the Hive connector in a Hive metastore that has more than 15.000 tables.
- Description for the changelog For reference with this change, successfully ingested 18896 Hive tables running the connector inside GCP, using a e2-standard-2 VM in ~20 minutes:
Scrape phase log
==============Prepare metadata=============== INFO:root: Preparing the metadata... INFO:root: --> Database: default INFO:root: 0 tables ready to be ingested... INFO:root: --> Database: factory_warehouse6c790c3d INFO:root: 5 tables ready to be ingested... INFO:root: --> Database: school_warehouseecfd699d INFO:root: 5 tables ready to be ingested... INFO:root: --> Database: company_warehouse4daf28d6 INFO:root: 5 tables ready to be ingested... INFO:root: --> Database: factory_warehoused5aef86b INFO:root: 5 tables ready to be ingested... INFO:root: --> Database: school_warehouse896cb63d INFO:root: 5 tables ready to be ingested... INFO:root: --> Database: company_warehouse6222269a INFO:root: 1 tables ready to be ingested... INFO:root: --> Database: company_warehouse2d55e746 INFO:root: 370 tables ready to be ingested... INFO:root: --> Database: organization_warehouse8b40a01c INFO:root: 600 tables ready to be ingested... INFO:root: --> Database: organization_warehouse20872b3c INFO:root: 615 tables ready to be ingested... INFO:root: --> Database: school_warehouse3bcafce1 INFO:root: 572 tables ready to be ingested... INFO:root: --> Database: school_warehousea99c6e05 INFO:root: 587 tables ready to be ingested... INFO:root: --> Database: organization_warehouse5a491dca INFO:root: 644 tables ready to be ingested... INFO:root: --> Database: school_warehouse633563c9 INFO:root: 667 tables ready to be ingested... INFO:root: --> Database: company_warehouse171f1870 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: organization_warehouse94bdf270 INFO:root: 140 tables ready to be ingested... INFO:root: --> Database: company_warehouse2f261704 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: factory_warehouseed7a3008 INFO:root: 187 tables ready to be ingested... INFO:root: --> Database: school_warehouse542b1539 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: on_prem_warehouse449c2ff3 INFO:root: 95 tables ready to be ingested... INFO:root: --> Database: company_warehouseb53ab401 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: factory_warehouse9ed6aa1c INFO:root: 92 tables ready to be ingested... INFO:root: --> Database: school_warehouse3ab8b955 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: factory_warehousea9d3e4ee INFO:root: 250 tables ready to be ingested... INFO:root: --> Database: school_warehouse8cb4aa80 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: company_warehouseafa98049 INFO:root: 155 tables ready to be ingested... INFO:root: --> Database: company_warehousea477c097 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: school_warehousecc1435f9 INFO:root: 96 tables ready to be ingested... INFO:root: --> Database: on_prem_warehouse47d6a8d7 INFO:root: 0 tables ready to be ingested... INFO:root: --> Database: organization_warehouse1e4e7214 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: company_warehouse562beea5 INFO:root: 178 tables ready to be ingested... INFO:root: --> Database: on_prem_warehouse58b20cef INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: school_warehouse5b78c778 INFO:root: 171 tables ready to be ingested... INFO:root: --> Database: factory_warehousede2e3277 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: on_prem_warehouseb54b6e6b INFO:root: 163 tables ready to be ingested... INFO:root: --> Database: factory_warehousefd1b9f30 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: school_warehousef6726aef INFO:root: 95 tables ready to be ingested... INFO:root: --> Database: factory_warehouse7ffff842 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: company_warehousedb5688a1 INFO:root: 86 tables ready to be ingested... INFO:root: --> Database: organization_warehouse83310b44 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: factory_warehouse03d0085d INFO:root: 62 tables ready to be ingested... INFO:root: --> Database: on_prem_warehouse7548c922 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: school_warehouse773a7423 INFO:root: 67 tables ready to be ingested... INFO:root: --> Database: factory_warehousec24e6cd5 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: school_warehouse4f2d3b1c INFO:root: 154 tables ready to be ingested... INFO:root: --> Database: organization_warehouse550ca545 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: on_prem_warehoused8f25979 INFO:root: 163 tables ready to be ingested... INFO:root: --> Database: organization_warehoused1471bcc INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: factory_warehouseb3d1f90a INFO:root: 183 tables ready to be ingested... INFO:root: --> Database: company_warehouseecbdf433 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: factory_warehouseaf0a0e1f INFO:root: 75 tables ready to be ingested... INFO:root: --> Database: organization_warehousedf85e0c3 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: school_warehouse695f153e INFO:root: 158 tables ready to be ingested... INFO:root: --> Database: organization_warehouse48d56317 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: company_warehouse9f9800e5 INFO:root: 242 tables ready to be ingested... INFO:root: --> Database: company_warehouse347535e7 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: factory_warehouseae17763a INFO:root: 161 tables ready to be ingested... INFO:root: --> Database: on_prem_warehousedfca5bcb INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: organization_warehouse475be898 INFO:root: 154 tables ready to be ingested... INFO:root: --> Database: factory_warehouse75b7f6a6 INFO:root: 500 tables ready to be ingested... INFO:root: --> Database: organization_warehouse7192ecb2 INFO:root: 188 tables ready to be ingested... INFO:root: ==============Ingest metadata===============Before the same scenario would throw a session timeout error. Fixes #21