dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

Multithread support for stage_external_sources #280

Closed by azdoherty 5 months ago

azdoherty commented 5 months ago

Describe the feature

stage_external_sources would run a lot faster if it staged multiple tables in parallel across multiple threads.

Describe alternatives you've considered

I had previously used a pre-hook on each model that referenced an external table. Because the hooks were part of the models, they did run in parallel. That implementation was a bit messy though: the external table did not appear in the DAG, and you had to include a CREATE OR REPLACE EXTERNAL TABLE ... in your model.

Additional context

I have only used this in BigQuery.

Who will this benefit?

Anyone with a lot of external tables they need to stage before each build. I have 10 and staging takes over a minute, and the time will scale linearly with the number of external tables.

azdoherty commented 5 months ago

Should I close this due to the discussion here? https://github.com/dbt-labs/dbt-adapters/discussions/92

jeremyyeo commented 5 months ago

Hey @azdoherty, definitely move that discussion over there. FWIW, this is probably a dbt-core library issue: it's not possible to run SQL statements in parallel today, with the dbt-external-tables package or otherwise. I've provided the same workaround you used (hooks, since models can run in parallel), plus some other funky patterns using custom materializations: https://gist.github.com/jeremyyeo/b61655a3e5a52eb27640363650c79a1e. The idea is the same though: models run in parallel (up to the threads config), so use that mechanism to do parallel run operations instead.

However - this is primarily a dbt-core / dbt-adapters library issue imho.

Additionally this is likely a dupe of #109
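
The timing argument in this thread (one statement per table, run one after another, versus several in flight at once) can be sketched with a small simulation. This is not dbt or dbt-external-tables code: stage_table is a hypothetical stand-in that just sleeps, and the table names, the delay, and the thread count of 4 are assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def stage_table(name: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for one CREATE OR REPLACE EXTERNAL TABLE call."""
    time.sleep(delay)  # simulates waiting on the warehouse
    return f"staged {name}"


def stage_sequentially(tables):
    # One statement after another: total time grows linearly with len(tables).
    return [stage_table(t) for t in tables]


def stage_in_parallel(tables, threads: int = 4):
    # Up to `threads` statements in flight at once; map() preserves input order.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(stage_table, tables))


def timed(fn, *args):
    # Returns the function's result together with the elapsed wall-clock time.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    tables = [f"ext_table_{i}" for i in range(10)]
    _, seq = timed(stage_sequentially, tables)
    _, par = timed(stage_in_parallel, tables)
    print(f"sequential: {seq:.2f}s, parallel: {par:.2f}s")
```

Because each simulated call spends its time waiting rather than computing, threads are enough to overlap the work here; whether the same holds for a real warehouse depends on the adapter and connection handling.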