dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Feature] `dbt deps` is slow, suggest parallel package loading #8949

Open leo-schick opened 11 months ago

leo-schick commented 11 months ago

Is this your first time submitting a feature request?

Describe the feature

Running dbt deps is quite slow in my large project. It takes 1.73 minutes.

I import 18 packages in total: 2 packages from the dbt hub and 16 from git repositories on GitHub using the git notation git: "git@github.com:user/repo.git". I think this process could be made faster, for example by retrieving the repositories in parallel instead of in a single thread.
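To illustrate the idea of retrieving repositories in parallel, here is a minimal Python sketch using a thread pool. It is not dbt's implementation: `fetch_all` and `fake_clone` are hypothetical names, and `fake_clone` only sleeps to stand in for a network-bound clone, so the numbers it prints say nothing about real dbt deps timings.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fetch_all(
    packages: List[str],
    fetch_one: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Fetch every package concurrently and return {package: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in input order; an exception raised by
        # fetch_one is re-raised here when the results are consumed.
        results = list(pool.map(fetch_one, packages))
    return dict(zip(packages, results))


def fake_clone(url: str) -> str:
    """Hypothetical stand-in for a git clone: sleeps instead of cloning."""
    time.sleep(0.1)
    return f"cloned {url}"


if __name__ == "__main__":
    # 16 placeholder URLs in the git notation used in this issue.
    urls = [f"git@github.com:user/repo{i}.git" for i in range(16)]
    start = time.perf_counter()
    fetch_all(urls, fake_clone)
    print(f"parallel fetch took {time.perf_counter() - start:.2f}s")
```

Threads fit here because cloning is dominated by network and subprocess waits rather than CPU work. A real version would replace `fake_clone` with an actual clone step and would need to handle per-package failures instead of letting the first exception abort the run.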

Describe alternatives you've considered

No response

Who will this benefit?

Everybody who imports more than one package.

Are you interested in contributing this feature?

No response

Anything else?

No response

graciegoheen commented 11 months ago

Hey @leo-schick! We're currently working on an effort to improve the performance of dbt deps - namely, providing a way to only install the changed/new packages on a subsequent dbt deps. Relevant issue here.

What you're proposing, however, would improve the performance of the initial dbt deps - I don't think this piece will be a high priority for us currently, but definitely an enhancement we could tackle in the future!
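For comparison, the incremental approach mentioned above (installing only changed or new packages on a subsequent run) could look roughly like the following Python sketch. This is an assumption about one possible design, not how dbt does it: `spec_hash`, `packages_to_install`, and the `installed_hashes` state are hypothetical, and the package entries are modelled as plain dictionaries.

```python
import hashlib
import json
from typing import Dict, List


def spec_hash(spec: Dict[str, str]) -> str:
    """Stable fingerprint of one package entry (keys sorted for determinism)."""
    payload = json.dumps(spec, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def packages_to_install(
    wanted: List[Dict[str, str]],
    installed_hashes: Dict[str, str],
) -> List[Dict[str, str]]:
    """Return only the entries that are new or whose spec has changed."""
    todo = []
    for spec in wanted:
        # Hypothetical naming rule: identify an entry by its git URL,
        # falling back to its hub package name.
        name = spec.get("git") or spec.get("package", "")
        if installed_hashes.get(name) != spec_hash(spec):
            todo.append(spec)
    return todo


if __name__ == "__main__":
    repo = {"git": "git@github.com:user/repo.git", "revision": "v1"}
    hub = {"package": "dbt-labs/dbt_utils", "version": "1.1.1"}
    # Pretend the git package was installed by a previous run.
    state = {repo["git"]: spec_hash(repo)}
    print(packages_to_install([repo, hub], state))
```

This only helps when the state from a previous run survives. On a freshly built instance nothing is recorded, so every package is installed again, which is the case where parallel fetching would matter.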

leo-schick commented 11 months ago

Hey @graciegoheen, this is great to hear! I think #6643 will be of great help on local installations. However, in environments where instances are rebuilt on every job run (e.g. in Databricks), I think this ticket is a possible way to increase the speed.

I do not have deep knowledge of the dbt code base. Maybe it is possible to run dbt deps inside the same class that runs the models in parallel. This would help to avoid duplicated code.