dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.92k stars 1.63k forks source link

[CT-97] [Bug] dbt cli with an empty command takes more than a second to run #4627

Closed ajmarcus closed 2 years ago

ajmarcus commented 2 years ago

Is there an existing issue for this?

Current Behavior

When I run the dbt command in my terminal without additional arguments, it takes more than a second to print the help text:

% time dbt
...output...
dbt  1.67s user 0.13s system 98% cpu 1.828 total

Looks like this is a related issue with percieved run performance for specific commands: https://github.com/dbt-labs/dbt-core/issues/4625

Expected Behavior

Basic operations with the dbt cli would take less than one second to run.

Steps To Reproduce

Install dbt:

pip install dbt-core

Run dbt:

% time dbt
...output...
dbt  1.67s user 0.13s system 98% cpu 1.828 total

Relevant log output

This could deflinitely be an issue with my machine or setup. To see if I could find more information I used py-spy to generate a flamegraph using these commands:

git clone 'git@github.com:dbt-labs/dbt-core.git'
cd dbt-core
pip install py-spy
sudo py-spy record -o profile_cli_empty.svg -- python core/scripts/dbt

There is nothing conclusive, but looks like there is a lot of time spent with calls to importlib:

https://gist.githubusercontent.com/ajmarcus/b077f4c8a94582df74c39258325addf5/raw/9af82bb55f5b322fb10bc0a959c20b175e689c3c/profile_cli_empty.svg

Which reminded me that Python eagerly loads imported dependencies. importlib does provide a lazy import loader but it comes with this warning:

https://python.readthedocs.io/en/latest/library/importlib.html#importlib.util.LazyLoader

For projects where startup time is critical, this class allows for potentially minimizing the cost of loading a module if it is never used. For projects where startup time is not essential then use of this class is heavily discouraged due to error messages created during loading being postponed and thus occurring out of context.

Environment

- OS: Mac 12.0.1
- Python: 3.7.8
- dbt: 1.0.1

What database are you using dbt with?

bigquery

Additional Context

profile_cli_empty

ajmarcus commented 2 years ago

Haven't worked with flamegraphs before so took me some time to understand this one. I think a good amount of the runtime is importing the classes from dbt.task that are used to set_defaults for the subparsers:

https://github.com/dbt-labs/dbt-core/blob/v1.0.1/core/dbt/main.py#L18-L33 https://github.com/dbt-labs/dbt-core/blob/v1.0.1/core/dbt/main.py#L358

Since the defaults are only used if a subcommand is being called, was thinking we could improve performance by only importing the relevant dbt.task. For example:

  1. If the user uses the command dbt build we will only import dbt.task.build as build_task
  2. If the user uses the command dbt clean we will only import dbt.task.clean as clean_task
  3. If the user uses the command dbt without a subcommand we will not import any classes from dbt.task

If that sounds like a good approach, I'd be happy to create an example PR. As with any performance optimizations, I'm sure there will be tradeoffs to my approach so no worries if the code is never merged. Curious to learn more about dbt's internals. :)

nathaniel-may commented 2 years ago

Thanks for the detailed analysis @ajmarcus! Anecdotally, I definitely notice this myself when I call dbt --version and I think you're right that we can probably benefit from a quick fix. Thankfully we're planning on making larger changes to the structure of our CLI to make it more reusable which may also solve these performance issues outlined here. Thank you for offering to put up a PR but I think in this case I think we should save the effort since it is likely to be addressed soon anyway. You can follow that work once it starts in earnest on two these tickets:

4179

2013

Because of this, I'm going to close this ticket but explicitly mention it in the other tickets as additional context for why this work is important. Thanks so much for providing all the details here.

ajmarcus commented 2 years ago

Sounds great and makes sense!