dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Feature] CLI Parameter for `packages-install-path` #9932

Closed stevenayers closed 1 week ago

stevenayers commented 3 months ago

Is this your first time submitting a feature request?

Describe the feature

Add a CLI parameter for the packages-install-path, similar to how target-path has one.

In the docs, under target-path, it says:

Just like other global configs, it is possible to override these values for your environment or invocation by using the CLI option (--target-path) or environment variables (DBT_TARGET_PATH).

Describe alternatives you've considered

Using the env var DBT_PACKAGES_INSTALL_PATH.

The issue here is that some orchestration tools, such as Databricks DBT Workflows, make setting environment variables very difficult. Adding this CLI parameter would also keep packages-install-path consistent with the other global configs.

Who will this benefit?

People using orchestration tools with awkward limitations.

Are you interested in contributing this feature?

Yes, the PR is https://github.com/dbt-labs/dbt-core/pull/9933

dbeatty10 commented 3 months ago

Thanks for opening this @stevenayers !

Can you share more about the specific use cases where a CLI flag plus an environment variable is necessary or beneficial, versus just setting the packages-install-path configuration in dbt_project.yml?

stevenayers commented 3 months ago

Hi @dbeatty10, sure no problem! Let me break this down a bit.

Hardcoding packages-install-path

1. When Docker containers are used, a hardcoded path can cause difficulties. I won't go into too much detail because it's been documented quite well in this issue https://github.com/dbt-labs/dbt-core/issues/1710.

2. With a lot of orchestration/workflow systems, each step runs in its own working directory rather than sharing the previous step's, and those directories are often dynamic. Take this pipeline as an example:

  graph LR;
      A[dbt debug]-->B[dbt run];
      B-->C[dbt test];
      C-->D[dbt docs generate];

Each working directory could look something like /tmp/job-id/step-id

With this, you don't want to be re-installing your deps at every stage, and likely want to reuse them. This is where, like in issue #1710, you will want to use an environment variable like:

config-version: 2
packages-install-path: "{{ env_var('DBT_PACKAGES_INSTALL_PATH', 'dbt_packages') }}"

You could set packages-install-path: "../dbt_packages", but that hardcodes an assumption about the directory layout, when you sometimes need shell script logic to figure out what that path needs to be.
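As a rough illustration of that path logic, here is a minimal Python sketch. It assumes the /tmp/job-id/step-id layout from the example above, so the job directory is simply the parent of the step's working directory. The function name shared_packages_path is hypothetical, and a real orchestrator may lay out its directories differently.

```python
from pathlib import PurePosixPath


def shared_packages_path(step_dir: str, dirname: str = "dbt_packages") -> str:
    """Return one packages directory shared by every step of a job.

    Assumption (not from dbt itself): step working directories look like
    <root>/<job-id>/<step-id>, so the job directory is the parent of step_dir.
    """
    job_dir = PurePosixPath(step_dir).parent
    return str(job_dir / dirname)
```

Every step of the same job resolves to the same directory, so the result could be exported as DBT_PACKAGES_INSTALL_PATH before calling dbt and picked up by the env_var() lookup in dbt_project.yml.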

3. Say you have set packages-install-path to /tmp/my_custom_packages_path so it can be shared between steps. What if you're also running your CI/CD test pipeline in that environment?

Your feature branch changes packages.yml, so installing its deps updates the package contents in /tmp/my_custom_packages_path. Your live data pipeline is in the middle of a run, and its next step fails because the feature branch has removed packages the live pipeline was using.

This is where you'll want to do something like:

config-version: 2
packages-install-path: "{{ env_var('DBT_PACKAGES_INSTALL_PATH', 'dbt_packages') }}"

and in your pipeline you'll want to set DBT_PACKAGES_INSTALL_PATH to something like /tmp/${ENVIRONMENT}/dbt_packages.
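For pipelines that launch dbt from a script, the per-environment path could be set only for the child process. The sketch below is one possible way to do that in Python. It assumes dbt_project.yml reads DBT_PACKAGES_INSTALL_PATH through env_var() as in the snippet above; the helper names dbt_env and run_dbt are hypothetical.

```python
import os
import subprocess


def dbt_env(environment, base_env=None):
    """Copy the environment and add a per-environment packages path.

    The /tmp/<environment>/dbt_packages layout follows the example in this
    thread; adjust it for your own pipeline.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DBT_PACKAGES_INSTALL_PATH"] = f"/tmp/{environment}/dbt_packages"
    return env


def run_dbt(args, environment):
    """Run a dbt command with the per-environment variable set."""
    return subprocess.run(["dbt", *args], env=dbt_env(environment), check=True)
```

With this, run_dbt(["deps"], "ci") and run_dbt(["deps"], "prod") would install into separate directories, so a CI run cannot remove packages from under a live run.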

Flag vs env var for packages-install-path

As I mentioned in the original issue, sometimes setting an environment variable can be a pain in some workflow systems. This also isn't very consistent or clean: DBT_PACKAGES_INSTALL_PATH=/tmp/${ENVIRONMENT}/dbt_packages dbt run --target-path /tmp/${ENVIRONMENT}/target

You're setting config paths via two different methods.
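Until such a flag exists, one workaround is a thin wrapper that accepts the path as a CLI argument and turns it into the environment variable before calling dbt. The Python sketch below is only an illustration: the --packages-install-path option is parsed by the wrapper itself, not by dbt, the function names are hypothetical, and it assumes dbt_project.yml reads DBT_PACKAGES_INSTALL_PATH through env_var().

```python
import argparse
import os
import subprocess


def split_args(argv):
    """Separate the wrapper-only option from the arguments meant for dbt."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    # Handled by this wrapper only; dbt never sees this option.
    parser.add_argument("--packages-install-path", default=None)
    known, dbt_args = parser.parse_known_args(argv)
    return known.packages_install_path, dbt_args


def main(argv):
    """Run dbt with the path passed as an environment variable."""
    packages_path, dbt_args = split_args(argv)
    env = dict(os.environ)
    if packages_path is not None:
        env["DBT_PACKAGES_INSTALL_PATH"] = packages_path
    return subprocess.run(["dbt", *dbt_args], env=env).returncode
```

Calling main(["run", "--packages-install-path", "/tmp/ci/dbt_packages", "--target-path", "/tmp/ci/target"]) would pass only run --target-path /tmp/ci/target to dbt, so both paths are at least written the same way at the call site.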

dbeatty10 commented 1 week ago

Yesterday @jtcohen6 and I had a chance to discuss the proposed new CLI flag + environment variable.

Summary

The general case

We've approached where flags can be set differently depending on use-case:

So generally, we don't let these be set in both places, and it would take a really compelling case for us to do so.

This specific case

In this case, it sounds like the main barrier is that setting environment variables is difficult within Databricks DBT Workflows. If this is the primary barrier, then we'd prefer not to add a new feature to dbt in order to work around it.

So we're closing this and the associated PR in https://github.com/dbt-labs/dbt-core/pull/9933 as not planned.

But if anyone can provide additional examples of why we should consider supporting a new --packages-install-path CLI flag (and associated DBT_PACKAGES_INSTALL_PATH environment variable) outside of Databricks DBT Workflows, we'd be willing to take another look.