dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[Regression] Partial parsing is disabled by "secret" env vars in profile connection info #10571

Closed: fredriv closed this 1 month ago

fredriv commented 2 months ago

Is this a regression in a recent version of dbt-core?

Current Behavior

We're using a "secret" env var (DBT_ENV_SECRET_...) in our dbt profile to avoid a full parse when changing e.g. the Snowflake warehouse between dbt runs, since such a change should not require a full parse.

This stopped working after dbt 1.6.12 due to the change in https://github.com/dbt-labs/dbt-core/pull/9844. dbt now creates a profile hash from all the keys in the connection info, but it does not mask values that come from "secret" env vars as ***** (unlike e.g. dbt debug), so changing a secret value changes the hash.
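
For illustration, here is a minimal Python sketch of the failure mode (the names are hypothetical, not dbt's actual internals): if the profile hash is computed over the raw connection info, any change to a secret-backed value changes the hash, and dbt concludes the profile has changed.

import hashlib
import json

def profile_hash(connection_info: dict) -> str:
    # Hashing the raw values means a changed DBT_ENV_SECRET_* value
    # (e.g. warehouse: small -> large) yields a different hash,
    # which dbt treats as a profile change and answers with a full re-parse.
    payload = json.dumps(connection_info, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

assert profile_hash({"warehouse": "small"}) != profile_hash({"warehouse": "large"})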

Due to the size of our dbt project (2000+ models), this adds around 4-5 minutes of parse time to our dbt runs.

We have tested and found this behavior with dbt 1.6.18, 1.7.18, and 1.8.4.

Expected/Previous Behavior

The issue (https://github.com/dbt-labs/dbt-core/issues/9795) related to the change above mentions the workaround of avoiding full re-parses by using "secret" env vars, and states in its expected behavior that users should retain the ability to opt out of additional fields by setting them to "secret" env vars, the values of which dbt does not track. However, this has not been the case since dbt 1.6.12.

We expect the previous behavior to keep working: being able to avoid full re-parses by using "secret" env vars in the profile connection info.

Steps To Reproduce

With dbt 1.6.12 or a later release, create a profiles.yml with a "secret" env var in one of the connection parameters, e.g.

warehouse: "{{ env_var('DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE', 'small') }}"
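
For concreteness, a minimal profiles.yml sketch (account, user, and password values are placeholders):

my_profile:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: my_account    # placeholder
      user: my_user          # placeholder
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: analytics
      schema: public
      warehouse: "{{ env_var('DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE', 'small') }}"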

Run a dbt command that requires parsing, such as dbt ls -s my_model. Re-run the command but set the environment variable to a different value, e.g.

DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=large dbt ls -s my_model

Observe that this triggers a full parse and prints the message Unable to do partial parsing because profile has changed.

Relevant log output

No response

Environment

- OS: macOS Sonoma 14.5
- Python: 3.11.4
- dbt (working version): 1.6.11
- dbt (regression version): 1.6.12+, 1.7.18, 1.8.4

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

fredriv commented 2 months ago

I tried modifying / adding to the test test_profile_secret_env_vars (link), but I'm not sure the test works as expected 🤔

I added some debug output of the connection info from the build_manifest_state_check method, but it printed the same user: root in both runs inside the test, even though the username secret env var was changed to fake_user for the second run.

fredriv commented 2 months ago

I modified the test class TestProfileSecretEnvVars to work more like TestProfileEnvVars in the same file, making the dbt_profile_target method a fixture (not a property) and adding an environment fixture. Now the debug output showed that the username changed to fake_user in the second run. The test failed as expected due to the profile having changed and the secret env var not being masked when computing the profile hash.
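
Roughly this shape, sketched here for illustration (the real test lives in dbt-core's functional test suite and differs in detail):

import os
import pytest

class TestProfileSecretEnvVars:
    @pytest.fixture(scope="class")
    def dbt_profile_target(self):
        # A fixture (as in TestProfileEnvVars) instead of a property.
        return {
            "type": "postgres",
            # ... host, port, database, etc. ...
            "user": "{{ env_var('DBT_ENV_SECRET_USER') }}",
        }

    @pytest.fixture(scope="class", autouse=True)
    def environment(self):
        os.environ["DBT_ENV_SECRET_USER"] = "root"
        yield
        del os.environ["DBT_ENV_SECRET_USER"]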

fredriv commented 2 months ago

Actually, the test fails before getting to the partial parsing logic, since now it's not able to connect to Postgres with the fake_user 😅

Edit: Fixed this by setting the search path as a secret env var (instead of the username), i.e.

os.environ[SECRET_ENV_PREFIX + "_SEARCH_PATH"] = "public"

fredriv commented 2 months ago

Any ideas or pointers on how to go about fixing this issue? Should it fetch the connection info values from somewhere else that handles masking of the secret env vars? (I assume they have to be in cleartext in the connection info itself if that is used for the DB connections?) Any thoughts @dbeatty10?
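
For instance, the hash computation could scrub secret-backed values first, the same way dbt debug masks them for display. Purely a sketch with hypothetical names, not dbt's actual internals:

import os

SECRET_ENV_PREFIX = "DBT_ENV_SECRET_"  # the prefix dbt uses for secret env vars

def mask_secrets(connection_info: dict) -> dict:
    # Replace any value that equals a secret env var's value with a fixed
    # placeholder, so the profile hash no longer depends on the secret.
    secret_values = {
        v for k, v in os.environ.items()
        if k.startswith(SECRET_ENV_PREFIX) and v
    }
    return {
        k: "*****" if isinstance(v, str) and v in secret_values else v
        for k, v in connection_info.items()
    }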

dbeatty10 commented 2 months ago

Thanks for reaching out @fredriv !

While we recognize that the behavior changed for you, we did make this change intentionally [1, 2], and we won't be reinstating the workaround to avoid full reparses by using "secret" env vars. We've opened https://github.com/dbt-labs/dbt-core/issues/10578 and https://github.com/dbt-labs/dbt-core/issues/10579 as follow-ups.

Is this something that just affects your local development setups? Or does it affect your production runs as well? It would be helpful to hear more detail on why you chose to configure the warehouse dynamically within your profiles.yml rather than using static warehouse values in separate output targets for dev, prod, etc.

See below for a couple alternatives you could try out instead -- let me know what you think!

Alternative approach 1a

You can achieve a similar effect by using the DBT_TARGET_PATH environment variable.

It will allow you to specify where to store separate copies of the partial_parse.msgpack file used for partial parsing.

In your case, you'll want to use separate target directories for each different value of DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE.

Here's how you can use it:

export DBT_TARGET_PATH=artifacts-dev-small
DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=small dbt parse
DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=small dbt ls -s my_model

The ls command should take advantage of partial parsing and be a fast operation.

Then you can switch to use a different partial parsing file like this:

export DBT_TARGET_PATH=artifacts-dev-large
DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=large dbt parse
DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=large dbt ls -s my_model

Alternative approach 1b

You may find it more convenient to switch from the DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE environment variable to using the --target flag instead. It would look like this:

Set up separate output targets per warehouse size:

profiles.yml

YOUR_PROFILE_NAME:
  target: dev
  outputs:
    dev:
      warehouse: large
      # ...
    large:
      warehouse: large
      # ...
    small:
      warehouse: small
      # ...

Then use commands like this to use the desired target (along with the desired path for the partial parsing file):

export DBT_TARGET_PATH=artifacts-dev-large
dbt parse --target large
dbt ls -s my_model --target large

fredriv commented 2 months ago

Thanks for the detailed answer @dbeatty10!

This currently affects both our production environment and CI/CD jobs (and to some degree dev environments).

We have a setup with a single dbt monorepo with about 2,300 models. Model ownership is distributed between ~10 data teams in a "data mesh"-like fashion, and these teams own and run a total of ~80 dbt cron jobs of various sizes and complexity. We use many differently sized Snowflake warehouses for these dbt jobs to optimize cost vs. performance, all the way from X-Small to 3X-Large.

To avoid full re-parses on each dbt job run, we have a GitHub Action which runs in production and generates the dbt manifest and partial_parse.msgpack upon each merge to the main branch of our Git repository. These artifacts are stored in a Google Cloud Storage bucket and later used as input to the dbt cron jobs mentioned above.
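
A minimal sketch of that kind of workflow (bucket name and action versions are illustrative, and it assumes gcloud auth is already configured on the runner):

# .github/workflows/refresh-dbt-artifacts.yml -- illustrative sketch
name: refresh-dbt-artifacts
on:
  push:
    branches: [main]
jobs:
  parse:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake
      - run: dbt parse  # writes target/manifest.json and target/partial_parse.msgpack
      - run: gsutil cp target/manifest.json target/partial_parse.msgpack gs://my-dbt-artifacts/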

The secret env var for setting the Snowflake warehouse allowed us to do this across these various cron jobs while avoiding a full project parse, which would add another ~5 minutes to the job run. We also do something similar in our CI/CD jobs and for running ad hoc dbt jobs.

Would setting up something similar using separate targets (approach 1b) then mean specifying all combinations of environments (prod/qa/dev) and warehouse sizes as separate targets, as sketched below? This would also require generating, storing, and keeping track of many more dbt artifacts (in the GitHub Action and dbt cron jobs mentioned above), in addition to a lot of duplicated Snowflake configuration in profiles.yml.
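E.g. something like this (names illustrative), growing multiplicatively with environments times warehouse sizes:

# profiles.yml -- illustrative: one output per (environment, warehouse size) pair
my_profile:
  outputs:
    prod-xsmall:
      warehouse: XSMALL
      # ... duplicated connection settings ...
    prod-3xlarge:
      warehouse: X3LARGE
      # ...
    qa-xsmall:
      warehouse: XSMALL
      # ...
    # ... and so on for every combination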

dbeatty10 commented 2 months ago

Are you using the snowflake_warehouse config at all? It can be configured in a few different places:

  1. dbt_project.yml
  2. properties YAML config
  3. model SQL config

There's some more description of various options here.

1. dbt_project.yml

dbt_project.yml

models:
  my_project:
    +snowflake_warehouse: "EXTRA_SMALL"    # default snowflake virtual warehouse for all models in the project
    clickstream:
      +snowflake_warehouse: "EXTRA_LARGE"  # override the default snowflake virtual warehouse for all models under the `clickstream` directory

2. Properties YAML config

models/_models.yml

models:
  - name: my_model
    config:
      snowflake_warehouse: "EXTRA_LARGE"    # override the snowflake virtual warehouse for just this model

3. Model SQL config

models/my_model.sql

-- override the snowflake virtual warehouse for just this model
{{
  config(
    snowflake_warehouse='EXTRA_LARGE'
  )
}}

-- ...

fredriv commented 1 month ago

(Re-)made a PR to fix the slow full parses: https://github.com/dbt-labs/dbt-common/pull/189

So if that gets merged and released, we should no longer need the workarounds described above to avoid full parses 🙂

dbeatty10 commented 1 month ago

@fredriv Thanks for opening https://github.com/dbt-labs/dbt-common/pull/189 to resolve https://github.com/dbt-labs/dbt-core/issues/9037 🤩

I'm going to close this one as not planned since we intentionally made the related change.