I tried modifying / adding to the test test_profile_secret_env_vars (link), but I'm not sure the test works as expected 🤔
I added some debug output of the connection info from the build_manifest_state_check method, but it showed the same user (root) in both runs inside the test, even though the username secret env var was changed to fake_user for the second run.
I modified the test class TestProfileSecretEnvVars to work more like TestProfileEnvVars in the same file, making the dbt_profile_target method a fixture (not a property) and adding an environment fixture. Now the debug output showed that the username changed to fake_user in the second run, and the test failed as expected, since the profile had changed and the secret env var was not masked when computing the profile hash.
Actually, the test fails before getting to the partial parsing logic, since now it's not able to connect to Postgres with the fake_user 😅
Edit: Fixed this by setting the search path as a secret env var (instead of the username), i.e. os.environ[SECRET_ENV_PREFIX + "_SEARCH_PATH"] = "public".
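Roughly, the relevant parts of the modified test class now look something like this (a simplified sketch from memory; the actual fixtures, constants and profile values in dbt-core's test suite may differ):

```python
import os
import pytest

# Assumption: dbt treats env vars starting with this prefix as "secret";
# the real constant lives in dbt's own code and is imported by the test.
SECRET_ENV_PREFIX = "DBT_ENV_SECRET"


class TestProfileSecretEnvVars:
    @pytest.fixture(scope="class", autouse=True)
    def environment(self):
        # Use the search path (not the username) as the secret env var, so the
        # test can still connect to Postgres after the value changes.
        os.environ[SECRET_ENV_PREFIX + "_SEARCH_PATH"] = "public"
        yield
        del os.environ[SECRET_ENV_PREFIX + "_SEARCH_PATH"]

    @pytest.fixture(scope="class")
    def dbt_profile_target(self):
        # A fixture (not a property), mirroring TestProfileEnvVars in the same file.
        # Connection values below are illustrative placeholders.
        return {
            "type": "postgres",
            "threads": 1,
            "host": "localhost",
            "port": 5432,
            "user": "root",
            "pass": "password",
            "dbname": "dbt",
            "search_path": "{{ env_var('" + SECRET_ENV_PREFIX + "_SEARCH_PATH') }}",
        }
```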
Any ideas or pointers on how to go about fixing this issue? Should the profile hash computation fetch the connection info values from somewhere else that handles masking of the secret env vars? (I assume they have to be in cleartext in the connection info itself if that is what's used for the DB connections?) Any thoughts @dbeatty10?
Thanks for reaching out @fredriv !
While we recognize that the behavior changed for you, we did make this change intentionally [1, 2], and we won't be reinstating the workaround to avoid full reparses by using "secret" env vars. We've opened https://github.com/dbt-labs/dbt-core/issues/10578 and https://github.com/dbt-labs/dbt-core/issues/10579 as follow-ups.
Is this something that just affects your local development setups, or does it affect your production runs as well? It would be helpful to hear more detail about why you chose to configure warehouse dynamically within your profiles.yml rather than using static warehouse values in separate output targets for dev, prod, etc.
See below for a couple alternatives you could try out instead -- let me know what you think!
Option 1a: You can achieve a similar effect by using the DBT_TARGET_PATH environment variable. It allows you to specify where to store separate copies of the partial_parse.msgpack file used for partial parsing. In your case, you'll want to use a separate target directory for each different value of DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE.
Here's how you can use it:
export DBT_TARGET_PATH=artifacts-dev-small
DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=small dbt parse
DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=small dbt ls -s my_model
The ls command should take advantage of partial parsing and be a fast operation.
Then you can switch to use a different partial parsing file like this:
export DBT_TARGET_PATH=artifacts-dev-large
DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=large dbt parse
DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=large dbt ls -s my_model
Option 1b: You may find it more convenient to switch from the DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE environment variable to the --target flag instead. It would look like this:
Set up separate output targets per warehouse size:
profiles.yml
YOUR_PROFILE_NAME:
  target: dev
  outputs:
    dev:
      warehouse: large
      # ...
    large:
      warehouse: large
      # ...
    small:
      warehouse: small
      # ...
Then run commands like these to use the desired target (along with the desired path for the partial parsing file):
export DBT_TARGET_PATH=artifacts-dev-large
dbt parse --target large
dbt ls -s my_model --target large
Thanks for the detailed answer @dbeatty10!
This currently affects both our production environment and CI/CD jobs (and to some degree dev environments).
We have a setup with a single dbt monorepo with about 2,300 models. Model ownership is distributed among ~10 data teams in a "data mesh"-like fashion, and these teams own and run a total of ~80 dbt cron jobs of various sizes and complexity. We use a lot of differently sized Snowflake warehouses for these dbt jobs to optimize cost vs. performance, all the way from X-Small to 3X-Large.
To avoid full re-parses for each dbt job run, we have a GitHub action which runs in production and generates the dbt manifest and partial_parse.msgpack upon each merge to the main branch in our Git repository. These artifacts are stored in a Google Cloud Storage bucket and later used as input to the dbt cron jobs mentioned above.
The secret env var for setting the Snowflake warehouse allowed us to do this across these various cron jobs while avoiding a full project parse, which would add another ~5 minutes to the job run. We also do something similar in our CI/CD jobs and for running ad hoc dbt jobs.
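To make that concrete, each cron job does roughly the following before invoking dbt (a simplified sketch; the bucket name, selector and warehouse size are placeholders):

```sh
# Simplified sketch of one dbt cron job (bucket name, selector and warehouse size are placeholders)
export DBT_TARGET_PATH=target
mkdir -p "$DBT_TARGET_PATH"

# Fetch the pre-built partial parse file produced by the GitHub action on merge to main
gsutil cp gs://our-dbt-artifacts/partial_parse.msgpack "$DBT_TARGET_PATH/partial_parse.msgpack"

# Choose the warehouse size for this job via the "secret" env var, so the changed
# profile value does not invalidate partial parsing (pre-1.6.12 behavior)
export DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=X-LARGE
dbt run --select tag:my_team_daily
```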
Setting up something similar using separate targets (approach 1b) would then mean specifying all combinations of environments (prod/qa/dev) and warehouse sizes as separate targets? This would also require generating, storing and keeping track of a lot more dbt artifacts (in the GitHub action + dbt cron jobs mentioned above), in addition to a lot of duplicate Snowflake configuration in profiles.yml; see the sketch below.
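For illustration, just two environments and two warehouse sizes would already look something like this (a sketch; the profile name and connection details are placeholders):

```yaml
# Sketch: every environment x warehouse-size combination becomes its own output target,
# and the rest of the Snowflake connection config is duplicated each time
our_profile:
  target: prod_xlarge
  outputs:
    prod_small:
      type: snowflake
      warehouse: SMALL
      # account, user, role, database, schema, ... repeated here
    prod_xlarge:
      type: snowflake
      warehouse: XLARGE
      # ... repeated here
    qa_small:
      type: snowflake
      warehouse: SMALL
      # ... repeated here
    qa_xlarge:
      type: snowflake
      warehouse: XLARGE
      # ... repeated here
```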
Are you using the snowflake_warehouse config at all? It can be configured in a few different places; there's some more description of the various options here.

dbt_project.yml
models:
  my_project:
    +snowflake_warehouse: "EXTRA_SMALL" # default snowflake virtual warehouse for all models in the project
    clickstream:
      +snowflake_warehouse: "EXTRA_LARGE" # override the default snowflake virtual warehouse for all models under the `clickstream` directory
models/_models.yml
models:
  - name: my_model
    config:
      snowflake_warehouse: "EXTRA_LARGE" # override the snowflake virtual warehouse for just this model
models/my_model.sql
-- override the snowflake virtual warehouse for just this model
{{
    config(
        snowflake_warehouse='EXTRA_LARGE'
    )
}}
-- ...
(Re-)made a PR to fix the slow full parses: https://github.com/dbt-labs/dbt-common/pull/189
So if that gets merged and released, we should no longer need the workarounds described above to avoid full parses 🙂
@fredriv Thanks for opening https://github.com/dbt-labs/dbt-common/pull/189 to resolve https://github.com/dbt-labs/dbt-core/issues/9037 🤩
I'm going to close this one as not planned since we intentionally made the related change.
Is this a regression in a recent version of dbt-core?
Current Behavior
We're using a "secret" env var (DBT_ENV_SECRET_...) in our dbt profile to avoid a full parse when changing e.g. the Snowflake warehouse between dbt runs, a change which should not require a full parse.
This stopped working after dbt 1.6.12 due to the change in https://github.com/dbt-labs/dbt-core/pull/9844. dbt now creates a profile hash of all the keys in the connection info, but does not mask values that come from "secret" env vars into ***** (unlike e.g. dbt debug).
Due to the size of our dbt project (2000+ models), this adds around 4-5 minutes of parse time to our dbt runs.
We have tested and observed this behavior with dbt 1.6.18, 1.7.18 and 1.8.4.
Expected/Previous Behavior
The issue (https://github.com/dbt-labs/dbt-core/issues/9795) related to the change above mentions the workaround of avoiding full reparses by using "secret" env vars, and states in its expected behavior that "Users should retain the ability to "opt out" additional fields by setting them to "secret" env vars, the values of which dbt does not track". However, this is no longer the case since dbt 1.6.12.
We expect the previous functionality to keep working, i.e. being able to avoid full reparses by using "secret" env vars in the profile connection info.
Steps To Reproduce
1. With dbt 1.6.12 or a later release, create a profiles.yml with a "secret" env var in one of the connection parameters, e.g. as in the sketch below.
2. Run a dbt command that requires parsing, such as dbt ls -s my_model.
3. Re-run the command, but set the environment variable to a different value first (see the second snippet below).
4. Observe that this triggers a full parse and prints the message "Unable to do partial parsing because profile has changed".
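For example (a sketch; the profile name and the other connection parameters are placeholders):

```yaml
# profiles.yml (sketch): warehouse taken from a "secret" env var
my_profile:
  target: dev
  outputs:
    dev:
      type: snowflake
      warehouse: "{{ env_var('DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE') }}"
      # account, user, role, database, schema, ...
```

```sh
export DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=small
dbt ls -s my_model   # first run parses the project

export DBT_ENV_SECRET_SNOWFLAKE_WAREHOUSE=large
dbt ls -s my_model   # on dbt >= 1.6.12 this now triggers a full re-parse
```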
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
snowflake
Additional Context
No response