dbt-labs / dbt-snowflake

dbt-snowflake contains all of the code enabling dbt to work with Snowflake
https://getdbt.com
Apache License 2.0
286 stars 175 forks

[CT-333] Macro for warehouse selection #103

Open javiCalvo opened 2 years ago

javiCalvo commented 2 years ago

Describe the feature

It would be nice to have a macro similar to generate_schema_name for selecting the Snowflake warehouse to use, for example to change the warehouse depending on the environment.
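For context, dbt's documented default generate_schema_name looks roughly like the sketch below; the request here is for an analogous, overridable hook for warehouse selection:

```jinja
{# default generate_schema_name, per the dbt docs: combines the target
   schema with an optional custom schema from model config #}
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```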

Describe alternatives you've considered

I tried to do a pre-hook with a use warehouse query, but it doesn't give the flexibility needed; for example, if you use a wildcard in the name, all the queries that don't use the pre-hook will fail.

Who will this benefit?

Everyone who wants more control and flexibility over the warehouse selected.

Are you interested in contributing this feature?

Yes, I have some ideas about how to do it. I've run some local tests and it's working. With some help, as this is my first contribution, I should be able to raise a PR.

VersusFacit commented 2 years ago

Hi @javiCalvo. Appreciate the issue submission and the PR!

Just to be clear, 'warehouse' here refers to "virtual warehouse," right? Want to make sure we define our terms at the outset!

I tried to do a pre-hook with a use warehouse query, but it doesn't give the flexibility needed; for example, if you use a wildcard in the name, all the queries that don't use the pre-hook will fail.

I'm hoping I read this right. Although I'm familiar with both terms, I feel like I'm missing some context as to what "wildcard" and "pre hook" refer to in this precise case.

I was scratching my head about what could possibly make this more flexible. I can't necessarily give you Bash-style wildcard matching, nor does Jinja2 have a built-in filter for matching full regexes. But you can achieve fairly flexible warehouse selection based on the specified target, like so.

We have a built-in feature for this. Consider the following model configuration option: +snowflake_warehouse.

models:
  +snowflake_warehouse: "EXTRA_SMALL"
  my_project:
    clickstream:
      +snowflake_warehouse: "EXTRA_LARGE"

For added flexibility in selecting the warehouse, you can use some house-favorite Jinja. This lets you build something like so:

+snowflake_warehouse: |
     {%- if false -%} any_boolean_can_be_used
     {%- elif target.name == "qa" -%} use_any_custom_target_string
     {%- elif "substring" in target.name -%} substring_matching
     {#- any boolean you can achieve with Jinja -#}
     {%- else -%} you_even_get_a_default_option
     {%- endif -%}

Lastly, you can use the automatically loaded Python re module (exposed as modules.re) to get a true regex match:

{% set my_string = 's3://example/path' %}
{% set s3_path_pattern = 's3://[a-z0-9-_/]+' %}
{# match() returns a match object or none (truthy/falsy), so the result can be used in the pattern above #}
{% set is_match = modules.re.match(s3_path_pattern, my_string, modules.re.IGNORECASE) %}
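Combining the two snippets above, a hedged sketch of regex-driven selection inside the config value might look like this (the target and warehouse names are illustrative assumptions, not from this thread):

```jinja
+snowflake_warehouse: |
    {#- illustrative: route any ci_* target to a small warehouse -#}
    {%- if modules.re.match('ci_.*', target.name) -%} WH_CI_XS
    {%- elif target.name == 'prod' -%} WH_PROD_L
    {%- else -%} WH_DEV_XS
    {%- endif -%}
```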

I did some tests and this value does in fact change the warehouse local variable in dbt/adapters/snowflake/impl.py:pre_model_hook, which calls self._use_warehouse.

This model configuration tag can be set with any level of desired variance for any singular model or group of models in your project hierarchy.

Therefore, from where I'm currently standing, I feel like this issue is already addressed by this built-in feature.

I'm going to lean toward wont-fix and intend to close it. But if I've missed something along the way in your reasoning/use-case, please let me know and I'll reassess that tagging :) Looking forward to your followup!

javiCalvo commented 2 years ago

Hi @VersusFacit ! Thanks a lot for the answer.

Just to be clear, 'warehouse' here refers to "virtual warehouse," right? Want to make sure we define our terms at the outset!

Yes, sorry for the ambiguity I was talking about snowflake's virtual warehouse.

We are already using the +snowflake_warehouse tag all over our dbt_project.yml, and we can insert the Jinja code there as you suggest (in fact, we are already doing it).

The improvement is precisely to avoid having so much Jinja code in the dbt_project.yml, especially repeated code. We have a project involving 12 Snowflake databases, and we are using around 60 different virtual warehouses split across 3 different environments. Right now our dbt_project.yml has the +snowflake_warehouse tag 9 times and growing, so a macro like the one I suggest would give us a more controlled project.

Example of what we have now in the dbt_project.yml:

+snowflake_warehouse: "WH_{{ {'stg': 'D', 'tst': 'Q', 'prd': 'P'}[target.name] }}_IMPORT_XL"

This structure works, but it makes our dbt_project.yml somewhat unpretty :)
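For illustration, the kind of project-level macro being requested could centralize that repeated mapping in one place (the macro name and naming scheme below are hypothetical):

```jinja
{# hypothetical sketch: one macro instead of 9+ inline Jinja snippets #}
{% macro generate_warehouse_name(base, size) %}
    {% set env_letter = {'stg': 'D', 'tst': 'Q', 'prd': 'P'}[target.name] %}
    {{ return('WH_' ~ env_letter ~ '_' ~ base ~ '_' ~ size) }}
{% endmacro %}
```

The idea being that +snowflake_warehouse: "{{ generate_warehouse_name('IMPORT', 'XL') }}" could replace each inline mapping; custom macros are not currently resolvable from dbt_project.yml config values, which is the gap this issue asks to close.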

jtcohen6 commented 2 years ago

I saw that https://github.com/dbt-labs/dbt-snowflake/pull/115 is still open, and then came across this discussion to read more.

Everything said above is super valid! After some more reflection, and time spent looking at https://github.com/dbt-labs/dbt-snowflake/pull/115, I'm going to reopen this issue for two reasons:

@javiCalvo If this is something you'd still be interested in contributing, I'd be open to it. I'll leave a comment or two on the PR with recommendations for the implementation.

javiCalvo commented 2 years ago

Sure @jtcohen6! I'm still interested in this. I will review your PR comments and proceed that way. Thanks!

tjwaterman99 commented 2 years ago

+1 for this.

For anyone looking for a workaround, we ended up using a custom macro that gets called in the model config.

config(
  materialized='incremental',
  snowflake_warehouse=get_warehouse(size='xxl'),
  ...
)

The macro was similar to this:

{% macro get_warehouse(size=none) %}
{% if target.name not in ('prod', '...') %}
    {% do return(target.warehouse) %}
{% elif size == none %}
    {% do return(target.warehouse) %}
{% else %}
    {% set warehouse = "transforming_prod_" ~ size | lower %}
    {% do return(warehouse) %}
{% endif %}
{% endmacro %}

So this enabled models to "opt in" to the warehouse size they need by calling the get_warehouse macro, while models that didn't use the macro continued to run on the default warehouse from profiles.yml.

kamilamarcinekpgs commented 1 year ago

It would be awesome to have such a macro, and be able to choose the warehouse depending on:

McKnight-42 commented 1 year ago

@javiCalvo closing, as this is covered by PR #503

Fleid commented 1 year ago

Re-opening, as we've reverted the PR solving it; the implementation was causing side effects.

manugarri commented 1 year ago

@Fleid as requested, I'm sharing my last comment from https://github.com/dbt-labs/dbt-snowflake/issues/533 to keep track of the bug:


My original intention was to set up a macro that selects different warehouses depending on whether an analyst is testing dbt locally versus automated runs on Airflow.

We use environment variables in our profiles.yml:

snowflake:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: "{{ env_var('SNOWFLAKE_DATABASE') }}"
      warehouse: "{{ env_var('SNOWFLAKE_WAREHOUSE') }}"
      role: "{{ env_var('SNOWFLAKE_ROLE') }}"
      authenticator: "{{ env_var('SNOWFLAKE_AUTHENTICATOR') }}"
      schema: public

My original implementation for the macro was something like this:

{%- macro snowflake_warehouse(warehouse) -%}
    {{ log("SELECTING SNOWFLAKE_WAREHOUSE:" ~ warehouse ~ " FOR USER: " ~ env_var('SNOWFLAKE_USER') , True) }}

    {% set warehouses_specs = {
            "service_user": {
                "MEDIUM": "DBT_MED",
                "LARGE": "DBT_LARGE"
            },
            "human_user": {
                "MEDIUM": "LOCAL_MED",
                "LARGE": "LOCAL_LARGE"
            }
        }
    %}
    {# we use SSO for Snowflake #}
    {% if "@" in env_var('SNOWFLAKE_USER')  %}
        {% if warehouse in warehouses_specs["human_user"] %}
            {{ return(warehouses_specs["human_user"][warehouse]) }}
        {% else %}
            {{ return(warehouse) }}
        {% endif %}
    {% else %}
        {% if warehouse in warehouses_specs["service_user"] %}
            {{ return(warehouses_specs["service_user"][warehouse]) }}
        {% else %}
            {{ return(warehouse) }}
        {% endif %}
    {% endif %}
{%- endmacro %}

This was failing with the error I mentioned.

In the end, a very helpful person on the dbt Slack suggested replacing env_var('SNOWFLAKE_USER') with target.user, which solves my current needs.

But still, it sounds like allowing environment variables in this macro, like in any other macro, would be useful for other use cases.

Another use case just arose at my company: we want different warehouses per environment, but a single profiles.yml (since the prod dbt runs on Docker ECS, while local runs are done by multiple analysts). If we had an environment variable defining the environment dbt is running in, we could include it in the mapping function I shared above.
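A sketch of that idea, assuming a hypothetical DBT_ENV environment variable (all names here are illustrative, and it would be subject to the same env_var limitation described above):

```jinja
{# hypothetical: DBT_ENV distinguishes prod (ECS) runs from local analyst runs #}
{%- macro snowflake_warehouse(warehouse) -%}
    {% set env = env_var('DBT_ENV', 'local') %}
    {% set warehouses_specs = {
        'prod':  {'MEDIUM': 'DBT_MED',   'LARGE': 'DBT_LARGE'},
        'local': {'MEDIUM': 'LOCAL_MED', 'LARGE': 'LOCAL_LARGE'}
    } %}
    {# fall back to the requested name when no mapping exists #}
    {{ return(warehouses_specs.get(env, {}).get(warehouse, warehouse)) }}
{%- endmacro -%}
```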

manugarri commented 1 year ago

Is there any way I can help get this feature developed? At my company we are on the verge of needing this feature, and I would rather use a standard process than a custom one. Thanks!

Fleid commented 1 year ago

Hi @manugarri, the topic is thornier than it looks. We will have a deeper look shortly to find an alternative approach, but I'm not sure yet when we'll get to solving it out of the box. We may have a good workaround available soon, though.

epapineau commented 1 year ago

@Fleid is there any update on this issue? We're currently running a full refresh on our biggest model and it reminded me that our snowflake_warehouse configuration for that model is inappropriately small for a full refresh.

Fleid commented 1 year ago

Nope, we had to lower the priority to get all of the materialized view implementations through the door. @dataders could that be a thing we try to put back on the table for Q4? The spike, at least.

dataders commented 1 year ago

linking to a community Slack thread in #db-snowflake

a087861 commented 11 months ago

Following as this functionality would be very beneficial

manugarri commented 11 months ago

looking forward to it!

epapineau commented 11 months ago

huge~

manugarri commented 9 months ago

@Fleid any updates on this ?

mroy-seedbox commented 3 weeks ago

On our side, we just specify which warehouse to use for each run using a variable in profiles.yml:

warehouse: "{{ var('warehouse', 'default_warehouse') }}"

So if we need to do a full refresh on a large table, we just need to do: dbt run ... --vars '{warehouse: large_warehouse}'

The default can also come from an env var: warehouse: "{{ var('warehouse', env_var('SNOWFLAKE_WAREHOUSE', 'default_warehouse')) }}"

Bonus: no USE WAREHOUSE ... swaps between models during the run.

Downsides:

mroy-seedbox commented 3 weeks ago

Here's another idea to make this feature much more flexible and benefit other use cases as well:

Rather than creating yet another macro specific to this one use case, we could instead add the ability to use macros in dbt_project.yml. That would then allow us to configure our models like this (as mentioned by @VersusFacit, but using a macro instead):

models:
  +snowflake_warehouse: "{{ get_snowflake_warehouse() }}"

However, the assumption here is that get_snowflake_warehouse() would only be called once (rather than for each model), and the result stored/reused for everything.

In order to overcome that, we could consider adding another modifier (maybe another +?) in order to tell DBT to re-evaluate the config/macro for every single model (or source, etc.):

models:
  ++snowflake_warehouse: "{{ get_snowflake_warehouse() }}"

That way, the get_snowflake_warehouse() macro would be called for each model, as if it were in a config block (kinda like @tjwaterman99 is doing, but without all the copy-paste), and it would be able to access the model variable inside the macro in order to do things differently for every single model. Welcome, programmatic warehouse selection (and so much more)! 🎉 👋

To give an even better idea of how this would work, the following would basically be the equivalent of the ability to override macros which currently exists (but without having to override anything):

models:
  ++database: "{{ generate_database_name() }}"
  ++schema: "{{ generate_schema_name() }}"
  ++alias: "{{ generate_alias_name() }}"

And what is even nicer with this approach is that we can use different overrides for different parts of the config hierarchy, whereas macro overrides are inherently global by obligation:

models:
  ++alias: "{{ generate_alias_name() }}"
  project:
    special_db:
      ++alias: "{{ generate_special_alias_name() }}"

Obviously, this would require macros to be parsed first (before the configurations for models/sources/etc.). But it's not like macros have dependencies/refs on models (it's usually the other way around). And even if they do, this feature could be limited to a special "project macros" context, just like other dbt contexts (i.e. macros which have access to less stuff, namely cannot use ref()). However, these macros wouldn't even be executed when dbt_project.yml is parsed (just like generate_database/schema/alias_name() currently), but rather when each model is being parsed, which gives some extra time for the context to be more fully built.

One challenge though is around the order of execution. Right now (override or not), generate_database_name() is called first, then generate_schema_name(), and then generate_alias_name(). This means that the generated database name is available in generate_schema_name(), and both of those are available in generate_alias_name(). But when would get_snowflake_warehouse() be called? Does it need to know the database/schema/alias? Or does the database/schema/alias need to know the warehouse (probably not)? Anyway... something to think about. It could probably be solved by dependency resolution and circular dependency detection, although that is by no means a simple matter. Or maybe just generally avoid having configs depend on other configs (with a few exceptions, like for generate_database/schema/alias_name()), because that's really confusing (and also not really possible at the moment anyway).

manugarri commented 2 weeks ago

@mroy-seedbox thanks for the input. I don't think this feature is going to happen; there's been silence for almost a year :(