dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.61k stars 1.59k forks

[CT-260] [Bug] `generate_database_name` does not apply to sources #4753

Closed: alexrosenfeld10 closed this issue 6 months ago

alexrosenfeld10 commented 2 years ago

Is there an existing issue for this?

Current Behavior

When using sources, the generate_database_name override doesn't apply to sources.

Expected Behavior

Sources would take the overridden database name, just like models.

This is either a significant oversight, or it's intentional for some reason. It's quite a blocker for me, though. I've tried to work around it by adding some Jinja to my sources yaml block, but no luck so far. I've even tried calling macros from there, which I don't think is possible.

I suppose leaving database explicitly unset might be an acceptable workaround, since the source would then take the database of the session, which most of the time would be correct.
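As a sketch, that fallback just means omitting the key (source and table names here are illustrative):

```yaml
sources:
  - name: foo            # illustrative source name
    # database intentionally omitted: dbt falls back to target.database
    # from the active profile/session
    schema: foo
    tables:
      - name: a_table
```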

Steps To Reproduce

  1. Set up a dbt project
  2. Configure a source and a model that consumes the source.
  3. Add a generate_database_name override macro. Here's an example:
{% macro generate_database_name(custom_database_name=none, node=none) -%}

  {%- if custom_database_name is not none and node.resource_type in ['seed', 'model'] -%}
    {%- set error_message -%}
      {{ node.resource_type | capitalize }} '{{ node.unique_id }}' has a database configured. This is not allowed because it will be derived from the directory structure and target name.
    {%- endset -%}
    {{ exceptions.raise_compiler_error(error_message) }}
  {%- endif -%}

  {% set path = node.path %}
  {% set database = path.split('/')[0] %}
  {% if target.name in ['prod_us', 'prod_eu'] %}
    {{ database }}
  {% else %}
    {{ database + "_" + target.name }}
  {% endif %}

{%- endmacro %}
  4. Run dbt compile
  5. The compiled models will reference whatever database is set in your profile, instead of the one from the override macro
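To make the symptom concrete, here is a minimal sketch (all paths and names illustrative) of a model consuming the source, with the compiled output one observes versus expects:

```sql
-- models/reporting/staging/stg_a_table.sql (illustrative path and names)
select * from {{ source('foo', 'a_table') }}

-- compiled output observed: database comes from the profile,
--   e.g. profile_database.foo.a_table
-- compiled output expected: database from generate_database_name,
--   e.g. reporting_dev.foo.a_table
```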

Relevant log output

No response

Environment

- OS: macOS
- Python: 3.7.9
- dbt: 

❯ dbt --version
installed version: 1.0.2
   latest version: 1.0.2

Up to date!

Plugins:
  - snowflake: 1.0.0 - Up to date!


What database are you using dbt with?

snowflake

Additional Context
alexrosenfeld10 commented 2 years ago

After working on this for a while I finally have come to a workaround:

sources:
  - name: the_thing_that_is_in_a_different_db_depending_on_things
    database: "reporting{{ '_' + target.name if not target.name.startswith('prod') }}"
    schema: schema_is_consistent_so_its_hard_coded
    tables: [ ... ]
alexrosenfeld10 commented 2 years ago

btw, I'm using target.name because our dbs are named as such:

reporting_dev, reporting_qa, reporting (this one is prod).

We also give each engineer their own, like this: reporting_qa_someonesname.

Is there a better practice for doing this? I know you can use environment variables, but I didn't find it to be much cleaner because I also have another db, let's call it other_reporting that follows the same naming conventions. They interact, and I want to do things to both dbs in the same dbt project in the same run. Because of this, it's pretty much useless to set the database in the profile...

My other option with env vars leaves me having to set the db in each yml file / in each model config, which is what led me to the override macro approach - it's DRYer and less error prone (someone could forget to set the db in the yml file / each model, but not with a macro override).
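For comparison, the env var option described above might look like this in each sources file (variable name illustrative; env_var is dbt's built-in function):

```yaml
sources:
  - name: foo  # illustrative
    # REPORTING_DB must be set per environment; repeating this line in
    # every yml file / model config is the non-DRY part described above
    database: "{{ env_var('REPORTING_DB', 'reporting_dev') }}"
    schema: foo
    tables: [ ... ]
```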

nathan-protempo commented 2 years ago

For my 2c, I think this is working as expected. The generate_database_name override controls the database used for dbt models. Sources are not models in dbt; they are references to external data, so they aren't governed by the dbt configuration for the target database, since source data is generally in a different database than the models dbt generates.

nathan-protempo commented 2 years ago

@alexrosenfeld10 probably useful to include the config from your source and model files in the issue since this could just be a config issue. It's not clear from the information given if you are calling the macro in the source config.

alexrosenfeld10 commented 2 years ago

source data is generally in a different database than models generated by dbt.

That's quite a generalization to make! It's super common to EL something into one schema and then do transformations on it in another schema. I don't see any reason why the expectation should be that those are in separate databases. Moreover, if they are in separate databases, then that's even more reason for such an override to exist.

@nathan-protempo Here's my source config: https://github.com/dbt-labs/dbt-core/issues/4753#issuecomment-1045782510. The idea is that I can DRY the database qualifier up by putting it at the source level, not the model levels. You might suggest to set that in the profile, and I thought about that as well. That's not useful to me because my project selects from multiple databases (controlled by models/<database_name> + the macro I posted above). So my options are either:

  1. specify the database explicitly in every model / source, which is manual and error prone and very non-DRY
  2. do the sort of crazy workaround I have, where directory name == database name plus a macro to go with it, model names follow a schemaName__modelName.sql convention plus a macro to go with that, and finally an alias override macro to remove the schemaName__ prefix from models before aliasing.
  3. Maybe a third option you can suggest that I didn't think of? I'd love that!

Thanks for helping to think this through. I realize it's possible I might have missed a good option, if so, I'm all ears. If not, I think this is a change that the dbt Labs team should consider. It doesn't have to be the same override even, a separate one just for sources would work well enough.

nathan-protempo commented 2 years ago

@alexrosenfeld10 - Yes in hindsight that is just me generalising my experience and of course there is no reason you couldn't have source tables in a different schema of the same database.

I tried to do some local testing with your custom database macro, and it generated errors on the initial show terse objects queries, because the generated database name came out in the form project_name\folder.default_schema. I'm not sure if this is what you intended, as I don't know what your folder structure or profiles yml looks like. It could also be an environment issue, as I am using Windows.

I note that your macro code uses the "target" function - there's an example on the doc page for that function for using target.name to override the source database, I wonder if that might help?
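For reference, the documented target.name pattern for overriding a source database looks roughly like this (database names illustrative):

```yaml
sources:
  - name: source_name
    database: |
      {%- if target.name == "dev" -%} raw_dev
      {%- elif target.name == "qa" -%} raw_qa
      {%- else -%} raw_prod
      {%- endif -%}
```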

alexrosenfeld10 commented 2 years ago

@nathan-protempo I think the environment-specific problem could be the direction of the '/' character used in my macro (Windows paths use '\'). Not sure how to get around that; I'll have to think about it.

Anyway, I've changed my approach a little bit. It's still the same core ideas though. Here's what I do now (it works pretty well):

{% macro generate_schema_name(custom_schema_name=none, node=none) -%}

  {%- if custom_schema_name is not none and node.resource_type in ['seed', 'model'] -%}
    {%- set error_message -%}
      {{ node.resource_type | capitalize }} '{{ node.unique_id }}' has a schema configured. This is not allowed because it will be derived from the directory structure.
    {%- endset -%}
    {{ exceptions.raise_compiler_error(error_message) }}
  {%- endif -%}
{#
Because we have many models with the same name in different schemas, we have to follow the
schema_name__model_name.sql naming convention.

dbt enforces that model filenames are unique, and does not support schema names in the ref() function.

See https://discourse.getdbt.com/t/extracting-schema-and-model-names-from-the-filename/575 and
https://discourse.getdbt.com/t/preventing-name-conflicts/221 for more information.

#}
  {{ node.name.split('__')[0] | trim }}
{%- endmacro %}

{% macro generate_alias_name(custom_alias_name=none, node=none) -%}
  {%- if custom_alias_name is not none and node.resource_type in ['seed', 'model'] -%}
    {%- set error_message -%}
      {{ node.resource_type | capitalize }} '{{ node.unique_id }}' has an alias configured. This is not allowed because it will be derived from the file name.
    {%- endset -%}
    {{ exceptions.raise_compiler_error(error_message) }}
  {%- endif -%}
    {{ node.name.split('__')[1] }}
{%- endmacro %}

{% macro generate_database_name(custom_database_name=none, node=none) -%}

  {%- if custom_database_name is not none and node.resource_type in ['seed', 'model'] -%}
    {%- set error_message -%}
      {{ node.resource_type | capitalize }} '{{ node.unique_id }}' has a database configured. This is not allowed because it will be derived from the directory structure and target name.
    {%- endset -%}
    {{ exceptions.raise_compiler_error(error_message) }}
  {%- endif -%}

  {% set path = node.path %}
  {% set database = path.split('/')[0] %}
  {{ database }}{{ '_' + target.name if not target.name.startswith('prod') }}
{%- endmacro %}

Note that this requires your databases to follow the convention of db_name_<target_name>. For example, ours are reporting_dev, reporting_qa, and reporting. We also have other_org_reporting_dev, other_org_reporting_qa, and other_org_reporting (the two orgs interact with each other in some places, which is the root of why I really need this, and can't just set it once in my profile.yml). Therefore the targets we configure are dev, qa, and prod_us / prod_eu etc.

With this approach, you can very clearly see why I want the overrides to apply to sources as well. Instead, I have to do this in my sources:

sources:
  - name: blah
    database: "reporting{{ '_' + target.name if not target.name.startswith('prod') }}" # for sources in other_org, I have to change it to "other_org{{ ... same exact conditional here ... }}"
    schema: schema_name
    tables: [ ... ]
alexrosenfeld10 commented 2 years ago

That example is cool, I didn't know you could multi-line and use full {% if %} style jinja. I was under the impression it had to be inline, in one {{ evaluation statement }}. This is nice to know about in general, but it doesn't address the core problem here. In fact, I'm effectively doing that to get around this problem, I'm just doing it with a one line if statement + a convenient naming convention.

jtcohen6 commented 2 years ago

Jumping in with three quick thoughts, since there's already some great conversation happening here:

  1. @nathan-protempo has the right sense for why dbt doesn't support generate_database_name for sources by default. dbt assumes that (a) sources are preexisting tables with predetermined locations, not things it is tasked with putting in place; and (b) those locations don't vary across environments. The goal behind the default custom schema behavior is to avoid two developers putting the same model in the same namespace location, and stepping on each others' toes. The same logic applies for why snapshots use target_database + target_schema configs instead, and don't use generate_X_name macros—it should be the same for all users. Obviously, in many complex deployments, these objects can vary across environments—sources in dev/staging are sampled, cleansed of PII, perhaps even faked entirely.
  2. Because snapshots are written in dbt-SQL, users can still set target_database = generate_database_name('db') if they so choose. The default is justifiable, and the alternative is still possible. Unfortunately, dbt doesn't support macro calls within .yml configuration today. (Exception: "late-rendered" config, such as hooks.) This is a capability that we've continually worked toward, and we've laid a number of the foundational pieces. The reason we haven't just done this: adding macros to the rendering contexts for configurations will be a significant performance hit at parse time. Could we add support for just generate_X_name macros, since those already have special behavior during parsing? Maybe. That would make this possible, at least:
    # not possible today
    sources:
      - name: blah
        database: "{{ generate_database_name('reporting') }}"
        schema: schema_name
        tables: [ ... ]
  3. We will soon be working to reconcile configurations for sources, so that they are at parity with other resource types (#3662). This is work we began ahead of v1.0, but de-prioritized relative to other changes. This will at least make the following behavior possible:
    
    # dbt_project.yml
    sources:
      my_project:
        all_my_sources:
          # logic is still duplicated, but compact - not scattered across different files
          reporting_sources:
            +database: "reporting{{ '_' + target.name if not target.name.startswith('prod') }}"
          other_org_sources:
            +database: "other_org{{ '_' + target.name if not target.name.startswith('prod') }}"
alexrosenfeld10 commented 2 years ago

Good thoughts, thanks @jtcohen6. Of those three I am most a fan of the third option.

It's not uncommon for people to use the same physical instance of a cloud database with separate logical (_qa kind of thing) databases, to save on $$. Also, it's not uncommon for folks to do what I'm doing where separate sub-orgs within a company go in separate databases. I understand the assumption dbt is currently making that sources are expected to be in a known location, my point here is to call into question the validity of that assumption.

My main question for you is, why isn't there a 4th option, a generate_X_name_sources (where x is database and schema)? Same exact idea as what we do for models, just applied to sources.

Similarly, you did say that dbt "doesn't support generate_x_name for sources by default", which implies it can be enabled. Where can I change that setting (Perhaps there is no way to and I'm just reading in a bit too carefully)?

jtcohen6 commented 2 years ago

Similarly, you did say that dbt "doesn't support generate_x_name for sources by default", which implies it can be enabled. Where can I change that setting (Perhaps there is no way to and I'm just reading in a bit too carefully)?

Sorry, yes, no way to do this today. I meant option 2 to be an illustration of how you could someday reenable this, for a particular source, in the same way users have done with snapshots.

So option 4 might be, an optional configuration that "enrolls" sources into generate_x_name macros (or a dedicated source-specific version). Behind the scenes, this would need to look like running update_parsed_node_name over sources when they're parsed.
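Sketching that option 4 as a config, it might look something like this (the flag is purely hypothetical, not real dbt syntax):

```yaml
# hypothetical opt-in, not supported by dbt today
sources:
  - name: blah
    config:
      # hypothetical flag: run the generate_*_name macros (and hence
      # update_parsed_node_name) over this source at parse time
      apply_custom_name_macros: true
```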

For the short term, I still like the third option most. We're planning to start work on that soon, with the aim of including it in v1.1. After that capability is unlocked, we can see if there remains a significant unmet need for more dynamic, macro-based configuration.

alexrosenfeld10 commented 2 years ago

Cool. The thing I like about having a way to enroll sources into generate_x_name macros is it could apply to more than just sources (right?). Right now I'm not using analyses or snapshots, but soon enough I'm sure I will be, and I'll need the same functionality there too. This kind of thing is a project-structure level thing, so it's going to apply for all pieces. For now, I'll have to come up with workarounds (like what I did for sources) and copy-paste them everywhere.

I'm a fan of 3, but per the above I still think a more global, configurable option would be great

alexrosenfeld10 commented 2 years ago

FWIW, I think anyone using something like this: https://discourse.getdbt.com/t/extracting-schema-and-model-names-from-the-filename/575 would be impacted by the same / a similar problem(s), at some level somewhere in their project.

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

alexrosenfeld10 commented 1 year ago

Commenting to keep this open

Callumoco commented 1 year ago

I've just come across this behaviour too and was struggling to understand why sources weren't changing their database value even though I configured the macro to do so.

In my case I want any source that uses a PROD_ database to use the DEV_ database instead, when the non-live profile is used.

# generate_database_name.sql
# Does not work for sources

{% macro generate_database_name(custom_database_name, node) -%}
    {%- set default_database = target.database -%}
    {%- if custom_database_name is none -%}
        {{ default_database }}
    {%- elif target.name == "non-live" and custom_database_name[:5] | upper == "PROD_" -%}
        {{ "DEV_" ~ custom_database_name[5:] }}
    {%- else -%}
        {{ custom_database_name | trim }}
    {%- endif -%}
{%- endmacro %}

For now, the below Jinja logic works; it's just not very user-friendly for developers when it comes to adding a new source. Would love for this to be a flag or setting in a future version.

# sources.yml
version: 2

sources:
  - name: source_name
    database: "{{ 'dev_database' if target.name == 'non-live' else 'prod_database' }}"
alexrosenfeld10 commented 1 year ago

I have the same workaround in place. Mine's a tiny bit different, posting in case you prefer:

version: 2

sources:
  - name: foo
    description: The foo data sources
    database: "reporting{{ '_' + target.database if not target.database.startswith('prod') }}"
    schema: foo
    tables:
      - name: a_table
github-actions[bot] commented 6 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 6 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

avjkumar commented 3 months ago

👋 We do need to programmatically define sources for various security and compliance reasons. Is this a feature we can request or is there an update on this?