LewisDavies / upstream-prod

A dbt package for easily using production data in a development environment.

redshift adapter.get_relation always returns null? #8

Closed tomasmv closed 1 year ago

tomasmv commented 1 year ago

Hello!

Thank you for creating this project, it's exactly what my team and I were looking for! I'm trying to get this set up on our Redshift cluster but I'm hitting an error. My adapter.get_relation call at the end keeps returning None. I've double-checked the database, schema, and identifier inputs and everything looks good - it's just failing to find the relation. Is there something extra I need to set up in my profiles or manifest that I'm missing? In my dbt project I've got upstream_prod_database: stg, and since we're using a custom generate_schema_name macro I'm not specifying anything on the schema front.

When I carve up the package and do something like {% set prod_ref = parent_ref | replace(parent_ref.database, prod_database) %} it works fine. So, I'm guessing there's something about get_relation specifically I'm not understanding.

LewisDavies commented 1 year ago

This sounds like a config issue. If you're using a custom schema name macro, you'll want to set upstream_prod_env_schemas: True and set upstream_prod_schema to the schema your production target uses.

You only need to set upstream_prod_database if your dev and prod environments are in different databases. If they are, this should be set to your prod database. You mention setting it to stg - is this definitely your production database?

Give that a try and let me know if it still doesn't work. I've tried to outline the setup steps in the readme but I'd welcome any improvement suggestions!
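For reference, the config described above might look something like this in dbt_project.yml (the schema and database names here are placeholders, not taken from this thread):

```yaml
vars:
  # Required when using a custom generate_schema_name macro
  upstream_prod_env_schemas: True
  # The schema your production target builds into (example name)
  upstream_prod_schema: analytics
  # Only needed if dev and prod live in different databases (example name)
  # upstream_prod_database: prod_db
```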

tomasmv commented 1 year ago

Ahhh you're right, it was the env schemas variable I didn't have set properly! To your question about databases: we have two Redshift clusters, a prod cluster and a stage cluster. All of our development happens in the stage cluster, where we keep separate databases for dev, plus a stg database that runs all of our code prior to releasing. We wanted dev to use the stg database, and I have that working ok.

I'm using upstream_prod_disabled_targets to filter out our prod target, but do you have anything similar for schemas? Say I don't want to use upstream-prod for anything in the seed schema of the dev database - is that supported at all?

tomasmv commented 1 year ago

Ok, been digging more into this since I keep hitting the same issue. I ended up going through some of the dbt Postgres adapter source code, and I'm not sure cross-database adapter.get_relation calls are allowed. The list of tables/schemas is limited to the current database, if I'm reading this right: https://github.com/dbt-labs/dbt-core/blob/main/plugins/postgres/dbt/include/postgres/macros/adapters.sql#L81-L101. So adapter.get_relation won't be able to find the table in my upstream_prod_database.
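For context, Postgres-family adapters build their relation list from the system catalogs, which only contain objects from the database you're currently connected to. A simplified sketch of the kind of query involved (not the adapter's exact macro):

```sql
-- pg_class and pg_namespace are per-database catalogs, so relations
-- in any other database are invisible to this query regardless of
-- which database name the caller asked for.
select
    n.nspname as schema_name,
    c.relname as table_name
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where n.nspname = 'my_schema';  -- example schema name
```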

LewisDavies commented 1 year ago

Hmm, I see what you mean. I see that get_relation calls list_relations, which uses list_relations_without_caching. However, it looks like the database arg is passed through to here which should let the macro use the right database.

Can you see any info about this event in your logs? Try searching for "code": "E014" and see if your table is listed.

> I don't want to use upstream prod for anything in the seed schema of the dev database, is that supported at all?

Not at the moment but it's a good idea. Would you want to exclude based on the schema or project directory (or maybe both)?

JackPloshnick commented 1 year ago

I am not sure if it's related, but I am also having an issue with adapter.get_relation. We have multiple production schemas, but each developer has a single developer schema.

Referencing the documentation here, I tried:

```sql
{%- set source_relation = adapter.get_relation(
      database=this.database,
      schema="dbt_jploshnick",
      identifier=this.name) -%}

{{ log("Source Relation: " ~ source_relation, info=true) }}
```

This will return the table name. However, if I run:

```sql
{%- set source_relation = adapter.get_relation(
      database=this.database,
      schema="int_core",
      identifier=this.name) -%}

{{ log("Source Relation: " ~ source_relation, info=true) }}
```

None is returned. Any idea why that is happening? We have modified our generate_schema_name macro and our generate_alias_name, which I imagine is causing the issue.

Thanks so much for putting this package together! If I can get it to work for us, it'll be a huge help.

JackPloshnick commented 1 year ago

I guess what I can't figure out is: assuming I had changed nothing, how would the macro be able to know about the int_core version of the model? The int_core version was created in a cloud job, while the dbt_jploshnick version is created in the cloud IDE. Do I need to import that cloud job's manifest for this package to work?

LewisDavies commented 1 year ago

@JackPloshnick Thanks for adding this. Are you also using Redshift?

This is where we get the schema that would be used in prod when you're using custom schema names. However, I think I've spotted a bug that means only the prefix would be returned, e.g. int. I'll look into fixing it and improving the tests.

Lastly, would you mind sharing your custom schema & alias macros? It's really helpful for me to see how other dbt projects are set up so the package works with as many as possible. I'm on the dbt Slack if you'd rather share them privately.

LewisDavies commented 1 year ago

@tomasmv @JackPloshnick I've opened a PR that should fix the errors you're seeing with None being returned. Could you please test it out? You can use the branch by adding this to packages.yml:

```yaml
packages:
  - git: https://github.com/LewisDavies/upstream-prod.git
    revision: fix-env-schema-target
```

LewisDavies commented 1 year ago

#9 has been merged, which should fix the errors you're seeing. You'll need to tweak your custom generate_schema_name macro - updated instructions are in the README. The package will now print the fully-qualified name of the relation it's trying to find instead of returning None, so it should be easier to diagnose any future issues.
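For anyone unfamiliar with custom schema macros, the standard pattern from the dbt docs looks like the sketch below - this is not the exact upstream-prod version, and the README describes the specific change the package needs:

```sql
-- Standard dbt custom schema macro pattern (illustrative only);
-- upstream-prod's README explains how to adapt it for the package.
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```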

FYI @tomasmv I understand that cross-database queries may not work in Redshift unless you're using RA3 nodes. Thanks to @JackPloshnick for pointing that out.

I'll probably create a new release today or tomorrow. The instructions above will still work if you want to try it out now. Thanks for your help!