jaklan opened this issue 2 years ago
Okay, I've got some more info. I added a debug statement:
{%- set relation = adapter.get_relation(
        database="integrated",
        schema=this.schema,
        identifier=this.identifier) -%}
{{ debug() }}
{% if relation is not none %}
    {{ log(relation.database ~ '.' ~ relation.schema ~ '.' ~ relation.identifier ~ " already exists", info=true) }}
    select * from {{ relation }}
and then I got:
20 context.vars['relation'] = l_0_relation
21 context.exported_vars.add('relation')
---> 22 yield to_string(environment.call(context, (undefined(name='debug') if l_0_debug is missing else l_0_debug)))
23 yield '\n'
24 if (not t_1((undefined(name='relation') if l_0_relation is missing else l_0_relation))):
But now, what's interesting - debug mode was entered twice: at the beginning of the run (so, I assume, during compilation) and during execution.
During the first visit, I got:
ipdb> l_0_relation is None
True
which is correct. But during the second visit (the one which started just after 03:13:24 1 of 1 START table model some_schema.some_model ................... [RUN]) I got:
ipdb> l_0_relation
<RedshiftRelation "integrated"."some_schema"."some_model">
so the cause is hidden somewhere here.
Looking at the docs, that behaviour actually makes sense, because it's related to the difference between the parse phase and the execution phase:
https://docs.getdbt.com/reference/dbt-jinja-functions/execute
That also explains why l_0_relation was initially None - because adapter.get_relation wasn't triggered during parsing. But there's still an issue in the execution phase, as the lookup should evaluate to None then as well.
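As a side note, the usual way to keep introspective lookups out of the parse phase entirely is to guard them with the execute flag - a minimal sketch (it avoids the double evaluation, but not the database bug itself):

{# Minimal sketch: only run the introspective lookup at execution time.
   During parsing, execute is False and the whole block is skipped. #}
{% if execute %}
    {%- set relation = adapter.get_relation(
            database="integrated",
            schema=this.schema,
            identifier=this.identifier) -%}
    {% if relation is not none %}
        {{ log(relation ~ " already exists", info=true) }}
    {% endif %}
{% endif %}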
@jaklan Thanks for the thorough writeup, and for detailing your follow-up investigation! An important piece of this puzzle: This is a cross-database query on Redshift. I'm assuming that you're using RA3 nodes.
I have a sense of what's going on here. Could you check the queries that are run during the second visit (execution time)? I would expect dbt to be running a query along these lines, since you're asking it to perform a lookup of "integrated"."some_schema"."some_model", and the schema "integrated"."some_schema" is not already cached:
select
'integrated' as database,
tablename as name,
schemaname as schema,
'table' as type
from pg_tables
where schemaname ilike 'some_schema'
union all
select
'integrated' as database,
viewname as name,
schemaname as schema,
'view' as type
from pg_views
where schemaname ilike 'some_schema'
See the issue? It's trying to access information about "integrated"."some_schema", but by querying pg_tables + pg_views in the current database (target.database). Instead, we'd need this to use a cross-database query, and the Redshift docs tell us that pg_catalog is not supported:

To get metadata across databases, use SVV_ALL and SVV_REDSHIFT metadata views. You can't use the three-part notation or external schemas to query cross-database metadata tables or views under information_schema and pg_catalog.

In cases where we're trying to access metadata on relations outside the current database, we need redshift__list_relations_without_caching to use svv_tables instead of the postgres__ version of the macro (pg_tables + pg_views).
I don't have a good sense of whether accessing that view is faster or slower than the pg_* views, or if there are any other gotchas. If it's faster or the same speed, we could switch to using svv_tables always. If it's slower, we should have conditional logic to use it only when needed (if schema_relation.database != target.database), and otherwise use the pg_* views when available.

I'm going to mark this one help_wanted and move it over to dbt-redshift. We'd welcome community contribution for it.

Logistical note: we don't currently have RA3 nodes running in CI, since they're still much more expensive at the lowest end than 2nd-gen node types, but we should take another look at finally setting some up to test this sort of functionality.
Hello @jtcohen6, thanks for the research and the detailed explanation! It definitely seems to be the direct cause and yes, we use RA3 nodes.

Such usage of adapter.get_relation is probably not very common in standard flows (though of course it can be pretty misleading when it happens - maybe we should at least add some warning for now?), but it becomes problematic when you utilise defer - because then you might want to use database=this.database or database=ref(some_model).database, as you don't know beforehand in which database a given model will be found - the target one or the "fallback" one. With this pitfall, that's unfortunately not possible, so you always refer to the target db, which can lead to very surprising results.
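To illustrate, the kind of lookup that breaks under defer looks roughly like this (upstream_model is just a placeholder name):

{# Hypothetical defer-aware lookup. With --defer, ref() may resolve to the
   "fallback" (e.g. prod) database, so the database argument can legitimately
   differ from target.database - which is exactly where this bug bites. #}
{%- set upstream = ref('upstream_model') -%}
{%- set relation = adapter.get_relation(
        database=upstream.database,
        schema=upstream.schema,
        identifier=upstream.identifier) -%}
{% if relation is none %}
    {{ log("upstream_model not found in " ~ upstream.database, info=true) }}
{% endif %}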
Regarding the implementation dilemmas - is there at least a chance you could decide on some recommended approach internally, so it would be clearer from the community's perspective which MR would be accepted?
A warning is definitely better than the status quo — though I think the actual resolution to this bug may be quick here as well.
My instinct here is that pg_tables is faster than svv_tables when it's available. This is hinted at in the Redshift docs, e.g.:

Amazon Redshift table names are stored in both PG_TABLES and STV_TBL_PERM; where possible, use PG_TABLES to return Amazon Redshift table names.

So I think the right move will be to reimplement redshift__list_relations_without_caching to use pg_* where possible, and svv_tables where necessary. Something like:
{% macro redshift__list_relations_without_caching(schema_relation) %}
{% if schema_relation.database == target.database %}
{{ return(postgres__list_relations_without_caching(schema_relation)) }}
{% else %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
select
table_catalog as database,
table_name as name,
table_schema as schema,
table_type as type
from svv_tables
where table_catalog ilike '{{ schema_relation.database }}'
and table_schema ilike '{{ schema_relation.schema }}'
{% endcall %}
{{ return(load_result('list_relations_without_caching').table) }}
{% endif %}
{% endmacro %}
You can actually give that a try by copy-pasting that macro code into your own project first — dbt will prefer your version over its own built-in one — to see if that resolves the bug.
Hi @jtcohen6,

I've returned to the issue recently, as it's a key blocker for us to adopt a local development workflow which is based on creating proper proxy views in the dev database when upstream tables exist in the uat database. The dev db is the one specified in the active target, so due to this issue we can't verify the existence of tables in the uat database.

There's also another solution for local development which was presented to us today by the dbt Core Team:
https://github.com/LewisDavies/upstream-prod
but as far as I can see, it would be affected by the issue as well:
https://github.com/LewisDavies/upstream-prod/blob/43ede2a1913235c3db7ca8c94d89eadaf04fe127/macros/ref.sql#L36-L40
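For context, a prod-fallback ref() override of that kind boils down to something like the following - a hypothetical simplification, not the actual upstream-prod code, with prod_db as a placeholder:

{# Hypothetical simplification of a prod-fallback ref() override.
   It relies on adapter.get_relation honouring the database argument,
   so it hits exactly the same bug. #}
{% macro ref(model_name) %}
    {%- set dev_relation = builtins.ref(model_name) -%}
    {%- set prod_relation = adapter.get_relation(
            database='prod_db',
            schema=dev_relation.schema,
            identifier=dev_relation.identifier) -%}
    {% if prod_relation is not none %}
        {{ return(prod_relation) }}
    {% endif %}
    {{ return(dev_relation) }}
{% endmacro %}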
So, I have tested the custom postgres__list_relations_without_caching macro you proposed, but unfortunately it didn't work due to:

Encountered an error while running operation: Runtime Error
maximum recursion depth exceeded while calling a Python object

I will try to debug that and find some solution in the upcoming days, but I would really appreciate prioritising this issue - especially since it directly affects approaches recommended by dbt consultants.
Hi @jaklan, I believe that the error is because of this:
{% macro postgres__list_relations_without_caching(schema_relation) %}
{% if schema_relation.database == target.database %}
{{ return(postgres__list_relations_without_caching(schema_relation)) }}
{% else %}
...
{% endmacro %}
The function is calling itself recursively forever if the first if is True.
With a bit of copy/pasting from the original macro it could look like:
{% macro postgres__list_relations_without_caching(schema_relation) %}
{% if schema_relation.database == target.database %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
select
'{{ schema_relation.database }}' as database,
tablename as name,
schemaname as schema,
'table' as type
from pg_tables
where schemaname ilike '{{ schema_relation.schema }}'
union all
select
'{{ schema_relation.database }}' as database,
viewname as name,
schemaname as schema,
'view' as type
from pg_views
where schemaname ilike '{{ schema_relation.schema }}'
{% endcall %}
{{ return(load_result('list_relations_without_caching').table) }}
{% else %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
select
table_catalog as database,
table_name as name,
table_schema as schema,
table_type as type
from svv_tables
where table_catalog ilike '{{ schema_relation.database }}'
and table_schema ilike '{{ schema_relation.schema }}'
{% endcall %}
{{ return(load_result('list_relations_without_caching').table) }}
{% endif %}
{% endmacro %}
Sorry! I meant to name the macro in my example above redshift__list_relations_without_caching. I've just edited it to reflect that change.
Thanks @jtcohen6 and @b-per, good catch! I tried both variants you pasted and they don't trigger the error, but they return None for tables which definitely exist in the target db - I will have a further look at that tomorrow.

Back to you guys - I was able to make it work by using svv_redshift_tables instead of svv_tables and adjusting the column names:
{% macro redshift__list_relations_without_caching(schema_relation) %}
{% if schema_relation.database == target.database %}
{{ return(postgres__list_relations_without_caching(schema_relation)) }}
{% else %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
select
database_name as database,
table_name as name,
schema_name as schema,
table_type as type
from svv_redshift_tables
where database_name ilike '{{ schema_relation.database }}'
and schema_name ilike '{{ schema_relation.schema }}'
{% endcall %}
{{ return(load_result('list_relations_without_caching').table) }}
{% endif %}
{% endmacro %}
svv_tables only returned schemas and tables for the target database, which is why I got None previously. But there's still an issue with performance, as querying svv_redshift_tables seems to be pretty slow with the current query.

There's also svv_all_tables, which is a superset of svv_redshift_tables and works as well, but it's obviously even slower.
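For reference, the equivalent lookup against svv_all_tables would presumably look the same, just pointed at the broader view (untested sketch):

select
    database_name as database,
    table_name as name,
    schema_name as schema,
    table_type as type
from svv_all_tables
where database_name ilike 'integrated'
  and schema_name ilike 'some_schema'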
Could you share a ballpark figure of how long querying svv_redshift_tables takes in your case?
@b-per
select
table_catalog as database,
table_name as name,
table_schema as schema,
table_type as type
from SVV_TABLES
where table_catalog ilike 'dev'
and table_schema ilike 'some_schema'
takes 1 row(s) fetched - 119ms
select
database_name as database,
table_name as name,
schema_name as schema,
table_type as type
from SVV_REDSHIFT_TABLES
where database_name ilike 'dev'
and schema_name ilike 'some_schema'
takes 1 row(s) fetched - 5.517s
For context:
select distinct
table_catalog as database
from SVV_TABLES
gives 1 row(s) fetched - 2.621s
and
select count(*)
from SVV_TABLES
returns 5179
select distinct
database_name as database
from SVV_REDSHIFT_TABLES
gives 38 row(s) fetched - 5.404s
and
select count(*)
from SVV_REDSHIFT_TABLES
returns 28263
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Bump to un-stale
Re-opening because this was not fixed: #428 only addressed the redshift__list_relations_without_caching macro, not redshift__get_columns_in_relation.
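For anyone picking this up, an untested sketch of what the analogous column-lookup fix might look like, assuming svv_redshift_columns exposes the usual column metadata (this is not the actual dbt-redshift implementation):

{% macro redshift__get_columns_in_relation(relation) %}
  {# Untested sketch only. svv_redshift_columns is queried so the lookup also
     works when relation.database differs from target.database. #}
  {% call statement('get_columns_in_relation', fetch_result=True) %}
    select
        column_name,
        data_type,
        character_maximum_length,
        numeric_precision,
        numeric_scale
    from svv_redshift_columns
    where database_name ilike '{{ relation.database }}'
      and schema_name ilike '{{ relation.schema }}'
      and table_name ilike '{{ relation.identifier }}'
    order by ordinal_position
  {% endcall %}
  {% set table = load_result('get_columns_in_relation').table %}
  {{ return(sql_convert_columns_in_relation(table)) }}
{% endmacro %}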
Is this a new bug in dbt-core?
Current Behavior
adapter.get_relation seems to ignore the database argument and use the target database anyway.

Expected Behavior
The right database is used.
Steps To Reproduce
I have a model, defined as below, which checks if it already exists in the same schema, but in another database. If it exists, it prints a log message and reads data from there:
I have 2 databases: dev and integrated. In dev I have that table already created, in integrated - not. The active target database is dev.

Now, when running dbt run -m "<model_identifier>", I would expect it to do nothing in the dev database. But, instead of that, I get:

so adapter.get_relation claims to be able to find the table in the integrated database - although it doesn't exist there. And because it doesn't exist, the run fails when it tries to actually read the data from there.

PS that's only a dummy example to visualise the issue, please don't focus on the logic itself
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
redshift
Additional Context
No response