jaklan opened this issue 2 years ago
Okay, I've got some more info. I added a debug statement:
{%- set relation = adapter.get_relation(
        database="integrated",
        schema=this.schema,
        identifier=this.identifier) -%}
{{ debug() }}
{% if relation is not none %}
    {{ log(relation.database ~ '.' ~ relation.schema ~ '.' ~ relation.identifier ~ " already exists", info=true) }}
    select * from {{ relation }}
and then I got:
20 context.vars['relation'] = l_0_relation
21 context.exported_vars.add('relation')
---> 22 yield to_string(environment.call(context, (undefined(name='debug') if l_0_debug is missing else l_0_debug)))
23 yield '\n'
24 if (not t_1((undefined(name='relation') if l_0_relation is missing else l_0_relation))):
But now, what's interesting - debug mode was entered twice: at the beginning of the run (so, I assume, during compilation) and during execution.
During the first visit, I got:
ipdb> l_0_relation is None
True
which is correct. But during the second visit (the one which started just after 03:13:24 1 of 1 START table model some_schema.some_model ................... [RUN]) I got:
ipdb> l_0_relation
<RedshiftRelation "integrated"."some_schema"."some_model">
so the cause is hidden somewhere here.
Looking at the docs, that behaviour actually makes sense, because it's related to the difference between the parse phase and the execution phase:
https://docs.getdbt.com/reference/dbt-jinja-functions/execute
That also explains why l_0_relation was initially None - because adapter.get_relation wasn't triggered during parsing. But there's still an issue in the execution phase, as the lookup should evaluate to None then as well.
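As a side note, the usual way to keep introspective lookups out of the parse phase entirely is to guard them with the execute flag - a minimal sketch (it avoids the double evaluation, but not the database bug itself):

{# Minimal sketch: only run the introspective lookup at execution time.
   During parsing, execute is False and the whole block is skipped. #}
{% if execute %}
    {%- set relation = adapter.get_relation(
            database="integrated",
            schema=this.schema,
            identifier=this.identifier) -%}
    {% if relation is not none %}
        {{ log(relation ~ " already exists", info=true) }}
    {% endif %}
{% endif %}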
@jaklan Thanks for the thorough writeup, and for detailing your follow-up investigation! An important piece of this puzzle: This is a cross-database query on Redshift. I'm assuming that you're using RA3 nodes.
I have a sense of what's going on here. Could you check the queries that are run during the second visit (execution time)? I would expect dbt to be running a query along these lines, since you're asking it to perform a lookup of "integrated"."some_schema"."some_model", and the schema "integrated"."some_schema" is not already cached:
select
'integrated' as database,
tablename as name,
schemaname as schema,
'table' as type
from pg_tables
where schemaname ilike 'some_schema'
union all
select
'integrated' as database,
viewname as name,
schemaname as schema,
'view' as type
from pg_views
where schemaname ilike 'some_schema'
See the issue? It's trying to access information about "integrated"."some_schema", but by querying pg_tables + pg_views in the current database (target.database). Instead, we'd need this to use a cross-database query, and the Redshift docs tell us that pg_catalog is not supported:

To get metadata across databases, use SVV_ALL and SVV_REDSHIFT metadata views. You can't use the three-part notation or external schemas to query cross-database metadata tables or views under information_schema and pg_catalog.

In cases where we're trying to access metadata on relations outside the current database, we need redshift__list_relations_without_caching to use svv_tables instead of the postgres__ version of the macro (pg_tables + pg_views).
I don't have a good sense of whether accessing that view is faster or slower than the pg_* views, or if there are any other gotchas. If it's faster or the same speed, we could switch to using svv_tables always. If it's slower, we should have conditional logic to use it only when needed (if schema_relation.database != target.database), and otherwise use the pg_* views when available.

I'm going to mark this one help_wanted and move it over to dbt-redshift. We'd welcome community contribution for it.

Logistical note: we don't currently have RA3 nodes running in CI, since they're still much more expensive at the lowest end than 2nd-gen node types, but we should take another look at finally setting some up to test this sort of functionality.
Hello @jtcohen6, thanks for the research and the detailed explanation! It definitely seems to be the direct cause and yes, we use RA3 nodes.

Such usage of adapter.get_relation is probably not very common in standard flows (though of course it can be pretty misleading when it happens - maybe we should at least add some warning for now?), but it becomes problematic when you utilise defer - because then you might want to use database=this.database or database=ref(some_model).database, as you don't know beforehand in which database a given model will be found - the target one or the "fallback" one. With this pitfall, that's unfortunately not possible, so you always refer to the target db, which can lead to very surprising results.
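To illustrate, the kind of lookup that breaks under defer looks roughly like this (upstream_model is just a placeholder name):

{# Hypothetical defer-aware lookup. With --defer, ref() may resolve to the
   "fallback" (e.g. prod) database, so the database argument can legitimately
   differ from target.database - which is exactly where this bug bites. #}
{%- set upstream = ref('upstream_model') -%}
{%- set relation = adapter.get_relation(
        database=upstream.database,
        schema=upstream.schema,
        identifier=upstream.identifier) -%}
{% if relation is none %}
    {{ log("upstream_model not found in " ~ upstream.database, info=true) }}
{% endif %}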
Regarding the implementation dilemmas - is there at least a chance you could decide on some recommended approach internally, so it would be clearer from the community's perspective which MR would be accepted?
A warning is definitely better than the status quo — though I think the actual resolution to this bug may be quick here as well.
My instinct here is that pg_tables is faster than svv_tables when it's available. This is hinted at in the Redshift docs, e.g.:

Amazon Redshift table names are stored in both PG_TABLES and STV_TBL_PERM; where possible, use PG_TABLES to return Amazon Redshift table names.

So I think the right move will be to reimplement redshift__list_relations_without_caching to use pg_* where possible, and svv_tables where necessary. Something like:
{% macro redshift__list_relations_without_caching(schema_relation) %}
{% if schema_relation.database == target.database %}
{{ return(postgres__list_relations_without_caching(schema_relation)) }}
{% else %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
select
table_catalog as database,
table_name as name,
table_schema as schema,
table_type as type
from svv_tables
where table_catalog ilike '{{ schema_relation.database }}'
and table_schema ilike '{{ schema_relation.schema }}'
{% endcall %}
{{ return(load_result('list_relations_without_caching').table) }}
{% endif %}
{% endmacro %}
You can actually give that a try by copy-pasting that macro code into your own project first — dbt will prefer your version over its own built-in one — to see if that resolves the bug.
Hi @jtcohen6,

I've returned to the issue recently, as it's a key blocker for us to adopt a local development workflow which is based on creating proper proxy views in the dev database when upstream tables exist in the uat database. The dev db is the one specified in the active target, so due to this issue we can't verify the existence of tables in the uat database.

There's also another solution for local development which was presented to us today by the dbt Core Team:
https://github.com/LewisDavies/upstream-prod
but as far as I can see, it would be affected by the issue as well:
https://github.com/LewisDavies/upstream-prod/blob/43ede2a1913235c3db7ca8c94d89eadaf04fe127/macros/ref.sql#L36-L40
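For context, a prod-fallback ref() override of that kind boils down to something like the following - a hypothetical simplification, not the actual upstream-prod code, with prod_db as a placeholder:

{# Hypothetical simplification of a prod-fallback ref() override.
   It relies on adapter.get_relation honouring the database argument,
   so it hits exactly the same bug. #}
{% macro ref(model_name) %}
    {%- set dev_relation = builtins.ref(model_name) -%}
    {%- set prod_relation = adapter.get_relation(
            database='prod_db',
            schema=dev_relation.schema,
            identifier=dev_relation.identifier) -%}
    {% if prod_relation is not none %}
        {{ return(prod_relation) }}
    {% endif %}
    {{ return(dev_relation) }}
{% endmacro %}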
So, I have tested the custom postgres__list_relations_without_caching macro you proposed, but unfortunately it didn't work due to:

Encountered an error while running operation: Runtime Error
maximum recursion depth exceeded while calling a Python object

I will try to debug that and find some solution in the upcoming days, but I would really appreciate prioritising this issue - especially since it directly affects approaches recommended by dbt consultants.
Hi @jaklan, I believe that the error is because of this:
{% macro postgres__list_relations_without_caching(schema_relation) %}
{% if schema_relation.database == target.database %}
{{ return(postgres__list_relations_without_caching(schema_relation)) }}
{% else %}
...
{% endmacro %}
The function is calling itself recursively forever if the first if is True.
With a bit of copy/pasting from the original macro it could look like:
{% macro postgres__list_relations_without_caching(schema_relation) %}
{% if schema_relation.database == target.database %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
select
'{{ schema_relation.database }}' as database,
tablename as name,
schemaname as schema,
'table' as type
from pg_tables
where schemaname ilike '{{ schema_relation.schema }}'
union all
select
'{{ schema_relation.database }}' as database,
viewname as name,
schemaname as schema,
'view' as type
from pg_views
where schemaname ilike '{{ schema_relation.schema }}'
{% endcall %}
{{ return(load_result('list_relations_without_caching').table) }}
{% else %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
select
table_catalog as database,
table_name as name,
table_schema as schema,
table_type as type
from svv_tables
where table_catalog ilike '{{ schema_relation.database }}'
and table_schema ilike '{{ schema_relation.schema }}'
{% endcall %}
{{ return(load_result('list_relations_without_caching').table) }}
{% endif %}
{% endmacro %}
Sorry! I meant to name the macro in my example above redshift__list_relations_without_caching. I've just edited it to reflect that change.
Thanks @jtcohen6 and @b-per, good catch! I tried both variants you pasted and they don't trigger the error, but they return None for tables which definitely exist in the target db - I will have a further look at that tomorrow.

Back to you guys - I was able to make it work by using svv_redshift_tables instead of svv_tables and adjusting the column names:
{% macro redshift__list_relations_without_caching(schema_relation) %}
{% if schema_relation.database == target.database %}
{{ return(postgres__list_relations_without_caching(schema_relation)) }}
{% else %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
select
database_name as database,
table_name as name,
schema_name as schema,
table_type as type
from svv_redshift_tables
where database_name ilike '{{ schema_relation.database }}'
and schema_name ilike '{{ schema_relation.schema }}'
{% endcall %}
{{ return(load_result('list_relations_without_caching').table) }}
{% endif %}
{% endmacro %}
svv_tables only returned schemas and tables for the target database, which is why I got None previously. But there's still an issue with performance, as querying svv_redshift_tables seems to be pretty slow with the current query.

There's also svv_all_tables, which is a superset of svv_redshift_tables and works as well, but it's obviously even slower.
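For reference, the equivalent lookup against svv_all_tables would presumably look the same, just pointed at the broader view (untested sketch):

select
    database_name as database,
    table_name as name,
    schema_name as schema,
    table_type as type
from svv_all_tables
where database_name ilike 'integrated'
  and schema_name ilike 'some_schema'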
Could you share a ballpark figure of how long querying svv_redshift_tables takes in your case?
@b-per
select
table_catalog as database,
table_name as name,
table_schema as schema,
table_type as type
from SVV_TABLES
where table_catalog ilike 'dev'
and table_schema ilike 'some_schema'
takes 1 row(s) fetched - 119ms
select
database_name as database,
table_name as name,
schema_name as schema,
table_type as type
from SVV_REDSHIFT_TABLES
where database_name ilike 'dev'
and schema_name ilike 'some_schema'
takes 1 row(s) fetched - 5.517s
For context:
select distinct
table_catalog as database
from SVV_TABLES
gives 1 row(s) fetched - 2.621s
and
select count(*)
from SVV_TABLES
returns 5179
select distinct
database_name as database
from SVV_REDSHIFT_TABLES
gives 38 row(s) fetched - 5.404s
and
select count(*)
from SVV_REDSHIFT_TABLES
returns 28263
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Bump to un-stale
Re-opening because this was not fixed: #428 only addressed the redshift__list_relations_without_caching macro, not redshift__get_columns_in_relation.
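For anyone picking this up, an untested sketch of what the analogous column-lookup fix might look like, assuming svv_redshift_columns exposes the usual column metadata (this is not the actual dbt-redshift implementation):

{% macro redshift__get_columns_in_relation(relation) %}
  {# Untested sketch only. svv_redshift_columns is queried so the lookup also
     works when relation.database differs from target.database. #}
  {% call statement('get_columns_in_relation', fetch_result=True) %}
    select
        column_name,
        data_type,
        character_maximum_length,
        numeric_precision,
        numeric_scale
    from svv_redshift_columns
    where database_name ilike '{{ relation.database }}'
      and schema_name ilike '{{ relation.schema }}'
      and table_name ilike '{{ relation.identifier }}'
    order by ordinal_position
  {% endcall %}
  {% set table = load_result('get_columns_in_relation').table %}
  {{ return(sql_convert_columns_in_relation(table)) }}
{% endmacro %}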
Is this a new bug in dbt-core?
Current Behavior
adapter.get_relation seems to ignore the database argument and use the target database anyway.

Expected Behavior
The right database is used.
Steps To Reproduce
I have a model, defined as below, which checks if it already exists in the same schema, but in another database. If it exists, it prints a log message and reads data from there:
I have 2 databases: dev and integrated. In dev I have that table already created, in integrated - not. The active target database is dev.

Now, when running dbt run -m "<model_identifier>", I would expect it to do nothing in the dev database. But, instead of that, I get:

so adapter.get_relation claims to be able to find the table in the integrated database - although it doesn't exist there. And because it doesn't exist, the run fails when it tries to actually read the data from there.

PS that's only a dummy example to visualise the issue, please don't focus on the logic itself
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
redshift
Additional Context
No response