dbt-labs / dbt-postgres

Apache License 2.0

[CT-2343] [Bug] snapshot with custom datatypes breaks when source table is regenerated #13

Open Bonnevie opened 1 year ago

Bonnevie commented 1 year ago

Is this a new bug in dbt-core?

Current Behavior

I've set up a snapshot table in a postgres database, and it has a column named annotation_type that uses a custom enum datatype. This worked without problems for weeks, but after the table had to be truncated and restored due to a faulty replica setup, we encountered the following error when trying to take a new snapshot:

12:22:53  Database Error in snapshot annotations_image_snapshot (snapshots/annotations_image_snapshot.sql)
12:22:53    syntax error at or near "USER"
12:22:53    LINE 3: ...ions_image_snapshot" add column "annotation_type" USER-DEFIN...

I cannot find the SQL file that the error references. The snapshot table was also suddenly missing the annotation_type column, even though snapshots supposedly never delete columns. We are at a development stage where we can still afford a full restart of the snapshot, but we'd rather not hit this again in the future. Our hypothesis is that the custom type is somehow not matched correctly after the table was regenerated.
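For context, the statement implied by the error message presumably has this shape (a hedged reconstruction, not the actual generated file; the schema name is hypothetical):

```sql
-- Hypothetical reconstruction of the DDL dbt generates here: dbt reads the
-- column's data_type from information_schema.columns, which reports the literal
-- string 'USER-DEFINED' for enum columns, and splices that into the DDL:
alter table "snapshots"."annotations_image_snapshot"
  add column "annotation_type" USER-DEFIN...  -- invalid: 'USER-DEFINED' is not a type name
```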

Expected Behavior

I would expect snapshots to be robust to no-ops like this, and at least to never incur data loss. If custom types are unsupported/problematic, it'd be nice to be warned.

Steps To Reproduce

Not easily reproduced, but the most likely sequence of events:

  1. build source table with custom enum type in postgres
  2. take snapshot
  3. truncate table, then rebuild
  4. take snapshot

Relevant log output

No response

Environment

No response

Which database adapter are you using with dbt?

postgres

Additional Context

No response

dbeatty10 commented 1 year ago

Thanks for opening this @Bonnevie !

I made an attempt at reproducing this, but wasn't able to trigger the same error as you. Could you tweak the simple example below to reproduce what you were seeing?

I am using a local postgres database as described here, so you'll need to tweak your database, schema, etc. as needed.

dbt project

models/_sources.yml

version: 2

sources:
  - name: dbt_dbeatty
    database: postgres
    schema: dbt_dbeatty 
    tables:
      - name: person

snapshots/my_snapshot.sql

{% snapshot my_snapshot %}

{{
    config(
      target_database='postgres',
      target_schema='snapshots',
      strategy='check',
      check_cols='all',
      unique_key='name'
    )
}}

select * from {{ source("dbt_dbeatty", "person") }}

{% endsnapshot %}

Steps

1) Build source table with custom enum type in postgres (using example here)

Using postgres IDE of choice:

CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');

CREATE TABLE dbt_dbeatty.person (
    name text,
    current_mood mood
);
INSERT INTO dbt_dbeatty.person VALUES ('Moe', 'happy');

2) Take a snapshot:

dbt build

3) Truncate table, then rebuild

Using postgres IDE of choice again:

TRUNCATE TABLE dbt_dbeatty.person;
INSERT INTO dbt_dbeatty.person VALUES ('Curly', 'ok');

4) Take another snapshot:

dbt build
benorourke commented 1 year ago

I'm also experiencing this issue. The macro postgres__get_columns_in_relation, which is called when the snapshot checks for column changes, selects data_type from the information_schema. For any UDT this returns the literal string USER-DEFINED. We can instead conditionally select udt_name to get the name of the enum.
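The behavior described above can be seen directly, assuming the `mood` enum and `person` table from the repro steps earlier in this thread:

```sql
-- Query information_schema for the person table; note how data_type collapses
-- the enum to 'USER-DEFINED' while udt_name preserves the actual type name:
select column_name, data_type, udt_name
from information_schema.columns
where table_name = 'person';

--  column_name  |  data_type   | udt_name
-- --------------+--------------+----------
--  name         | text         | text
--  current_mood | USER-DEFINED | mood
```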

As a temporary fix, I have overridden postgres__get_columns_in_relation in override_default_macros.sql to resolve udt names along with their schema:

{% macro postgres__get_columns_in_relation(relation) -%}
  {% call statement('get_columns_in_relation', fetch_result=True) %}
      select
          column_name,
          case
            when (data_type = 'USER-DEFINED') then udt_schema || '.' || udt_name
            else data_type
          end as data_type,
          character_maximum_length,
          numeric_precision,
          numeric_scale

      from {{ relation.information_schema('columns') }}
      where table_name = '{{ relation.identifier }}'
        {% if relation.schema %}
        and table_schema = '{{ relation.schema }}'
        {% endif %}
      order by ordinal_position

  {% endcall %}
  {% set table = load_result('get_columns_in_relation').table %}
  {{ return(sql_convert_columns_in_relation(table)) }}
{% endmacro %}
dbeatty10 commented 1 year ago

After attempting again with the following steps, I was able to reproduce the reported issue.

Also confirmed that the solution from @benorourke worked 🏅

Steps

1) Create a simple table that uses only built-in data types

Using postgres IDE of choice:

CREATE TABLE dbt_dbeatty.person (
    name text
);

INSERT INTO dbt_dbeatty.person VALUES ('Moe');

2) Take a snapshot:

dbt snapshot

3) Add a column that uses a custom data type

Build source table with custom enum type in postgres (using example here).

Using postgres IDE of choice again:

CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');

-- Yeah, a simple ALTER TABLE ... would be better here! But this is sufficient to reproduce
DROP TABLE dbt_dbeatty.person;
CREATE TABLE dbt_dbeatty.person (
    name text,
    current_mood mood
);

INSERT INTO dbt_dbeatty.person VALUES ('Curly', 'ok');

4) Take another snapshot:

dbt snapshot

💥

5) Override the postgres__get_columns_in_relation macro

Copy-paste the solution described here (untested) or here into macros/overrides.sql

6) Retry taking another snapshot:

dbt snapshot

aBBDnGus commented 1 year ago

I have a similar, related bug: I have an incremental table with the option on_schema_change='sync_all_columns'. If I now add a column with a user-defined type to the incremental model, the same error occurs. It can be fixed with the provided solution.

Environment: PostgreSQL, dbt 1.6.1
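A minimal incremental model that should hit the same code path might look like this (hypothetical model, reusing the source from the repro above):

```sql
-- models/person_incremental.sql (hypothetical)
{{
    config(
        materialized='incremental',
        unique_key='name',
        on_schema_change='sync_all_columns'
    )
}}

select * from {{ source("dbt_dbeatty", "person") }}
```

Adding a column with a user-defined type to the source table and re-running `dbt run` should then trigger the same USER-DEFINED error when dbt tries to sync the new column into the incremental table.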

dbeatty10 commented 11 months ago

Here is an alternative override for postgres__get_columns_in_relation that might be better (but I didn't test it out with this issue):

{% macro postgres__get_columns_in_relation(relation) -%}
  {% call statement('get_columns_in_relation', fetch_result=True) %}

    select
      col.attname as column_name,

      -- Two options for formatting the data type:
      -- 1) do not include the type modifier (e.g., `geometry`)
      -- 2) include the type modifier when applicable (e.g., `geometry(Point,4326)`)

      -- Option 1:
      pg_catalog.format_type(col.atttypid, null) as data_type,

      -- Option 2:
      -- pg_catalog.format_type(col.atttypid, col.atttypmod) as data_type,

      null as character_maximum_length,
      null as numeric_precision,
      null as numeric_scale

    from pg_catalog.pg_namespace sch
    join pg_catalog.pg_class tbl on tbl.relnamespace = sch.oid
    join pg_catalog.pg_attribute col on col.attrelid = tbl.oid
    left outer join pg_catalog.pg_description tbl_desc on (tbl_desc.objoid = tbl.oid and tbl_desc.objsubid = 0)
    left outer join pg_catalog.pg_description col_desc on (col_desc.objoid = tbl.oid and col_desc.objsubid = col.attnum)

    where 1=1
      and not pg_is_other_temp_schema(sch.oid) -- not a temporary schema belonging to another session
      and tbl.relpersistence in ('p', 'u') -- [p]ermanent table or [u]nlogged table. Exclude [t]emporary tables
      and tbl.relkind in ('r', 'v', 'f', 'p') -- o[r]dinary table, [v]iew, [f]oreign table, [p]artitioned table. Other values are [i]ndex, [S]equence, [c]omposite type, [t]OAST table, [m]aterialized view
      and col.attnum > 0 -- negative numbers are used for system columns such as oid
      and not col.attisdropped -- column has not been dropped

      and tbl.relname = '{{ relation.identifier }}'
        {% if relation.schema %}
        and sch.nspname = '{{ relation.schema }}'
        {% endif %}

    order by
        sch.nspname,
        tbl.relname,
        col.attnum

  {% endcall %}
  {% set table = load_result('get_columns_in_relation').table %}
  {{ return(sql_convert_columns_in_relation(table)) }}
{% endmacro %}
github-actions[bot] commented 1 month ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

magno32 commented 3 weeks ago

This issue also exists in unit tests. The workaround provided by @dbeatty10 gets us past the USER-DEFINED cast in the generated SQL for the unit-test, but any default geometry values given in the input/expect rows are wrapped as strings. I'm guessing there is another macro somewhere that would fix that.
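For reference, a unit test of the shape described above might look like this (a sketch with hypothetical model, source, and column names; the point is that values in `given`/`expect` rows are rendered as literals, which is where the string-wrapping of geometry values shows up):

```yaml
# models/_unit_tests.yml (hypothetical)
unit_tests:
  - name: test_geometry_passthrough
    model: my_geometry_model
    given:
      - input: source('dbt_dbeatty', 'places')
        rows:
          - {id: 1, geom: "POINT(0 0)"}  # rendered as a string literal, not a geometry
    expect:
      rows:
        - {id: 1, geom: "POINT(0 0)"}
```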

dbeatty10 commented 3 weeks ago

@magno32 could you please open an issue here for using the geometry data type within unit tests?