TobikoData / sqlmesh

Efficient data transformation and modeling framework that is backwards compatible with dbt.
https://sqlmesh.com
Apache License 2.0
1.51k stars 125 forks source link

Self-referenced models lose their data lineage #2413

Closed Giganoide01 closed 3 months ago

Giganoide01 commented 3 months ago

I'm facing an issue implementing a self-referenced incremental model.

The model receives an external table, with defined types for its columns, and makes a left join with itself. But it's not capable of finding its runtime-generated table.

Environment Versions

  1. Ubuntu 22.04.3 LTS
  2. Python 3.10.12
  3. sqlmesh 0.84.0
  4. duckdb v0.10.1 4a89d97db8

Steps to replicate

  1. run the following bash command to initialize the external model
    
    duckdb db.db -c "create schema if not exists mock_data;
    create table if not exists mock_data.input_table (id BIGINT, event_date VARCHAR);
    insert into mock_data.input_table values (1, '20240101'), (2, '20240102');"

2. create config.yaml as:

gateways: local: connection: type: duckdb database: db.db

default_gateway: local

model_defaults: dialect: duckdb


3. create models/incremental_model.sql as:

MODEL ( name abcd.incremental_model, kind INCREMENTAL_BY_TIME_RANGE ( time_column event_date ), start '2024-01-01', cron '@daily', );

select mock_data.input_table.id, mock_data.input_table.event_date, abcd.incremental_model.event_date as previous_event_date from mock_data.input_table left join abcd.incremental_model on mock_data.input_table.id = abcd.incremental_model.id


4. run `sqlmesh create_external_models`. This command creates schema.yaml and it should look like this:
  1. run sqlmesh plan dev. You should get an error like this:

    Error: Failed processing SnapshotId<"db"."abcd"."incremental_model": 1531324868>. Catalog Error: Table with name abcd__incremental_model__2081046935__temp does not exist!
    Did you mean "temp.pg_catalog.pg_index"?
    LINE 1: ...nput_table" AS "input_table" LEFT JOIN "db"."sqlmesh__abcd"."abcd__incremental...
  2. to correct this behavior, add an alias to abcd.incremental_model and specify the type of the incoming column in models/incremental_model.sql, as shown below:

    
    MODEL (
    name abcd.incremental_model,
    kind INCREMENTAL_BY_TIME_RANGE (
        time_column event_date
    ),
    start '2024-01-01',
    cron '@daily',
    );

select mock_data.input_table.id, mock_data.input_table.event_date, processed_table.event_date::TEXT as previous_event_date from mock_data.input_table left join abcd.incremental_model as processed_table on mock_data.input_table.id = processed_table.id



7. run `sqlmesh plan dev` again. It should work correctly.

I expected that the model could find its runtime-generated table without needing to define its types explicitly, since there are no changes in the types used in the operation to the ones defined for mock_data.input\_table in schema.yaml.
tobymao commented 3 months ago

this is now fixed in sqlglot v23.8.2

https://github.com/tobymao/sqlglot/commit/281db61009ee01d10690dcc1f2039062b2a1a58c

in general, i wouldn't recommend doing fully qualified column names as you've done and use aliases, but you are right that it should work so this has been fixed.

with the other issue regarding the schema / type for self references, this is a restriction of models that do self referencing. in order to have a model self reference, we need to be able to create the table statically. we can't create a self reference table with a CTAS query because it doesn't exist yet.

i'll document this / add an error. (https://github.com/TobikoData/sqlmesh/pull/2419/files)