AltimateAI / vscode-dbt-power-user

This extension makes vscode seamlessly work with dbt™: Auto-complete, preview, column lineage, AI docs generation, health checks, cost estimation etc
https://www.altimate.ai
MIT License
442 stars 85 forks source link

Could not load register_upstream_external_models in dbt-duckdb #1381

Open xqbumu opened 3 weeks ago

xqbumu commented 3 weeks ago

Expected behavior

I’m currently using dbt-duckdb and writing data to Minio. dbt run executes successfully and writes data to Minio, but I cannot execute Execute dbt SQL in the VS Code plugin.

When clicking "Execute dbt SQL", it is possible to query the raw model stored in S3 from the stg model.

And more, I’ve added the following configuration to my dbt_project.yml file:

on-run-start:
  - "{{ register_upstream_external_models() }}"

I found selected_resources is empty in macro register_upstream_external_models. Therefore, the macro will do nothing.

Actual behavior

When clicking "Execute dbt SQL", Query Results Show:

Runtime Error Catalog Error: Table with name raw_events does not exist! Did you mean "pg_catalog.pg_views"? LINE 6: SELECT * FROM "memory"."main"."raw_events" ^

Steps To Reproduce

file: s3://playground/raw/events.json

{"datetime":"2024-07-01T00:03:48.072909+08:00","action":"login","user_id":1,"ip":"127.0.0.1"}
{"datetime":"2024-07-01T00:03:48.072909+08:00","action":"pay","user_id":1,"amount":2}

file: raw_events.sql

{{
  config(
    format='parquet',
    incremental_strategy='insert_overwrite',
    incremental=True,
    location='s3://playground/warehouse/raw_events',
    materialized='external',
    options={"partition_by": "dw_date", "overwrite_or_ignore": 1}
  )
}}

SELECT
  datetime as dw_ts,
  CAST(dw_ts AS DATE) AS dw_date,
  * EXCLUDE (datetime)
FROM read_json_auto("s3://playground/raw/events.json", sample_size=204800)

file: stg_events.sql

{{
  config(
    format='parquet',
    incremental_strategy='insert_overwrite',
    incremental=True,
    location='s3://playground/warehouse/stg_events',
    materialized='external',
    options={"partition_by": "dw_date", "overwrite_or_ignore": 1}
  )
}}

SELECT * FROM {{ ref('raw_events') }}

Log output/Screenshots

Runtime Error Catalog Error: Table with name raw_events does not exist! Did you mean "pg_catalog.pg_views"? LINE 6: SELECT * FROM "memory"."main"."raw_events" ^

View Detailed Error 🚨 { "code": -1, "message": "Runtime Error\n Catalog Error: Table with name raw_events does not exist!\n Did you mean \"pg_catalog.pg_views\"?\n LINE 6: SELECT FROM \"memory\".\"main\".\"raw_events\"\n ^", "data": "\"Error: Runtime Error\n Catalog Error: Table with name raw_events does not exist!\n Did you mean \\"pg_catalog.pg_views\\"?\n LINE 6: SELECT FROM \\"memory\\".\\"main\\".\\"raw_events\\"\n ^\n\tat DBTCoreProjectIntegration_1. (/Users/bumu/.vscode/extensions/innoverio.vscode-dbt-power-user-0.44.4/dist/extension.js:18686:127)\n\tat Generator.throw ()\n\tat rejected (/Users/bumu/.vscode/extensions/innoverio.vscode-dbt-power-user-0.44.4/dist/extension.js:27066:28)\"" }

Operating System

macOS 14.6.1

dbt version

1.8.5

dbt Adapter

dbt-duckdb 1.8.3

dbt Power User version

0.44.4

Are you willing to submit PR?

mdesmet commented 2 weeks ago

Can I summarize your issue as the extension doesn't execute {{ register_upstream_external_models() }} before executing your query?

xqbumu commented 2 weeks ago

Can I summarize your issue as the extension doesn't execute {{ register_upstream_external_models() }} before

I've conducted a thorough debugging session on dbt_core_integration.py, and based on my findings, I believe that the issue might stem from the lack of a built model dependency graph or update model dependency graph in dbt-core. This could explain why CREATE OR REPLACE VIEW statements aren't being generated which from register_upstream_external_models.

Unfortunately, as I'm not deeply acquainted with the inner workings of dbt and vscode-dbt-power-user, I'm unable to provide a more definitive answer at this time. I'll continue to delve into the codebase to uncover the precise cause.

mdesmet commented 2 weeks ago

Can you explain what {{ register_upstream_external_models() }} is supposed to do? If you remove this on_run_start definition, does running dbt run work?

xqbumu commented 2 weeks ago

Can you explain what {{ register_upstream_external_models() }} is supposed to do? If you remove this on_run_start definition, does running dbt run work?

Thank you for your prompt response. I apologize that I was unable to provide feedback in a timely manner due to external factors.

Here is the source code for the register_upstream_external_models macro:

https://github.com/duckdb/dbt-duckdb/blob/master/dbt/include/duckdb/macros/utils/upstream.sql

This macro create or replace upstream views based on the graph to generate dependencies for current model queries.

When running dbt run directly, the program creates a duckdb database(maybe in memory) and fully traverses the graph and generates dependencies, so whether the register_upstream_external_models macro is added or not does not actually affect the final result.

However, when running dbt run --models stg_events, if the macro is not added, it will output:

07:59:46  Running with dbt=1.8.5
07:59:46  Registered adapter: duckdb=1.8.3
07:59:47  Found 3 models, 1 source, 410 macros
07:59:47  
07:59:47  Concurrency: 1 threads (target='dev')
07:59:47  
07:59:47  1 of 1 START sql external model main.stg_events ................................ [RUN]
07:59:47  1 of 1 ERROR creating sql external model main.stg_events ....................... [ERROR in 0.11s]
07:59:47  
07:59:47  Finished running 1 external model in 0 hours 0 minutes and 0.33 seconds (0.33s).
07:59:47  
07:59:47  Completed with 1 error and 0 warnings:
07:59:47  
07:59:47    Runtime Error in model stg_events (models/basic/stg_events.sql)
  Catalog Error: Table with name raw_events does not exist!
  Did you mean "temp.pg_catalog.pg_views"?
  LINE 13: SELECT action FROM "playground"."main"."raw_events"
                              ^
07:59:47  
07:59:47  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
make: *** [model] Error 1

So, adding this macro is necessary.

mdesmet commented 2 weeks ago

Running the macro is necessary, it just works in dbt run because on-run-start is executed before the graph traversal. So my analysis is right. The way to fix this is to also execute on-run-start and possibly on-run-end before executing queries. This needs more analysis to know which places where we possibly need to do that.

xqbumu commented 2 weeks ago

Running the macro is necessary, it just works in dbt run because on-run-start is executed before the graph traversal. So my analysis is right. The way to fix this is to also execute on-run-start and possibly on-run-end before executing queries. This needs more analysis to know which places where we possibly need to do that.

I believe we need to activate some graph dependency nodes. While register_upstream_external_models essentially creates or replaces views based on selected_resources and the graph, it seems like it's not fully addressing the issue.

When I manually copied the code from register_upstream_external_models into a model and click Execute dbt SQL, the problem persisted.

Currently, I have only made manual modifications in file dbt_core_integration.py:

# append to ConfigInterface.__init__ function
self.select = ["stg_events"]
self.selector = None
self.exclude = ()
# append to DbtProject.parse_project function after build_flat_graph function call
ct = CompileTask(get_flags(), self.config, self.dbt)
ct.compile_manifest()
ct.get_graph_queue()

The two changes cause the operation register_upstream_external_models to enter a loop, but the graph remains empty.

So, the create or replace view statement still empty.