ClickHouse / dbt-clickhouse

The ClickHouse plugin for dbt (data build tool)
Apache License 2.0

Support allow_nullable_key config for models #209

GonnaMakeYouCry closed this issue 11 months ago

GonnaMakeYouCry commented 11 months ago

Describe the bug

The model behaves incorrectly when the allow_nullable_key setting is used.

Steps to reproduce

My dbt model implements an ETL process that reads data from PostgreSQL (via the PostgreSQL table engine) and loads it into ClickHouse. The source table in Postgres has nullable columns that I use as the unique key and partition key (in fact, there are no NULL values in them).

  1. Create a model with the allow_nullable_key setting.
  2. Run the model. Everything is OK.
  3. Run the model again. It fails with: Code: 115. DB::Exception: Setting allow_nullable_key is neither a builtin setting nor started with the prefix 'custom_' registered for user-defined settings. (UNKNOWN_SETTING) (version 23.3.15.29 (official build)). See the sketch below.
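For context, allow_nullable_key is a MergeTree table-level setting, not a query-level one, which is consistent with the error above: ClickHouse accepts it in table DDL but rejects it when appended to a query, at least on this server version. A minimal sketch of the distinction (table and column names are hypothetical):

    -- Accepted: allow_nullable_key is a MergeTree table setting, needed
    -- here because a Nullable column is used in the sorting key
    CREATE TABLE t
    (
        id Nullable(Int64),
        v  String
    )
    ENGINE = MergeTree
    ORDER BY id
    SETTINGS allow_nullable_key = 1;

    -- Rejected with Code: 115 (UNKNOWN_SETTING) on this server version:
    -- the query-level SETTINGS clause does not recognize this table setting
    SELECT * FROM t SETTINGS allow_nullable_key = 1;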

Expected behaviour

Re-running the model should not lead to errors

Code examples, such as models or profile settings

Model config:

    {{ config(
        order_by='projectId',
        engine='MergeTree()',
        partition_by='projectId',
        materialized='incremental',
        settings={'allow_nullable_key': 1},
        unique_key='modelId',
        schema='core'
    ) }}
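With this config, the first run's CREATE TABLE is presumably rendered along these lines (a sketch; the column list is elided), which is where the setting is legitimately consumed:

    CREATE TABLE core.actual_project_data
    (
        ...  -- columns inferred from the model's SELECT
    )
    ENGINE = MergeTree()
    PARTITION BY projectId
    ORDER BY projectId
    SETTINGS allow_nullable_key = 1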

dbt and/or ClickHouse server logs

I used this query to find where the error happened:

    select query from system.query_log where exception like '%115. DB::Exception%'

And the query result:

/ {"app": "dbt", "dbt_version": "1.4.9", "profile_name": "clickhouse_beta", "target_name": "beta", "node_id": "model.iyno_data_marts.actual_project_data"} /

    insert into core.actual_project_data__dbt_tmp ("projectId", "periodId", "modelId")
    select "projectId", "periodId", "modelId"
    from core.actual_project_data
      where (modelId) not in (
        select modelId
        from core.actual_project_data__dbt_new_data
      )
   SETTINGS  allow_nullable_key=1
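A slightly more targeted variant of that diagnostic query (a sketch, assuming the standard system.query_log columns) filters on the exception code directly:

    SELECT event_time, query, exception
    FROM system.query_log
    WHERE exception_code = 115  -- UNKNOWN_SETTING
    ORDER BY event_time DESC
    LIMIT 10;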

Configuration

Core:

Plugins:

ClickHouse server

simpl1g commented 11 months ago

@GonnaMakeYouCry do you use a cluster or a single-server setup? I tried to reproduce this with a single server, and it works fine for me.

According to the code, *__dbt_tmp is created only with the distributed_table materialization. Maybe you can share your full debug log from dbt (dbt -d run -s yourmodel) so we can understand why it is not working for you.

GonnaMakeYouCry commented 11 months ago

@simpl1g We use the Yandex Cloud managed ClickHouse service. Here is the log after running dbt -d run -s yourmodel:

============================== 2023-11-20 07:16:20.020996 | e3382018-9d72-4754-870d-b3f1ffe168d1 ==============================
07:16:20.020996 [info ] [MainThread]: Running with dbt=1.4.9
07:16:20.023435 [debug] [MainThread]: running dbt with arguments {'debug': True, 'write_json': True, 'use_colors': True, 'printer_width': 80, 'version_check': True, 'partial_parse': True, 'static_parser': True, 'profiles_dir': '/home/airflow/.dbt', 'send_anonymous_usage_stats': True, 'quiet': False, 'no_print': False, 'cache_selected_only': False, 'select': ['./models/core/actual_project_data.sql'], 'which': 'run', 'rpc_method': 'run', 'indirect_selection': 'eager'}
07:16:20.024209 [debug] [MainThread]: Tracking: tracking
07:16:20.025864 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fdaedeb7b10>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fdaedeb7e50>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fdaedeb7e10>]}
07:16:20.046441 [debug] [MainThread]: checksum: 07ac7676dbf91b944d5fe41bac352fb2dda62b33015cbd1e52faa2a2c71f6d73, vars: {}, profile: None, target: None, version: 1.4.9
07:16:20.081888 [debug] [MainThread]: Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
07:16:20.082698 [debug] [MainThread]: Partial parsing enabled, no changes found, skipping parsing
07:16:20.084464 [warn ] [MainThread]: [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources. There are 1 unused configuration paths:

WITH project_active_periods AS (
    SELECT pg_prod_repl.element_periods.projectId as "projectId"
        ,pg_prod_repl.element_periods.id as "periodId"
    FROM core.projects_in_work
    JOIN pg_prod_repl.projects ON core.projects_in_work.projectId = pg_prod_repl.projects.id
    JOIN pg_prod_repl.element_periods ON core.projects_in_work.projectId = pg_prod_repl.element_periods.projectId
    WHERE True
        AND pg_prod_repl.element_periods.activeFrom < NOW()
        AND pg_prod_repl.element_periods.activeTo IS NULL
        AND pg_prod_repl.projects.deleted = 0
),
project_active_models AS (
    SELECT core.projects_in_work.projectId AS "projectId"
        ,pg_prod_repl.models.id AS "modelId"
    FROM core.projects_in_work
    JOIN pg_prod_repl.projects ON core.projects_in_work.projectId = pg_prod_repl.projects.id
    JOIN pg_prod_repl.models ON core.projects_in_work.projectId = pg_prod_repl.models.projectId
    JOIN pg_prod_repl.model_pdata ON pg_prod_repl.models.id = pg_prod_repl.model_pdata.modelId
    JOIN pg_prod_repl.model_periods ON pg_prod_repl.model_pdata.periodId = pg_prod_repl.model_periods.id
    WHERE True
        AND pg_prod_repl.model_periods.activeTo IS NULL
        AND pg_prod_repl.model_periods.activeFrom < NOW()
        AND pg_prod_repl.projects.deleted = 0
),
actual_project_data AS (
    SELECT project_active_periods.projectId AS projectId
        ,project_active_periods.periodId AS periodId
        ,project_active_models.modelId AS modelId
        -- assumeNotNull(project_active_periods.projectId) AS projectId
        -- ,assumeNotNull(project_active_periods.periodId) AS periodId
        -- ,assumeNotNull(project_active_models.modelId) AS modelId
        ,date_add(HOUR, 3, NOW()) AS "SYS_MOMENT"
    FROM project_active_periods
    JOIN project_active_models ON project_active_periods."projectId" = project_active_models."projectId"
)

SELECT * FROM actual_project_data

-- this filter will only be applied on an incremental run
where modelId NOT IN (select modelId from core.actual_project_data)

)
    ...

07:16:21.232101 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.30 seconds
07:16:21.245551 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.iyno_data_marts.actual_project_data: /* {"app": "dbt", "dbt_version": "1.4.9", "profile_name": "clickhouse_beta", "target_name": "beta", "node_id": "model.iyno_data_marts.actual_project_data"} */

select name, type from system.columns where table = 'actual_project_data__dbt_new_data'

  and database = 'core'

order by position

... 07:16:21.249791 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.00 seconds
07:16:21.255539 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.iyno_data_marts.actual_project_data: /* {"app": "dbt", "dbt_version": "1.4.9", "profile_name": "clickhouse_beta", "target_name": "beta", "node_id": "model.iyno_data_marts.actual_project_data"} */

    insert into core.actual_project_data__dbt_new_data ("projectId", "periodId", "modelId", "SYS_MOMENT")

WITH project_active_periods AS (
    SELECT pg_prod_repl.element_periods.projectId as "projectId"
        ,pg_prod_repl.element_periods.id as "periodId"
    FROM core.projects_in_work
    JOIN pg_prod_repl.projects ON core.projects_in_work.projectId = pg_prod_repl.projects.id
    JOIN pg_prod_repl.element_periods ON core.projects_in_work.projectId = pg_prod_repl.element_periods.projectId
    WHERE True
        AND pg_prod_repl.element_periods.activeFrom < NOW()
        AND pg_prod_repl.element_periods.activeTo IS NULL
        AND pg_prod_repl.projects.deleted = 0
),
project_active_models AS (
    SELECT core.projects_in_work.projectId AS "projectId"
        ,pg_prod_repl.models.id AS "modelId"
    FROM core.projects_in_work
    JOIN pg_prod_repl.projects ON core.projects_in_work.projectId = pg_prod_repl.projects.id
    JOIN pg_prod_repl.models ON core.projects_in_work.projectId = pg_prod_repl.models.projectId
    JOIN pg_prod_repl.model_pdata ON pg_prod_repl.models.id = pg_prod_repl.model_pdata.modelId
    JOIN pg_prod_repl.model_periods ON pg_prod_repl.model_pdata.periodId = pg_prod_repl.model_periods.id
    WHERE True
        AND pg_prod_repl.model_periods.activeTo IS NULL
        AND pg_prod_repl.model_periods.activeFrom < NOW()
        AND pg_prod_repl.projects.deleted = 0
),
actual_project_data AS (
    SELECT project_active_periods.projectId AS projectId
        ,project_active_periods.periodId AS periodId
        ,project_active_models.modelId AS modelId
        -- assumeNotNull(project_active_periods.projectId) AS projectId
        -- ,assumeNotNull(project_active_periods.periodId) AS periodId
        -- ,assumeNotNull(project_active_models.modelId) AS modelId
        ,date_add(HOUR, 3, NOW()) AS "SYS_MOMENT"
    FROM project_active_periods
    JOIN project_active_models ON project_active_periods."projectId" = project_active_models."projectId"
)

SELECT * FROM actual_project_data

-- this filter will only be applied on an incremental run
where modelId NOT IN (select modelId from core.actual_project_data)

  ...

07:16:22.074887 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.82 seconds
07:16:22.077426 [debug] [Thread-1 ]: Writing runtime sql for node "model.iyno_data_marts.actual_project_data"
07:16:22.078670 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.iyno_data_marts.actual_project_data: /* {"app": "dbt", "dbt_version": "1.4.9", "profile_name": "clickhouse_beta", "target_name": "beta", "node_id": "model.iyno_data_marts.actual_project_data"} */

      create table core.actual_project_data__dbt_tmp as core.actual_project_data__dbt_new_data 

  ...

07:16:22.097520 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.02 seconds
07:16:22.102323 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.iyno_data_marts.actual_project_data: /* {"app": "dbt", "dbt_version": "1.4.9", "profile_name": "clickhouse_beta", "target_name": "beta", "node_id": "model.iyno_data_marts.actual_project_data"} */

select name, type from system.columns where table = 'actual_project_data'

  and database = 'core'

order by position

... 07:16:22.106706 [debug] [Thread-1 ]: dbt_clickhouse adapter: SQL status: OK in 0.00 seconds
07:16:22.109486 [debug] [Thread-1 ]: dbt_clickhouse adapter: On model.iyno_data_marts.actual_project_data: /* {"app": "dbt", "dbt_version": "1.4.9", "profile_name": "clickhouse_beta", "target_name": "beta", "node_id": "model.iyno_data_marts.actual_project_data"} */

    insert into core.actual_project_data__dbt_tmp ("projectId", "periodId", "modelId", "SYS_MOMENT")
    select "projectId", "periodId", "modelId", "SYS_MOMENT"
    from core.actual_project_data
      where (modelId) not in (
        select modelId
        from core.actual_project_data__dbt_new_data
      )
   SETTINGS  allow_nullable_key=1

...

07:16:22.113751 [debug] [Thread-1 ]: dbt_clickhouse adapter: Error running SQL: /* {"app": "dbt", "dbt_version": "1.4.9", "profile_name": "clickhouse_beta", "target_name": "beta", "node_id": "model.iyno_data_marts.actual_project_data"} */

    insert into core.actual_project_data__dbt_tmp ("projectId", "periodId", "modelId", "SYS_MOMENT")
    select "projectId", "periodId", "modelId", "SYS_MOMENT"
    from core.actual_project_data
      where (modelId) not in (
        select modelId
        from core.actual_project_data__dbt_new_data
      )
   SETTINGS  allow_nullable_key=1

07:16:22.114606 [debug] [Thread-1 ]: Timing info for model.iyno_data_marts.actual_project_data (execute): 2023-11-20 07:16:20.774855 => 2023-11-20 07:16:22.114509
07:16:22.117265 [debug] [Thread-1 ]: Database Error in model actual_project_data (models/core/actual_project_data.sql) :HTTPDriver for http://65vl5ksg.mdb.yandexcloud.net:8123 returned response code 404) Code: 115. DB::Exception: Setting allow_nullable_key is neither a builtin setting nor started with the prefix 'custom_' registered for user-defined settings. (UNKNOWN_SETTING) (version 23.3.15.29 (official build)) compiled Code at target/run/iyno_data_marts/models/core/actual_project_data.sql
07:16:22.118169 [debug] [Thread-1 ]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': 'e3382018-9d72-4754-870d-b3f1ffe168d1', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fdacf762c50>]}
07:16:22.119424 [error] [Thread-1 ]: 1 of 1 ERROR creating sql incremental model core.actual_project_data ........... [ERROR in 1.38s]
07:16:22.122063 [debug] [Thread-1 ]: Finished running node model.iyno_data_marts.actual_project_data
07:16:22.149095 [debug] [MainThread]: Acquiring new clickhouse connection 'master'
07:16:22.150340 [debug] [MainThread]: Connection 'master' was properly closed.
07:16:22.151008 [debug] [MainThread]: Connection 'model.iyno_data_marts.actual_project_data' was left open.
07:16:22.151685 [debug] [MainThread]: On model.iyno_data_marts.actual_project_data: Close
07:16:22.152411 [info ] [MainThread]:
07:16:22.153139 [info ] [MainThread]: Finished running 1 incremental model in 0 hours 0 minutes and 2.04 seconds (2.04s).
07:16:22.153897 [debug] [MainThread]: Command end result
07:16:22.222691 [info ] [MainThread]:
07:16:22.223473 [info ] [MainThread]: Completed with 1 error and 0 warnings:
07:16:22.224136 [info ] [MainThread]:
07:16:22.224747 [error] [MainThread]: Database Error in model actual_project_data (models/core/actual_project_data.sql)
07:16:22.225336 [error] [MainThread]: :HTTPDriver for http://5vl5ksg.mdb.yandexcloud.net:8123 returned response code 404)
07:16:22.225902 [error] [MainThread]: Code: 115. DB::Exception: Setting allow_nullable_key is neither a builtin setting nor started with the prefix 'custom_' registered for user-defined settings. (UNKNOWN_SETTING) (version 23.3.15.29 (official build))
07:16:22.226560 [error] [MainThread]: compiled Code at target/run/iyno_data_marts/models/core/actual_project_data.sql
07:16:22.227209 [info ] [MainThread]:
07:16:22.227799 [info ] [MainThread]: Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
07:16:22.228522 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fdacf872450>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fdaeb2127d0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fdacf8d5e50>]}
07:16:22.229202 [debug] [MainThread]: Flushing usage events

genzgd commented 11 months ago

I believe this is a bug in the "legacy" incremental strategy, where settings that are supposed to be DDL "table" settings are incorrectly applied to the INSERT statements. I believe this will be fixed in the next patch release.
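Until then, one possible workaround, already hinted at by the commented-out lines in the model above, is to wrap the key columns in assumeNotNull() in the model's SELECT, so the target table no longer needs allow_nullable_key at all:

    -- sketch: make the key columns non-Nullable in the model itself
    SELECT assumeNotNull(project_active_periods.projectId) AS projectId
        ,assumeNotNull(project_active_periods.periodId) AS periodId
        ,assumeNotNull(project_active_models.modelId) AS modelId
        ,date_add(HOUR, 3, NOW()) AS "SYS_MOMENT"
    FROM project_active_periods
    JOIN project_active_models ON project_active_periods.projectId = project_active_models.projectId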