ClickHouse / dbt-clickhouse

The Clickhouse plugin for dbt (data build tool)
Apache License 2.0
253 stars 113 forks source link

max_bytes_before_external_group_by not working with DBT #351

Open rupesh-plateiq opened 2 months ago

rupesh-plateiq commented 2 months ago

Describe the bug

I am running a dbt model with _max_bytes_before_external_groupby settings. Still, I am getting Memory Error. If the run the same query on ClickHouse cloud it works perfectly fine

SELECT invoice_id,
        reason,
        argMax(message, date) AS message,
        MAX(_synced_at) AS _synced_at
FROM stg_invoices_invoicelog
GROUP BY invoice_id, reason
SETTINGS max_bytes_before_external_group_by=1000000000;

DBT model definition

{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        order_by='invoice_id',
        unique_key='invoice_id',
        query_settings={'max_bytes_before_external_group_by': 1000000000}
    )
}}

SELECT invoice_id,
        reason,
        argMax(message, date) AS message,
        MAX(_synced_at) AS _synced_at
FROM stg_invoices_invoicelog
GROUP BY invoice_id, reason

Expected behaviour

It should not give error

Configuration

Environment

ClickHouse server

genzgd commented 2 months ago

Can you check the ClickHouse query log (system.query_log) for the dbt version of the query? The query log should include the settings and you can compare the two queries directly.

genzgd commented 2 months ago

There's also a major difference between a DBT "incremental" model and a simple ClickHouse SELECT.

rupesh-plateiq commented 2 months ago

Can you check the ClickHouse query log (system.query_log) for the dbt version of the query? The query log should include the settings and you can compare the two queries directly.

This is the query that I got from the log. It is not applying the query settings. Am I doing anything incorrect when defining the query settings? @genzgd

/* {"app": "dbt", "dbt_version": "1.8.5", "profile_name": "ottimate", "target_name": "dev", "node_id": "model.ottimate.int_invoices_invoicelog"} */

        create table `ottimate`.`int_invoices_invoicelog`

  engine = MergeTree()
        order by (invoice_id)

                    -- end_of_sql
                    SETTINGS  replicated_deduplication_window=0

            empty
          as (

SELECT invoice_id,
        reason,
        argMax(message, date) AS message,
        MAX(_synced_at) AS _synced_at
FROM stg_invoices_invoicelog
GROUP BY invoice_id, reason
          )
rupesh-plateiq commented 2 months ago

There's also a major difference between a DBT "incremental" model and a simple ClickHouse SELECT.

This is kind of full refresh. I have cleaned the logic for incremental model to test if the simple query works.

Thanks for your quick reply.

genzgd commented 2 months ago

Thanks for the detailed log data! That looks like a bug in "full refresh", where an "intermediate" table is created and then swapped, but that intermediate table doesn't appear to have the query_settings applied. We'll see if we can pin down the actual problem over the next few days.