ClickHouse / dbt-clickhouse

The Clickhouse plugin for dbt (data build tool)
Apache License 2.0

DBT logs say models were successful but clickhouse logs say they weren't. #141

Closed: alexsubota closed this issue 6 days ago

alexsubota commented 1 year ago

Hi, sometimes I encounter strange dbt behavior: when running a model, dbt says everything is fine, but when I check the ClickHouse logs, they say that the run failed due to "Memory limit exceeded".

Output:

Running with dbt=1.3.2
Found 298 models, 8 tests, 0 snapshots, 105 analyses, 316 macros, 0 operations, 14 seed files, 213 sources, 17 exposures, 0 metrics
Concurrency: 4 threads (target='prod')

1 of 1 START sql table model test_db.test_table .................. [RUN]
1 of 1 OK created sql table model test_db.test_table ............. [OK in 266.71s]
Finished running 1 table model in 0 hours 4 minutes and 27.58 seconds (267.58s).
Completed successfully
PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

How can I fix this problem and get the error in the dbt logs too, the same as in the ClickHouse logs?

ClickHouse version 22.12, dbt version 1.3.0

genzgd commented 1 year ago

What is the query in the ClickHouse logs that is throwing the exception (this should be available in system.query_log)? Are you using distributed tables? Are you using lightweight deletes? Materialized views? As far as I know, all errors from ClickHouse should be propagated up to dbt, so I would need more details to understand where (and maybe why) that is not happening.
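For reference, a query along these lines (a minimal sketch against the standard system.query_log columns; adjust the time window to your run) should surface failed queries and their error messages:

-- sketch: list recent queries that raised an exception
SELECT event_time, query_id, exception_code, exception, query
FROM system.query_log
WHERE type = 'ExceptionWhileProcessing'
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY event_time DESC
LIMIT 10;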

alexsubota commented 1 year ago

this should be available in the system.query_log

Yes, it's available

Are you using distributed tables?

Nope, only a MergeTree table with an ORDER BY, no other settings. The model has a simple select from a table with many joins.

Are you using lightweight deletes?

Nope

Materialized views?

Nope

genzgd commented 1 year ago

So what is the query that is causing the memory limit exceeded error? I'm guessing it's an insert into the table? Is it the CREATE TABLE statement? Is it an incremental materialization?

alexsubota commented 1 year ago

Not incremental, default table.

The query is something like this:

select col1, col2, col3, arrayFirst(), arrayFirst(), arrayCount() .....
from table t1
left join (select id, groupArray(...) from table2 group by id) t2
on t1.id = t2.id
genzgd commented 1 year ago

Do you have any error lines at all in the logs? When I run a bad model, I get something like this:

15:00:57.416070 [error] [MainThread]: Encountered an error:
Database Error
  Error executing HTTP request http://localhost:8123

Unfortunately it does NOT have the useful information like "memory limit exceeded" you are seeing, but I want to make sure you are seeing at least that line.

genzgd commented 1 year ago

One more question -- is the SELECT statement that is generating the Memory Limit Exceeded error part of a CREATE TABLE AS SELECT ... FROM query or part of an INSERT INTO SELECT ... FROM statement? And if it's a create table statement, is it followed by EMPTY (or empty)?

alexsubota commented 1 year ago

So, the dbt task started at 02:06:22 and finished at 02:10:48.

In the ClickHouse log I see the following:

02:06:22 QueryStart    create table test_db.test_table__dbt_backup engine = MergeTree() order by (id) empty as (....
02:06:22 QueryFinish   create table test_db.test_table__dbt_backup engine = MergeTree() order by (id) empty as (....
02:06:22 QueryStart    select name, type from system.columns where table = 'test_table__dbt_backup' and database = 'test_db' order by position
02:06:22 QueryFinish   select name, type from system.columns where table = 'test_table__dbt_backup' and database = 'test_db' order by position
02:06:22 QueryStart    insert into test_db.test_table__dbt_backup ......
02:10:46 ExceptionWhileProcessing insert into test_db.test_table__dbt_backup ......
02:10:48 QueryStart    EXCHANGE TABLES test_db.test_table__dbt_backup AND test_db.test_table
02:10:48 QueryFinish   EXCHANGE TABLES test_db.test_table__dbt_backup AND test_db.test_table
02:10:48 QueryStart    drop table if exists test_db.test_table__dbt_backup
02:10:48 QueryFinish   drop table if exists test_db.test_table__dbt_backup

genzgd commented 1 year ago

That looks like dbt is not seeing the exception from the client and keeps going as if the insert succeeded. Are you using the HTTP or the Native interface?

genzgd commented 1 year ago

And can you also post what is in the dbt logs for that time period, if anything? I'm trying to understand why that exception wasn't seen by dbt.

alexsubota commented 1 year ago

That looks like dbt is not seeing the exception from the client and keeps going as if the insert succeeded. Are you using the HTTP or the Native interface?

HTTP

And can you also post what is in the dbt logs for that time period, if anything? I'm trying to understand why that exception wasn't seen by dbt.

This error doesn't reproduce consistently; I don't have an up-to-date log right now.

genzgd commented 1 year ago

Sorry for all the questions, but I haven't been able to reproduce this behavior and I would really like to figure out why this happens. What versions of clickhouse-connect and dbt-clickhouse did you have installed?

alexsubota commented 1 year ago

Version 1.3.1 for both, dbt and dbt-clickhouse

genzgd commented 1 year ago

Can you update dbt-clickhouse to 1.3.3 and clickhouse-connect to the latest version (0.5.10)?

alexsubota commented 1 year ago

@genzgd dbt=1.4.6 Still got this problem

Logs:

[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08
[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08  Finished running 1 table model in 0 hours 4 minutes and 2.68 seconds (242.68s).
[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08
[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08  Completed successfully
[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08
[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

ClickHouse query log: [screenshot attached]

genzgd commented 1 year ago

What is the actual model definition? What is the actual query in the ClickHouse logs? What is in the dbt debug log? Since thousands of dbt-clickhouse models are successfully run every day, there's clearly something in your specific situation that is different, and it's not possible to tell what without complete diagnostic information.
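If it helps, the debug log can be captured by re-running just the failing model with debug logging enabled, roughly like this (the model selector is a placeholder; substitute your model name):

# re-run a single model with verbose adapter logging
dbt --debug run --select <your_model>
# the compiled SQL and any adapter errors are written to logs/dbt.log in the project directory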

marckeelingiv commented 1 year ago

Here is the dbt.log file when I get this error

Here is the model that I am running when this error is produced.

{{ config(
    order_by = [ "patient_id" ]
) }}
{% set start_datetime = date_parse(var('CUO_start_datetime'), 'datetime') %}
{% set end_datetime = date_parse(var('CUO_end_datetime'), 'datetime') %}
-- STEP 1: Imports
WITH patient_facility_population AS (
    SELECT * FROM {{ ref('int_patient_facility_population_for_not_deleted') }}
),
patient_referral_population AS (
    SELECT * FROM {{ ref('int_referral_population_for_not_deleted')}}
),
patient_insurance_population AS (
    SELECT * FROM {{ ref('int_insurance_population_for')}}
),
patient_visit_population AS (
    SELECT * FROM {{ ref('int_visit_population_for')}}
),
has_insights AS (
    SELECT *, 1 as has_insights FROM {{ ref('int_patients_with_insights')}}
),
pop_union as (
    SELECT
        facility_id, patient_id
    FROM
        patient_facility_population
    UNION ALL
    SELECT
        facility_id, patient_id
    FROM
        patient_referral_population
    UNION ALL
    SELECT
        facility_id, patient_id
    FROM
        patient_insurance_population
    UNION ALL
    SELECT
        facility_id, patient_id
    FROM
        patient_visit_population
),
payor_type AS (
    SELECT
        facility_id,
        patient_id,
        argMax(payor_type, created_datetime) AS recent_payor_type
    FROM
        {{ ref('stg_gdb__patient_insurance') }}
    WHERE
        facility_id||patient_id IN (SELECT facility_id||patient_id FROM pop_union)
    GROUP BY
        facility_id, patient_id
),
ed_visit_12 AS
(
    SELECT
        patient_id, 
        COUNT(*) AS ed_visits_12_month
    FROM
        {{ ref('stg_gdb__patient_visit') }}
    WHERE
        major_class = 'E'
        AND 
        admitted_datetime BETWEEN {{ start_datetime }} - INTERVAL 11 MONTH AND {{ end_datetime }}
    GROUP BY patient_id
),
-- STEP 2: final CTE
final AS (
    SELECT
        pop.facility_id AS facility_id,
        pop.patient_id AS patient_id,
        pt.recent_payor_type AS payor_type,
        COALESCE(hi.has_insights, 0) AS has_insights,
        COALESCE(ed12.ed_visits_12_month, 0) AS ed_visits_12_month
    FROM
        pop_union pop
    LEFT JOIN
        payor_type pt ON pop.facility_id = pt.facility_id AND pop.patient_id = pt.patient_id
    LEFT JOIN
        has_insights hi ON pop.patient_id = hi.patient_id
    LEFT JOIN
        ed_visit_12 ed12 ON pop.patient_id = ed12.patient_id
)
-- STEP 3: SELECT * FROM final
SELECT * FROM final

Here is the actual query in the ClickHouse logs:

/* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_gdw", "target_name": "dev", "node_id": "model.dbt_gdw.population_for_previous_month"} */

        insert into dev_mkeeling_analyst.population_for_previous_month__dbt_backup ("facility_id", "patient_id", "payor_type", "has_insights", "ed_visits_12_month")

-- STEP 1: Imports
WITH patient_facility_population AS (
    SELECT * FROM dev_mkeeling_analyst.int_patient_facility_population_for_not_deleted
),
patient_referral_population AS (
    SELECT * FROM dev_mkeeling_analyst.int_referral_population_for_not_deleted
),
patient_insurance_population AS (
    SELECT * FROM dev_mkeeling_analyst.int_insurance_population_for
),
patient_visit_population AS (
    SELECT * FROM dev_mkeeling_analyst.int_visit_population_for
),
has_insights AS (
    SELECT *, 1 as has_insights FROM dev_mkeeling_analyst.int_patients_with_insights
),
pop_union as (
    SELECT
        facility_id, patient_id
    FROM
        patient_facility_population
    UNION ALL
    SELECT
        facility_id, patient_id
    FROM
        patient_referral_population
    UNION ALL
    SELECT
        facility_id, patient_id
    FROM
        patient_insurance_population
    UNION ALL
    SELECT
        facility_id, patient_id
    FROM
        patient_visit_population
),
payor_type AS (
    SELECT
        facility_id,
        patient_id,
        argMax(payor_type, created_datetime) AS recent_payor_type
    FROM
        dev_mkeeling_analyst.stg_gdb__patient_insurance
    WHERE
        facility_id||patient_id IN (SELECT facility_id||patient_id FROM pop_union)
    GROUP BY
        facility_id, patient_id
),
ed_visit_12 AS
(
    SELECT
        patient_id, 
        COUNT(*) AS ed_visits_12_month
    FROM
        dev_mkeeling_analyst.stg_gdb__patient_visit
    WHERE
        major_class = 'E'
        AND 
        admitted_datetime BETWEEN 
        toDateTime('2023-08-01 00:00:00')  
 - INTERVAL 11 MONTH AND   
        toDateTime('2023-08-31 23:59:59')
    GROUP BY patient_id
),
-- STEP 2: final CTE
final AS (
    SELECT
        pop.facility_id AS facility_id,
        pop.patient_id AS patient_id,
        pt.recent_payor_type AS payor_type,
        COALESCE(hi.has_insights, 0) AS has_insights,
        COALESCE(ed12.ed_visits_12_month, 0) AS ed_visits_12_month
    FROM
        pop_union pop
    LEFT JOIN
        payor_type pt ON pop.facility_id = pt.facility_id AND pop.patient_id = pt.patient_id
    LEFT JOIN
        has_insights hi ON pop.patient_id = hi.patient_id
    LEFT JOIN
        ed_visit_12 ed12 ON pop.patient_id = ed12.patient_id
)
-- STEP 3: SELECT * FROM final
SELECT * FROM final
genzgd commented 1 year ago

Thanks for the log -- I assume the ClickHouse query_log showed that this query failed due to out of memory? Are you using the native or http driver? What's the driver (clickhouse-driver or clickhouse-connect) version? (That information should really be logged, but I don't see it.)

I'll see if I can reproduce the problem of dbt not reporting an OOM error next week. As a general comment, I don't know what your data volume is, but that many JOINs and UNION ALLs are obviously a challenge for your system, so you might want to think about your data model and possibly denormalizing some of your tables to allow simpler queries.
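One possible simplification worth testing (a sketch only, not verified against your data): the payor_type CTE filters on concatenated facility_id||patient_id strings, which forces ClickHouse to build an in-memory set of concatenated keys; filtering on a tuple of the raw columns keeps that set on native column types, for example:

-- sketch only: filter on a tuple of columns instead of a concatenated string key
payor_type AS (
    SELECT
        facility_id,
        patient_id,
        argMax(payor_type, created_datetime) AS recent_payor_type
    FROM
        {{ ref('stg_gdb__patient_insurance') }}
    WHERE
        (facility_id, patient_id) IN (SELECT facility_id, patient_id FROM pop_union)
    GROUP BY
        facility_id, patient_id
),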

marckeelingiv commented 1 year ago

Correct. Clickhouse query_log indicates out of memory. Code: 241. DB::Exception: Memory limit (total) exceeded: would use 453.45 GiB (attempt to allocate chunk of 1607988799 bytes), maximum: 453.00 GiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker.: While executing CreatingSetsTransform. (MEMORY_LIMIT_EXCEEDED) (version 23.5.3.24 (official build))

The query log indicates that the query agent is: dbt/1.4.6 dbt-clickhouse/1.4.8 clickhouse-connect/0.6.4 (lv:py/3.10.10; os:win32)

marckeelingiv commented 1 year ago

And I'll let my analyst know to explore simplifying the model.

genzgd commented 1 year ago

Unfortunately I can't reproduce this in the simplest case -- a small model designed to generate an OOM exception works as expected. It's going to be difficult to track down the problem without a relatively simple reproducible model.

marckeelingiv commented 1 year ago

One last piece of information that I forgot to provide: to make it fail, I set an upstream model as a view instead of a table. When that model is materialized as a view, that is when the false positive is observed.
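For reference, the toggle that flips the behavior is just the upstream model's materialization config (standard dbt syntax):

{{ config(materialized='view') }}   -- upstream as a view: false positive observed
{{ config(materialized='table') }}  -- upstream as a table: error surfaces in dbt as expected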

marckeelingiv commented 1 year ago

Do you think using

driver: native

instead of the default

driver: http

would make a difference?

genzgd commented 1 year ago

It's worth seeing if the behavior is the same. You would also have to change to the native port (9000 or 9440).
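For anyone trying the same switch, a sketch of the profile change (field names as in the dbt-clickhouse profile docs; host, credentials and schema here are placeholders):

# profiles.yml (sketch; host and credentials are placeholders)
dbt_gdw:
  target: dev
  outputs:
    dev:
      type: clickhouse
      driver: native            # default is http
      host: clickhouse.example.com
      port: 9000                # native TCP port; use 9440 with secure: true
      user: dbt_user
      password: "<password>"
      schema: dev_mkeeling_analyst
      secure: false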

oleg-savko commented 1 year ago

Found one stable, reproducible example: it happens when using several mysql table functions and the connection to MySQL is lost:

ClickHouse query_log exception (on insert into mm_automation.test_case_run__dbt_backup ...): Poco::Exception. Code: 1000, e.code() = 2013, mysqlxx::Exception: Lost connection to MySQL server during query (....:3306) (version 23.8.2.7 (official build))

dbt log:

07:31:19.263121 [debug] [ThreadPool]: dbt_clickhouse adapter: SQL status: OK in 0.28 seconds
07:31:19.265281 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': '87c26943-9eed-470b-948f-cb91197d294b', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f95e86c8a00>]}
07:31:19.266296 [info ] [MainThread]: Concurrency: 1 threads (target='prod')
07:31:19.266633 [info ] [MainThread]: 
07:31:19.271437 [debug] [Thread-1 (]: Began running node model.dbt_global.test_case_run
07:31:19.272280 [info ] [Thread-1 (]: 1 of 1 START sql table model mm_automation.test_case_run ....................... [RUN]
07:31:19.272934 [debug] [Thread-1 (]: Acquiring new clickhouse connection 'model.dbt_global.test_case_run'
07:31:19.273234 [debug] [Thread-1 (]: Re-using an available connection from the pool (formerly model.dbt_global.test_case_run)
07:31:19.273542 [debug] [Thread-1 (]: Began compiling node model.dbt_global.test_case_run
07:31:19.276494 [debug] [Thread-1 (]: Writing injected SQL for node "model.dbt_global.test_case_run"
07:31:19.277237 [debug] [Thread-1 (]: Timing info for model.dbt_global.test_case_run (compile): 2023-10-03 07:31:19.273816 => 2023-10-03 07:31:19.277156
07:31:19.277611 [debug] [Thread-1 (]: Began executing node model.dbt_global.test_case_run
07:31:19.306694 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */

        set memory_overcommit_ratio_denominator = 0;
      ...
07:31:19.475870 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 0.17 seconds
07:31:19.480147 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */

        set memory_overcommit_ratio_denominator_for_user = 0;
      ...
07:31:19.664964 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 0.18 seconds
07:31:19.704396 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */

        create table mm_automation.test_case_run__dbt_backup

  engine = MergeTree()
        order by (id)

        empty
    as (

select *
from mysql(automation_stat, table = 'test_case')
union all
select *
from mysql(ru_automation_stat, table = 'test_case')
union all
select *
from mysql(sbru_automation_stat, table = 'test_case')
union all
select *
from mysql(kz_automation_stat, table = 'test_case')
    )
        ...
07:31:22.150527 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 2.45 seconds
07:31:22.166370 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */

    select name, type from system.columns where table = 'test_case_run__dbt_backup'

      and database = 'mm_automation'

    order by position
  ...
07:31:22.312705 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 0.15 seconds
07:31:22.323908 [debug] [Thread-1 (]: Writing runtime sql for node "model.dbt_global.test_case_run"
07:31:22.325276 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */

        insert into mm_automation.test_case_run__dbt_backup ("id", "test_run_uuid", "test_case_id", "scenario_name", "start_time", "duration", "status", "manual_execution_minutes")

select *
from mysql(automation_stat, table = 'test_case')
union all
select *
from mysql(ru_automation_stat, table = 'test_case')
union all
select *
from mysql(sbru_automation_stat, table = 'test_case')
union all
select *
from mysql(kz_automation_stat, table = 'test_case')
  ...
07:39:24.538243 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 482.21 seconds
07:39:24.572295 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */
EXCHANGE TABLES mm_automation.test_case_run__dbt_backup AND mm_automation.test_case_run 

  ...
07:39:24.718893 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 0.15 seconds
07:39:24.754388 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */
drop table if exists mm_automation.test_case_run__dbt_backup 
  ...
07:39:24.974994 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 0.22 seconds
07:39:24.980217 [debug] [Thread-1 (]: Timing info for model.dbt_global.test_case_run (execute): 2023-10-03 07:31:19.277886 => 2023-10-03 07:39:24.979547
07:39:24.989273 [debug] [Thread-1 (]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '87c26943-9eed-470b-948f-cb91197d294b', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f95eab99a20>]}
07:39:24.992985 [info ] [Thread-1 (]: 1 of 1 OK created sql table model mm_automation.test_case_run .................. [OK in 485.71s]
07:39:24.998341 [debug] [Thread-1 (]: Finished running node model.dbt_global.test_case_run
07:39:25.004357 [debug] [MainThread]: Acquiring new clickhouse connection 'master'
07:39:25.005809 [debug] [MainThread]: Connection 'master' was properly closed.
07:39:25.006346 [debug] [MainThread]: Connection 'model.dbt_global.test_case_run' was left open.
07:39:25.006780 [debug] [MainThread]: On model.dbt_global.test_case_run: Close
07:39:25.007532 [info ] [MainThread]: 
07:39:25.008068 [info ] [MainThread]: Finished running 1 table model in 0 hours 8 minutes and 8.90 seconds (488.90s).
07:39:25.008833 [debug] [MainThread]: Command end result
07:39:25.044562 [info ] [MainThread]: 
07:39:25.045102 [info ] [MainThread]: Completed successfully
07:39:25.045434 [info ] [MainThread]: 
07:39:25.045740 [info ] [MainThread]: Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
07:39:25.046320 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f95c8119270>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f95c8119570>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f95c8119690>]}
07:39:25.046752 [debug] [MainThread]: Flushing usage events
genzgd commented 1 year ago

Thanks for the information on an additional occurrence. I assume you're running HTTP as your driver? I think the issue is related to long running models. Unfortunately the most useful "reproducible" example would be one I could create locally.

oleg-savko commented 1 year ago

Yes, we use http, but we also have this property in the dbt config: send_receive_timeout: 1000

and in clickhouse:

<external_storage_rw_timeout_sec>3000</external_storage_rw_timeout_sec>
<external_storage_connect_timeout_sec>2000</external_storage_connect_timeout_sec>

and without a lost connection the dbt model can run successfully for several hours without problems

oleg-savko commented 1 year ago

One more thing that I found and which needs attention: we run our dbt models via Airflow in parallel, and have this property in ClickHouse: <max_memory_usage_for_user>XXXXXX</max_memory_usage_for_user>

and when one of the parallel runs gets an OOM for that user, it seems another run can hit this issue

It happened constantly, somewhere every day, before last Friday; after we changed to <max_memory_usage_for_user>0</max_memory_usage_for_user> on Friday, we no longer see this issue.

marckeelingiv commented 1 year ago

Switching from

driver: http

to

driver: native

seems to fix the issue. Just had to connect to port 9000 instead. Does that provide any good direction for troubleshooting the issue?

genzgd commented 1 year ago

It's a start. In many ways the native TCP connection is more stable since it is a permanent connection and errors aren't wrapped in HTTP status codes or other cruft. It also means the problem is more likely to be driver level than dbt level.

marckeelingiv commented 1 year ago

Does that mean that this issue needs to be on the ClickHouse repo?

genzgd commented 1 year ago

It's more likely to be the interaction between dbt and the clickhouse-connect driver, but hard to tell until I can find a way to reproduce locally.

quinnalfaro commented 4 months ago

Hey there, I seem to be experiencing this issue as well. My model has some joins in it and errors out after 4 minutes according to the ClickHouse logs, but dbt also says that the model passes and runs downstream models. I can post any logs or examples if you'd like.

genzgd commented 6 days ago

I believe this issue will be fixed if you use clickhouse_connect version 0.8.4 or later (hopefully with a recent urllib3 version installed as a dependency). We discovered a bug where ClickHouse would return a 200 success response code in some circumstances where there was actually an error (such as out of memory or the failure of a third party connection). clickhouse_connect should now properly recognize the ClickHouse error and dbt should behave accordingly.
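For anyone hitting this, upgrading the driver (and optionally the adapter) in the environment where dbt runs is typically just:

# upgrade the driver past the fix, pulling a recent urllib3 as a dependency
pip install --upgrade "clickhouse-connect>=0.8.4" dbt-clickhouse urllib3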