What is the query in the ClickHouse logs that is throwing the exception? (This should be available in the system.query_log.) Are you using distributed tables? Are you using lightweight deletes? Materialized views? As far as I know all errors from ClickHouse should be propagated up to dbt, so I would need more details to understand where (and maybe why) that is not happening.
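For reference, a query along these lines should surface the failing statement and its exception (just a sketch over the standard system.query_log columns; adjust the time window as needed):

-- sketch: recent failed queries and their exceptions from the query log
select event_time, type, exception_code, exception, query
from system.query_log
where type = 'ExceptionWhileProcessing'
and event_time > now() - INTERVAL 1 DAY
order by event_time desc
limit 10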
this should be available in the system.query_log
Yes, it's available
Are you using distributed tables?
Nope, only a MergeTree table with an ORDER BY, no other settings. The model has a simple select from a table with many joins.
Are you using lightweight deletes?
Nope
Materialized views?
Nope
So what is the query that is causing the memory limit exceeded error? I'm guessing it's an insert into the table? Is it the CREATE TABLE statement? Is it an incremental materialization?
Not incremental, just the default table materialization.
The query is something like this:
select col1, col2, col3, arrayFirst(), arrayFirst(), arrayCount() .....
from table t1
left join (select id, groupArray(...) from table2 group by id) t2
on t1.id = t2.id
Do you have any error lines at all in the logs? When I run a bad model, I get something like this:
15:00:57.416070 [error] [MainThread]: Encountered an error:
Database Error
Error executing HTTP request http://localhost:8123
Unfortunately it does NOT have the useful information like "memory limit exceeded" you are seeing, but I want to make sure you are seeing at least that line.
One more question -- is the SELECT statement that is generating the Memory Limit Exceeded error part of a CREATE TABLE AS SELECT ... FROM query or part of an INSERT INTO SELECT ... FROM statement? And if it's a create table statement, is it followed by EMPTY (or empty)?
So, the dbt task was started at 02:06:22 and finished at 02:10:48.
In the ClickHouse log I see the following:
02:06:22 QueryStart
create table test_db.test_table__dbt_backup engine = MergeTree() order by (id) empty as (....
02:06:22 QueryFinish
create table test_db.test_table__dbt_backup engine = MergeTree() order by (id) empty as (....
02:06:22 QueryStart
select name, type from system.columns where table = 'test_table__dbt_backup' and database = 'test_db' order by position
02:06:22 QueryFinish
select name, type from system.columns where table = 'test_table__dbt_backup' and database = 'test_db' order by position
02:06:22 QueryStart
insert into test_db.test_table__dbt_backup ......
02:10:46 ExceptionWhileProcessing
insert into test_db.test_table__dbt_backup ......
02:10:48 QueryStart
EXCHANGE TABLES test_db.test_table__dbt_backup AND test_db.test_table
02:10:48 QueryFinish
EXCHANGE TABLES test_db.test_table__dbt_backup AND test_db.test_table
02:10:48 QueryStart
drop table if exists test_db.test_table__dbt_backup
02:10:48 QueryFinish
drop table if exists test_db.test_table__dbt_backup
That looks like dbt is not seeing the exception from the client and keeps going like the insert succeeded. Are you using the HTTP or the Native interface?
And can you also post what is in the dbt logs for that time period, if anything? I'm trying to understand why that exception wasn't seen by dbt.
That looks like dbt is not seeing the exception from the client and keeps going like the insert succeeded. Are you using the HTTP or the Native interface?
HTTP
And can you also post what is in the dbt logs for that time period, if anything? I'm trying to understand why that exception wasn't seen by dbt.
This is not a recurring error; I don't have an up-to-date log right now.
Sorry for all the questions, but I haven't been able to reproduce this behavior and I would really like to figure out why this happens. What versions of clickhouse-connect and dbt-clickhouse did you have installed?
Version 1.3.1 for both dbt and dbt-clickhouse.
Can you update dbt-clickhouse to 1.3.3 and clickhouse-connect to the latest version (0.5.10)?
@genzgd dbt=1.4.6 Still got this problem
Logs:
[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08
[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08 Finished running 1 table model in 0 hours 4 minutes and 2.68 seconds (242.68s).
[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08
[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08 Completed successfully
[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08
[2023-08-04, 06:46:08 UTC] {subprocess.py:93} INFO - 06:46:08 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
Clickhouse query log:
What is the actual model definition? What is the actual query in the ClickHouse logs? What is in the dbt debug log? Since thousands of dbt-clickhouse models are successfully run every day, there's clearly something in your specific situation that is different, and it's not possible to tell what without complete diagnostic information.
Here is the dbt.log file from when I get this error.
Here is the model that I am running when this error is produced.
{{ config(
order_by = [ "patient_id" ]
) }}
{% set start_datetime = date_parse(var('CUO_start_datetime'), 'datetime') %}
{% set end_datetime = date_parse(var('CUO_end_datetime'), 'datetime') %}
-- STEP 1: Imports
WITH patient_facility_population AS (
SELECT * FROM {{ ref('int_patient_facility_population_for_not_deleted') }}
),
patient_referral_population AS (
SELECT * FROM {{ ref('int_referral_population_for_not_deleted')}}
),
patient_insurance_population AS (
SELECT * FROM {{ ref('int_insurance_population_for')}}
),
patient_visit_population AS (
SELECT * FROM {{ ref('int_visit_population_for')}}
),
has_insights AS (
SELECT *, 1 as has_insights FROM {{ ref('int_patients_with_insights')}}
),
pop_union as (
SELECT
facility_id, patient_id
FROM
patient_facility_population
UNION ALL
SELECT
facility_id, patient_id
FROM
patient_referral_population
UNION ALL
SELECT
facility_id, patient_id
FROM
patient_insurance_population
UNION ALL
SELECT
facility_id, patient_id
FROM
patient_visit_population
),
payor_type AS (
SELECT
facility_id,
patient_id,
argMax(payor_type, created_datetime) AS recent_payor_type
FROM
{{ ref('stg_gdb__patient_insurance') }}
WHERE
facility_id||patient_id IN (SELECT facility_id||patient_id FROM pop_union)
GROUP BY
facility_id, patient_id
),
ed_visit_12 AS
(
SELECT
patient_id,
COUNT(*) AS ed_visits_12_month
FROM
{{ ref('stg_gdb__patient_visit') }}
WHERE
major_class = 'E'
AND
admitted_datetime BETWEEN {{ start_datetime }} - INTERVAL 11 MONTH AND {{ end_datetime }}
GROUP BY patient_id
),
-- STEP 2: final CTE
final AS (
SELECT
pop.facility_id AS facility_id,
pop.patient_id AS patient_id,
pt.recent_payor_type AS payor_type,
COALESCE(hi.has_insights, 0) AS has_insights,
COALESCE(ed12.ed_visits_12_month, 0) AS ed_visits_12_month
FROM
pop_union pop
LEFT JOIN
payor_type pt ON pop.facility_id = pt.facility_id AND pop.patient_id = pt.patient_id
LEFT JOIN
has_insights hi ON pop.patient_id = hi.patient_id
LEFT JOIN
ed_visit_12 ed12 ON pop.patient_id = ed12.patient_id
)
-- STEP 3: SELECT * FROM final
SELECT * FROM final
Here is the actual query in the ClickHouse logs:
/* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_gdw", "target_name": "dev", "node_id": "model.dbt_gdw.population_for_previous_month"} */
insert into dev_mkeeling_analyst.population_for_previous_month__dbt_backup ("facility_id", "patient_id", "payor_type", "has_insights", "ed_visits_12_month")
-- STEP 1: Imports
WITH patient_facility_population AS (
SELECT * FROM dev_mkeeling_analyst.int_patient_facility_population_for_not_deleted
),
patient_referral_population AS (
SELECT * FROM dev_mkeeling_analyst.int_referral_population_for_not_deleted
),
patient_insurance_population AS (
SELECT * FROM dev_mkeeling_analyst.int_insurance_population_for
),
patient_visit_population AS (
SELECT * FROM dev_mkeeling_analyst.int_visit_population_for
),
has_insights AS (
SELECT *, 1 as has_insights FROM dev_mkeeling_analyst.int_patients_with_insights
),
pop_union as (
SELECT
facility_id, patient_id
FROM
patient_facility_population
UNION ALL
SELECT
facility_id, patient_id
FROM
patient_referral_population
UNION ALL
SELECT
facility_id, patient_id
FROM
patient_insurance_population
UNION ALL
SELECT
facility_id, patient_id
FROM
patient_visit_population
),
payor_type AS (
SELECT
facility_id,
patient_id,
argMax(payor_type, created_datetime) AS recent_payor_type
FROM
dev_mkeeling_analyst.stg_gdb__patient_insurance
WHERE
facility_id||patient_id IN (SELECT facility_id||patient_id FROM pop_union)
GROUP BY
facility_id, patient_id
),
ed_visit_12 AS
(
SELECT
patient_id,
COUNT(*) AS ed_visits_12_month
FROM
dev_mkeeling_analyst.stg_gdb__patient_visit
WHERE
major_class = 'E'
AND
admitted_datetime BETWEEN
toDateTime('2023-08-01 00:00:00')
- INTERVAL 11 MONTH AND
toDateTime('2023-08-31 23:59:59')
GROUP BY patient_id
),
-- STEP 2: final CTE
final AS (
SELECT
pop.facility_id AS facility_id,
pop.patient_id AS patient_id,
pt.recent_payor_type AS payor_type,
COALESCE(hi.has_insights, 0) AS has_insights,
COALESCE(ed12.ed_visits_12_month, 0) AS ed_visits_12_month
FROM
pop_union pop
LEFT JOIN
payor_type pt ON pop.facility_id = pt.facility_id AND pop.patient_id = pt.patient_id
LEFT JOIN
has_insights hi ON pop.patient_id = hi.patient_id
LEFT JOIN
ed_visit_12 ed12 ON pop.patient_id = ed12.patient_id
)
-- STEP 3: SELECT * FROM final
SELECT * FROM final
Thanks for the log -- I assume the ClickHouse query_log showed that the query failed due to out of memory? Are you using the native or http driver? What's the driver (clickhouse-driver or clickhouse-connect) version? (That information should really be logged, but I don't see it.)
I'll see if I can reproduce the problem of dbt not reporting an OOM error next week. As a general comment, I don't know what your data volume is, but that many JOINs and UNION ALLs are obviously a challenge for your system, so you might want to think about your data model and possibly denormalizing some of your tables to allow simpler queries.
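If it helps to quantify that, the query log also records per-query memory, so something along these lines (again only a sketch over standard system.query_log columns) can show which statements are the heaviest:

-- sketch: the most memory-hungry recent statements, finished or failed
select event_time, formatReadableSize(memory_usage) as mem, query_duration_ms, substring(query, 1, 120) as query_head
from system.query_log
where type in ('QueryFinish', 'ExceptionWhileProcessing')
and event_time > now() - INTERVAL 1 DAY
order by memory_usage desc
limit 10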
Correct. The ClickHouse query_log indicates out of memory: Code: 241. DB::Exception: Memory limit (total) exceeded: would use 453.45 GiB (attempt to allocate chunk of 1607988799 bytes), maximum: 453.00 GiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker.: While executing CreatingSetsTransform. (MEMORY_LIMIT_EXCEEDED) (version 23.5.3.24 (official build))
The query log indicates that the query agent is: dbt/1.4.6 dbt-clickhouse/1.4.8 clickhouse-connect/0.6.4 (lv:py/3.10.10; os:win32)
And I'll let my analyst know to explore simplifying the model.
Unfortunately I can't reproduce this in the simplest case -- a small model designed to generate an OOM exception works as expected. It's going to be difficult to track down the problem without a relatively simple reproducible model.
One last piece of information that I forgot to provide: to make it fail, I set an upstream model as a view instead of a table. When that upstream model is materialized as a view, that is when the false positive is observed.
Do you think using driver: native instead of the default driver: http would make a difference?
It's worth seeing if the behavior is the same. You would also have to change to the native port (9000 or 9440).
Found one stable, reproducible example when using several mysql table functions and the connection to MySQL is lost.
clickhouse query_log exception (on the insert into mm_automation.test_case_run__dbt_backup ...):
Poco::Exception. Code: 1000, e.code() = 2013, mysqlxx::Exception: Lost connection to MySQL server during query (....:3306) (version 23.8.2.7 (official build))
dbt log:
07:31:19.263121 [debug] [ThreadPool]: dbt_clickhouse adapter: SQL status: OK in 0.28 seconds
07:31:19.265281 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': '87c26943-9eed-470b-948f-cb91197d294b', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f95e86c8a00>]}
07:31:19.266296 [info ] [MainThread]: Concurrency: 1 threads (target='prod')
07:31:19.266633 [info ] [MainThread]:
07:31:19.271437 [debug] [Thread-1 (]: Began running node model.dbt_global.test_case_run
07:31:19.272280 [info ] [Thread-1 (]: 1 of 1 START sql table model mm_automation.test_case_run ....................... [RUN]
07:31:19.272934 [debug] [Thread-1 (]: Acquiring new clickhouse connection 'model.dbt_global.test_case_run'
07:31:19.273234 [debug] [Thread-1 (]: Re-using an available connection from the pool (formerly model.dbt_global.test_case_run)
07:31:19.273542 [debug] [Thread-1 (]: Began compiling node model.dbt_global.test_case_run
07:31:19.276494 [debug] [Thread-1 (]: Writing injected SQL for node "model.dbt_global.test_case_run"
07:31:19.277237 [debug] [Thread-1 (]: Timing info for model.dbt_global.test_case_run (compile): 2023-10-03 07:31:19.273816 => 2023-10-03 07:31:19.277156
07:31:19.277611 [debug] [Thread-1 (]: Began executing node model.dbt_global.test_case_run
07:31:19.306694 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */
set memory_overcommit_ratio_denominator = 0;
...
07:31:19.475870 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 0.17 seconds
07:31:19.480147 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */
set memory_overcommit_ratio_denominator_for_user = 0;
...
07:31:19.664964 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 0.18 seconds
07:31:19.704396 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */
create table mm_automation.test_case_run__dbt_backup
engine = MergeTree()
order by (id)
empty
as (
select *
from mysql(automation_stat, table = 'test_case')
union all
select *
from mysql(ru_automation_stat, table = 'test_case')
union all
select *
from mysql(sbru_automation_stat, table = 'test_case')
union all
select *
from mysql(kz_automation_stat, table = 'test_case')
)
...
07:31:22.150527 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 2.45 seconds
07:31:22.166370 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */
select name, type from system.columns where table = 'test_case_run__dbt_backup'
and database = 'mm_automation'
order by position
...
07:31:22.312705 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 0.15 seconds
07:31:22.323908 [debug] [Thread-1 (]: Writing runtime sql for node "model.dbt_global.test_case_run"
07:31:22.325276 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */
insert into mm_automation.test_case_run__dbt_backup ("id", "test_run_uuid", "test_case_id", "scenario_name", "start_time", "duration", "status", "manual_execution_minutes")
select *
from mysql(automation_stat, table = 'test_case')
union all
select *
from mysql(ru_automation_stat, table = 'test_case')
union all
select *
from mysql(sbru_automation_stat, table = 'test_case')
union all
select *
from mysql(kz_automation_stat, table = 'test_case')
...
07:39:24.538243 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 482.21 seconds
07:39:24.572295 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */
EXCHANGE TABLES mm_automation.test_case_run__dbt_backup AND mm_automation.test_case_run
...
07:39:24.718893 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 0.15 seconds
07:39:24.754388 [debug] [Thread-1 (]: dbt_clickhouse adapter: On model.dbt_global.test_case_run: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "dbt_global", "target_name": "prod", "node_id": "model.dbt_global.test_case_run"} */
drop table if exists mm_automation.test_case_run__dbt_backup
...
07:39:24.974994 [debug] [Thread-1 (]: dbt_clickhouse adapter: SQL status: OK in 0.22 seconds
07:39:24.980217 [debug] [Thread-1 (]: Timing info for model.dbt_global.test_case_run (execute): 2023-10-03 07:31:19.277886 => 2023-10-03 07:39:24.979547
07:39:24.989273 [debug] [Thread-1 (]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '87c26943-9eed-470b-948f-cb91197d294b', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f95eab99a20>]}
07:39:24.992985 [info ] [Thread-1 (]: 1 of 1 OK created sql table model mm_automation.test_case_run .................. [OK in 485.71s]
07:39:24.998341 [debug] [Thread-1 (]: Finished running node model.dbt_global.test_case_run
07:39:25.004357 [debug] [MainThread]: Acquiring new clickhouse connection 'master'
07:39:25.005809 [debug] [MainThread]: Connection 'master' was properly closed.
07:39:25.006346 [debug] [MainThread]: Connection 'model.dbt_global.test_case_run' was left open.
07:39:25.006780 [debug] [MainThread]: On model.dbt_global.test_case_run: Close
07:39:25.007532 [info ] [MainThread]:
07:39:25.008068 [info ] [MainThread]: Finished running 1 table model in 0 hours 8 minutes and 8.90 seconds (488.90s).
07:39:25.008833 [debug] [MainThread]: Command end result
07:39:25.044562 [info ] [MainThread]:
07:39:25.045102 [info ] [MainThread]: Completed successfully
07:39:25.045434 [info ] [MainThread]:
07:39:25.045740 [info ] [MainThread]: Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
07:39:25.046320 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f95c8119270>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f95c8119570>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7f95c8119690>]}
07:39:25.046752 [debug] [MainThread]: Flushing usage events
Thanks for the information on an additional occurrence. I assume you're running HTTP as your driver? I think the issue is related to long running models. Unfortunately the most useful "reproducible" example would be one I could create locally.
Yes, we use http, but we also have these properties in the dbt config:
send_receive_timeout: 1000
and in ClickHouse:
<external_storage_rw_timeout_sec>3000</external_storage_rw_timeout_sec>
<external_storage_connect_timeout_sec>2000</external_storage_connect_timeout_sec>
Without a lost connection, a dbt model can run successfully for several hours without problems.
One more thing that I found and which needs attention: we run our dbt models via Airflow in parallel, and have this property in ClickHouse:
<max_memory_usage_for_user>XXXXXX</max_memory_usage_for_user>
When one of the parallel runs gets an OOM-for-user error, it seems another run can catch this issue.
This happened constantly, somewhere every day, until last Friday. On Friday we changed it to
<max_memory_usage_for_user>0</max_memory_usage_for_user>
and we no longer have this issue.
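For anyone checking the same thing on their side, the effective limits for the current connection can be inspected with something like this (a sketch over the standard system.settings table):

-- sketch: current per-query and per-user memory limits
select name, value, changed
from system.settings
where name in ('max_memory_usage', 'max_memory_usage_for_user')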
Switching from driver: http to driver: native seems to fix the issue. Just had to connect to port 9000 instead. Does that provide any good direction for troubleshooting the issue?
It's a start. In many ways the native TCP connection is more stable since it is a permanent connection and errors aren't wrapped in HTTP status codes or other cruft. It also means the problem is more likely to be driver level than dbt level.
Does that mean that this issue needs to be on the ClickHouse repo?
It's more likely to be the interaction between dbt and the clickhouse-connect driver, but hard to tell until I can find a way to reproduce locally.
Hey there, I seem to be experiencing this issue as well. My model has some joins in it and errors out after 4 minutes according to the ClickHouse logs, but dbt also says that the model passes and runs downstream models. I can post any logs or examples if you'd like.
I believe this issue will be fixed if you use clickhouse_connect version 0.8.4 or later (hopefully with a recent urllib3 version installed as a dependency). We discovered a bug where ClickHouse would return a 200 success response code in some circumstances where there was actually an error (such as out of memory or the failure of a third-party connection). clickhouse_connect should now properly recognize the ClickHouse error and dbt should behave accordingly.
Hi, sometimes I encounter strange dbt behavior when running a model: it says everything is fine, but when I check the ClickHouse logs, it says that the run failed due to "Memory limit exceeded".
`Output: Running with dbt=1.3.2 Found 298 models, 8 tests, 0 snapshots, 105 analyses, 316 macros, 0 operations, 14 seed files, 213 sources, 17 exposures, 0 metrics Concurrency: 4 threads (target='prod')
1 of 1 START sql table model test_db.test_table .................. [RUN] 1 of 1 OK created sql table model test_db.test_table ............. [OK in 266.71s] Finished running 1 table model in 0 hours 4 minutes and 27.58 seconds (267.58s). Completed successfully PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1`
How can I fix this problem and get the error in the dbt logs too, the same as in the ClickHouse logs?
ClickHouse version 22.12, dbt version 1.3.0