filipesilva opened 2 weeks ago
@filipesilva thanks for the repro! This makes fixing the bug easier. At first look something really weird happens: on ClickHouse, the code that handles nested tables is used to generate the merge SQL. There should be no nested tables, because you use the arrow backend, which should produce a single table.
I'll take a look.
In the meantime you may try to force-add `_dlt_id` to your arrow tables (https://dlthub.com/docs/dlt-ecosystem/verified-sources/arrow-pandas#add-_dlt_load_id-and-_dlt_id-to-your-tables) and see what tables you get in ClickHouse at the end.
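If the config route doesn't help, here is a minimal sketch of doing it by hand, assuming the pyarrow backend; `add_dlt_columns` is a hypothetical helper, not a dlt API:

```python
# Minimal sketch (pyarrow backend assumed): append the dlt columns by hand.
# add_dlt_columns is a hypothetical helper, not part of the dlt API.
import uuid
import pyarrow as pa

def add_dlt_columns(table: pa.Table, load_id: str) -> pa.Table:
    # _dlt_load_id is the same value for every row in a load package
    load_ids = pa.array([load_id] * table.num_rows, type=pa.string())
    # _dlt_id must be unique per row; random hex ids are one way to get that
    row_ids = pa.array(
        [uuid.uuid4().hex for _ in range(table.num_rows)], type=pa.string()
    )
    return table.append_column("_dlt_load_id", load_ids).append_column(
        "_dlt_id", row_ids
    )
```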
Hi @rudolfix, thanks for the quick reply! Sorry for only getting back to you now; I was on vacation last week.
I added this to `.dlt/config.toml`:
```toml
[normalize.parquet_normalizer]
add_dlt_load_id = true
add_dlt_id = true
```
and then ran the ClickHouse repro again:
```
[I] (.env) filipesilva@Filipes-MBP ~/s/dlt-replace-merge-bug (master)> python sql_database_pipeline.py clickhouse
=== Replace with cursor then merge with cursor ===
2024-10-29 12:31:18,130|[WARNING]|9571|8176547904|dlt|job_client_impl.py|_check_table_update_hints:583|Column(s) ['`_dlt_id`'] with hint row_key are being added to existing table jobs. Several hint types may not be added to existing tables.
2024-10-29 12:31:18,131|[WARNING]|9571|8176547904|dlt|job_client_impl.py|_check_table_update_hints:583|Column(s) [] with hint nullable are being added to existing table jobs. Several hint types may not be added to existing tables.
2024-10-29 12:31:18,131|[WARNING]|9571|8176547904|dlt|job_client_impl.py|_check_table_update_hints:583|Column(s) ['`_dlt_id`'] with hint unique are being added to existing table jobs. Several hint types may not be added to existing tables.
Pipeline repro_pipeline load step completed in 1.08 seconds
1 load package(s) were loaded to destination clickhouse and into dataset repro_dataset
The clickhouse destination used clickhouse://username:***@localhost:9000/my_database location to store data
Load package 1730205077.668959 is LOADED and contains no failed jobs
Pipeline repro_pipeline load step completed in ---
0 load package(s) were loaded to destination clickhouse and into dataset None
The clickhouse destination used clickhouse://username:***@localhost:9000/my_database location to store data
=== Replace without cursor then merge with cursor ===
Pipeline repro_pipeline load step completed in 1.07 seconds
1 load package(s) were loaded to destination clickhouse and into dataset repro_dataset
The clickhouse destination used clickhouse://username:***@localhost:9000/my_database location to store data
Load package 1730205079.340509 is LOADED and contains no failed jobs
2024-10-29 12:31:21,372|[WARNING]|9571|8176547904|dlt|job_client_impl.py|_check_table_update_hints:583|Column(s) ['`_dlt_id`'] with hint row_key are being added to existing table jobs. Several hint types may not be added to existing tables.
2024-10-29 12:31:21,372|[WARNING]|9571|8176547904|dlt|job_client_impl.py|_check_table_update_hints:583|Column(s) [] with hint nullable are being added to existing table jobs. Several hint types may not be added to existing tables.
2024-10-29 12:31:21,372|[WARNING]|9571|8176547904|dlt|job_client_impl.py|_check_table_update_hints:583|Column(s) ['`_dlt_id`'] with hint unique are being added to existing table jobs. Several hint types may not be added to existing tables.
Pipeline repro_pipeline load step completed in 2.13 seconds
1 load package(s) were loaded to destination clickhouse and into dataset repro_dataset
The clickhouse destination used clickhouse://username:***@localhost:9000/my_database location to store data
Load package 1730205080.8826492 is LOADED and contains no failed jobs
```
So indeed it does finish successfully now.
On ClickHouse, the result table has these fields and data:
```
[I] (.env) filipesilva@Filipes-MBP ~/s/dlt-replace-merge-bug (master)> echo "DESCRIBE TABLE my_database.repro_dataset___jobs" | curl 'username:password@localhost:8123/?query=' -s --data-binary @-
number Int64
name Nullable(String)
_dlt_load_id String
_dlt_id String
[I] (.env) filipesilva@Filipes-MBP ~/s/dlt-replace-merge-bug (master)> echo "SELECT * FROM my_database.repro_dataset___jobs" | curl 'username:password@localhost:8123/?query=' -s --data-binary @-
1 foo 1730205089.189604 R3k0zo69UOG+Ew
2 bar 1730205089.189604 CmfC6CodtnoaPA
```
Is this the resolution or is it more of a workaround? Ideally the destination table would not end up with the dlt fields.
### dlt version
1.2.0
### Describe the problem
On ClickHouse, if I switch to `write_disposition="merge"` and add a cursor after the pipeline was run with `write_disposition="replace"` and no cursor, the run will fail with the following error:

After this failure mode, subsequent runs will log `The pipeline run method will now load the pending load packages. The data you passed to the run function will not be loaded. In order to do that you must run the pipeline again` and attempt to re-run the failed pipeline, failing indefinitely until I run `rm -rf ~/.dlt/pipelines/`.

This does not happen with Postgres, and it does not happen if the initial `write_disposition="replace"` run used a cursor.
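For concreteness, here is a minimal sketch of that sequence, assuming dlt's `sql_table` source with the pyarrow backend; credentials, column names, and the primary key are illustrative, and the full repro lives in the repository linked under Steps to reproduce:

```python
import dlt
from dlt.sources.sql_database import sql_table

pipeline = dlt.pipeline(
    pipeline_name="repro_pipeline",
    destination="clickhouse",
    dataset_name="repro_dataset",
)

# First run: replace, no cursor (this one succeeds)
jobs = sql_table(credentials="sqlite:///repro.db", table="jobs", backend="pyarrow")
pipeline.run(jobs, write_disposition="replace")

# Second run: switch to merge and add an incremental cursor --
# on ClickHouse this fails, on Postgres it works
jobs = sql_table(
    credentials="sqlite:///repro.db",
    table="jobs",
    backend="pyarrow",
    incremental=dlt.sources.incremental("number"),
)
pipeline.run(jobs, write_disposition="merge", primary_key="number")
```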
### Expected behavior

I expect to be able to add a cursor and change the write disposition on ClickHouse in the same way it is possible with Postgres.
### Steps to reproduce
Clone and install dependencies for https://github.com/filipesilva/dlt-replace-merge-bug
In separate terminal windows, run the following scripts to launch the ClickHouse and Postgres Docker containers:

Go back to the first terminal and run the Postgres pipeline:

You should see:

But if you run the ClickHouse pipeline, it will error out:
After this failure mode, subsequent runs will log `The pipeline run method will now load the pending load packages. The data you passed to the run function will not be loaded. In order to do that you must run the pipeline again` and attempt to re-run the failed pipeline, failing indefinitely until I run `rm -rf ~/.dlt/pipelines/`. You can run this via `./scripts/clear-pending-pipelines.sh`.
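For context on that loop: as I understand dlt's documented behavior, `run()` first retries any pending load packages from a previous failed run before loading new data, so each call replays the broken merge package. A sketch (illustrative, not taken from the repro):

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="repro_pipeline",
    destination="clickhouse",
    dataset_name="repro_dataset",
)

# After the merge run fails, its load package stays pending under
# ~/.dlt/pipelines/repro_pipeline. Any later run() retries it first,
# so new data is never loaded and the same error recurs.
pipeline.run()  # replays the pending (failing) package
```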
### Operating system
macOS
### Runtime environment
Local
### Python version
3.11
### dlt data source
SQLite
### dlt destination
No response
### Other deployment details
I'm using ClickHouse as a destination and Python 3.12.5, but they do not appear in the bug issue dropdowns.
### Additional information
No response