ClickHouse / metabase-clickhouse-driver

ClickHouse database driver for the Metabase business intelligence front-end
Apache License 2.0
476 stars 92 forks source link

ClickHouse Cloud Uploads: Appending duplicate CSV file doesn't append new values #237

Closed calherries closed 4 months ago

calherries commented 4 months ago

Bug with https://github.com/ClickHouse/metabase-clickhouse-driver/pull/236

If you append the same CSV file to a table that you have uploaded to the table previously, new rows will not be inserted into the table.

Steps to reproduce:

  1. Enable CSV uploads on a ClickHouse Cloud DB.
  2. Refresh the page.
  3. Upload a CSV file to a collection.
  4. Append the same CSV file to the existing model created in the previous step. Note the rows in the table is the same from step 3 even though the UI reports the append action as succeeding.
  5. Modify the data in the CSV file, and repeat step 4, appending the CSV file to the model. Note the data is now actually appended to the data in the table.
slvrtrn commented 4 months ago

I think that's how deduplication works.

If you create a table in CH Cloud like this:

CREATE TABLE foo
(
    `i` UInt32,
    `s` String
)
ENGINE = MergeTree
ORDER BY i;

And then insert the same thing two times:

INSERT INTO foo VALUES (1, 'q');
INSERT INTO foo VALUES (1, 'q');

The result will be:

SELECT * FROM foo FORMAT JSONEachRow;
{"i":1,"s":"q"}

I think this is related: https://clickhouse.com/blog/common-getting-started-issues-with-clickhouse#5-deduplication-at-insert-time

calherries commented 4 months ago

Got it. So to allow users to upload duplicate data, we'd have to create a unique column. We can do that with an auto-incrementing integer column or UUID column. I'm now thinking this is a good reason to use UUIDs for the auto-generated primary key column _mb_row_id for ClickHouse uploads. Previously I avoided that because I wanted to wait for auto-incrementing types to be supported by ClickHouse.

slvrtrn commented 4 months ago

Is uploading duplicates over a particular (small) time window a popular use case?

Additionally, you can, of course, try to disable https://clickhouse.com/docs/en/operations/settings/merge-tree-settings#replicated-deduplication-window.

slvrtrn commented 4 months ago

To enable duplicate inserts, if that is a popular use-case and is necessary, you could do the following during the table creation:

CREATE TABLE foo
(
    `i` UInt32,
    `s` String
)
ENGINE = MergeTree
ORDER BY i
SETTINGS replicated_deduplication_window=0