ClickHouse / metabase-clickhouse-driver

ClickHouse database driver for the Metabase business intelligence front-end

Uploads CSV to ClickHouse #230

Open tritatuy opened 3 months ago

tritatuy commented 3 months ago

(Edited by @slvrtrn to keep track of the progress of the feature implementation).

The CSV uploads feature needs to support the following ClickHouse deployment types:

    • On-premise single node
    • On-premise cluster
    • ClickHouse Cloud

slvrtrn commented 3 months ago

Have you considered uploading CSV with specialized tools such as DBeaver, which will provide much more flexibility? Then, just re-sync the schema in Metabase.

Additionally, it is preferable to have read-only user profiles for the BI tools. Uploading a CSV will require Metabase to be connected via a non-read-only profile.

tritatuy commented 3 months ago

> DBeaver

Thanks for your answer! We would like to have this feature for business users as well, because there are many cases where they need to upload some CSV data and work with it combined with data from the DWH, and DBeaver or some other external tool is not convenient.

A non-read-only profile is also not a problem, because we can give Metabase access only to a sandbox database on the ClickHouse host. So my request is to extend this feature to ClickHouse as well, not only MySQL and PostgreSQL.

slvrtrn commented 3 months ago

This is a bit non-trivial due to these caveats:

  1. ClickHouse deployments differ: a single node, an on-premise cluster, and ClickHouse Cloud each require different DDL.
  2. Every MergeTree table needs an ORDER BY (sorting) key, and Metabase does not ask the user to choose one when uploading a CSV.
  3. On a cluster, a plain CREATE TABLE is not enough: the table has to be replicated so the uploaded data is visible on every node.

To be fair, I am very hesitant to implement this, given that other well-known tools do this with specialized interfaces, which will yield better results.

tritatuy commented 3 months ago

Thanks for your feedback. My suggestions are:

  1. Allow this feature only for cluster deployments.
  2. To be honest, I have no idea, but maybe the row number is good enough. Metabase doesn't ask about a primary key even when uploading a CSV to MySQL or PostgreSQL, so a solution must already exist.
  3. Create two tables: the first with the ReplicatedMergeTree engine, and the second a Distributed table pointing at the first. The Distributed table is the one the resulting Metabase model will query.

Something like this:

CREATE TABLE IF NOT EXISTS sandbox.file_name ON CLUSTER clickhouse_cluster
(
    col_name_1 type,
    ...,
    col_name_n type
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/sandbox.file_name', '{replica}')
-- ...

CREATE TABLE IF NOT EXISTS sandbox.file_name_dist ON CLUSTER clickhouse_cluster
(
    col_name_1 type,
    ...,
    col_name_n type
)
ENGINE = Distributed(clickhouse_cluster, sandbox, file_name)
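
In this scheme, the upload itself would presumably go through the Distributed table, which forwards rows to the ReplicatedMergeTree table underneath. A minimal sketch, assuming hypothetical columns id UInt64 and name String in place of the placeholders above; note that with more than one shard, the Distributed DDL would also need a sharding key for inserts to be accepted:

-- Sketch with invented columns; the Distributed engine forwards these rows
-- to sandbox.file_name (a sharding key is required with multiple shards).
INSERT INTO sandbox.file_name_dist (id, name)
VALUES (1, 'uploaded row'), (2, 'another row')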

Of course, we as analysts can do it in DataGrip, PyCharm, and so on. But the key feature of Metabase is self-service, and we want our business users to be able to upload their own data without calling developers and analysts, because they are always a bottleneck.

mshustov commented 2 months ago

According to the docs, only two connectors support the upload functionality: PostgreSQL and MySQL. @tritatuy, we might consider implementing the feature request, provided there is demand from the user base. Let's collect some feedback on the current issue. If you are up for contributing to the project, @slvrtrn might give you some pointers and hints for implementing the functionality.

tritatuy commented 2 months ago

> According to the docs, only two connectors support the upload functionality: PostgreSQL and MySQL. @tritatuy, we might consider implementing the feature request, provided there is demand from the user base. Let's collect some feedback on the current issue.

Ok, thanks! Let's wait for new requests.

slvrtrn commented 2 months ago

I'd like to add a bit more detail here.

We have three different deployment types that we want to support:

    • On-premise single node
    • On-premise cluster
    • ClickHouse Cloud

The DDL for the created tables and the specific settings required for data insertion vary significantly between these deployment types.

Here’s what needs to be done by the driver if we want to properly create a table and insert something there, supporting all three scenarios:

Prerequisites

  1. Introspect whether we have more than one node:

    SELECT count(*) AS nodes_count FROM system.clusters c JOIN system.macros m ON c.cluster = m.substitution
  2. Introspect whether this is ClickHouse Cloud (the query returns 1 when it is):

    SELECT value AS is_cloud FROM system.settings WHERE name='cloud_mode'
  3. Now, it is enough to decide which path we are taking when dealing with the inserts (see the combined sketch after this list):

    • is_cloud = 0, nodes_count = 1 -> on-premise single node (with a fair amount of certainty)
    • is_cloud = 0, nodes_count > 1 -> on-premise cluster (100%)
    • is_cloud = 1, ignore nodes_count -> CH Cloud
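
For orientation, both prerequisite queries can be combined into a single decision query. This is a minimal sketch, not driver code: the deployment_type labels are made up here, and it assumes the cloud_mode setting is present in system.settings (its value column is a String, hence the '1' comparison).

-- Sketch: classify the deployment using the two introspection queries above.
SELECT
    (SELECT value FROM system.settings WHERE name = 'cloud_mode') AS is_cloud,
    (SELECT count(*)
     FROM system.clusters AS c
     JOIN system.macros AS m ON c.cluster = m.substitution) AS nodes_count,
    multiIf(
        is_cloud = '1',  'clickhouse-cloud',
        nodes_count > 1, 'on-premise-cluster',
        'on-premise-single-node'
    ) AS deployment_type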

Connection/DDL/insert variants

On-premise single node

DDL:

CREATE TABLE csv_upload (...) ENGINE MergeTree ORDER BY (...)

Insert: standard JDBC methods from MB will probably work.
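
As an illustration, a concrete (hypothetical) instantiation of that DDL could look like the sketch below. The columns are invented; since Metabase does not ask the user for a primary key, ORDER BY tuple() is a possible fallback when no sorting key can be inferred from the CSV.

-- Hypothetical columns; ORDER BY tuple() means "no sorting key".
CREATE TABLE csv_upload
(
    id   UInt64,
    name String,
    ts   DateTime
)
ENGINE = MergeTree
ORDER BY tuple()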

On-premise cluster

This is the most complicated scenario.

  1. Introspect the macro values. We need all three:

    SELECT getMacro('cluster'), getMacro('replica'), getMacro('shard')
  2. Obtain the JDBC connection with two extra parameters: wait_end_of_query=1, so that we wait until the table is created on every node of the cluster, and insert_quorum=nodes_count, which we got from one of the prerequisite queries.
  3. The DDL is entirely different from a single-node setup:
    CREATE TABLE csv_upload ON CLUSTER '{cluster}'
    (...)
    ENGINE ReplicatedMergeTree(
        '/clickhouse/{cluster}/tables/{database}/{table}/{shard}',
        '{replica}'
    )
    ORDER BY (...)
  4. Insert as usual using the default JDBC methods from the MB sources. insert_quorum will ensure the data ends up on every node without relying on eventual consistency (a per-statement sketch follows this list).
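
If passing insert_quorum on the connection is inconvenient, ClickHouse also accepts it per statement. A sketch, assuming nodes_count = 3 and the hypothetical csv_upload columns from the single-node example:

-- Per-statement alternative to the connection-level setting; the value 3
-- stands in for the nodes_count obtained in the prerequisites.
INSERT INTO csv_upload (id, name, ts)
SETTINGS insert_quorum = 3
VALUES (1, 'uploaded row', now())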

CH Cloud

This is mostly similar to the on-premise single node (CH Cloud takes care of the cluster macros, etc., in the DDL).

  1. Obtain the JDBC connection with at least one extra parameter, wait_end_of_query=1, so that we wait until the table is created on every node of the cluster. We do not rely on quorum in the Cloud. However, to immediately get the data from any replica regardless of its sync status, it's better to also add select_sequential_consistency=1. This guarantees that immediately after the insertion, we can query any node in the cluster and get the data back (see the sketch after this list).
  2. DDL is similar to the single-node deployment:
    CREATE TABLE csv_upload (...) ENGINE MergeTree ORDER BY (...)
  3. Insert as usual using the default JDBC methods from the MB sources.
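
The per-query form of that read-after-insert guarantee would look like the following sketch (csv_upload is the hypothetical table from above):

-- Ensures this SELECT observes all previously committed inserts,
-- regardless of which replica happens to serve the query.
SELECT count(*)
FROM csv_upload
SETTINGS select_sequential_consistency = 1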

I don't have time to implement all these scenarios, and I don't feel comfortable implementing support for only one deployment type.

However, if someone from the community wishes to take on this task, I can help with the setup, statements, etc. - you could always ping me here or directly in the community Slack.

slvrtrn commented 1 month ago

1.5.0 supports CSV uploads with ClickHouse Cloud (props to @calherries). I will add proper support for on-premise clusters after 0.50.x.