tritatuy opened this issue 3 months ago
Have you considered uploading CSVs with a specialized tool such as DBeaver, which provides much more flexibility? Then, just re-sync the schema in Metabase.
Additionally, it is preferable to use read-only user profiles for BI tools, while uploading a CSV requires Metabase to be connected via a non-read-only profile.
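As a sketch of that separation (all user names, passwords, and database names below are hypothetical, not from this thread):

```sql
-- Hypothetical sketch: a read-only ClickHouse user for regular BI queries,
-- and a separate user that is allowed to create tables and insert uploads.
CREATE USER bi_readonly IDENTIFIED BY 'changeme' SETTINGS readonly = 1;
GRANT SELECT ON default.* TO bi_readonly;

CREATE USER bi_uploader IDENTIFIED BY 'changeme';
GRANT SELECT, INSERT, CREATE TABLE ON sandbox.* TO bi_uploader;
```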
Thanks for your answer! We would like to have this feature for business users as well, because there are many cases where they need to upload some CSV data and work with it combined with data from the DWH, and DBeaver or some other external tool is not convenient for them.
Non-read-only access is also not a problem, because we can give Metabase access only to our Sandbox database on the ClickHouse host. So my request is to extend this feature to ClickHouse as well, not only MySQL or PostgreSQL.
This is a bit non-trivial due to these caveats:
To be fair, I am very hesitant to implement this, given that other well-known tools do that with specialized interfaces which will yield better results.
Thanks for your feedback. My suggestions are:
Something like this:
```sql
CREATE TABLE IF NOT EXISTS sandbox.file_name ON CLUSTER clickhouse_cluster
(
    col_name_1 type,
    ...,
    col_name_n type
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/sandbox.file_name', '{replica}');

-- ...

CREATE TABLE IF NOT EXISTS sandbox.file_name_dist ON CLUSTER clickhouse_cluster
(
    col_name_1 type,
    ...,
    col_name_n type
)
ENGINE = Distributed(clickhouse_cluster, sandbox, file_name);
```
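If I understand the suggestion correctly, reads and writes would then typically go through the Distributed table, which fans the query out over the shards (the column value here is purely illustrative):

```sql
-- Hypothetical usage of the suggested pair of tables:
-- inserts and selects target the Distributed table.
INSERT INTO sandbox.file_name_dist (col_name_1) VALUES ('some value');
SELECT count(*) FROM sandbox.file_name_dist;
```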
Of course, we as analysts can do it in DataGrip, PyCharm, and so on. But the key feature of Metabase is self-service, and we want our business users to be able to upload their own data without calling developers and analysts, because they are always a bottleneck.
According to the docs, only 2 connectors support Uploading functionality - PostgreSQL and MySQL.
@tritatuy, we might consider implementing the feature request, provided there is demand from the user base. Let's collect some feedback on the current issue.
If you are up to contribute to the project, @slvrtrn might give you some pointers and hints for implementing the functionality.
> According to the docs, only 2 connectors support Uploading functionality - PostgreSQL and MySQL. @tritatuy, we might consider implementing the feature request, provided there is demand from the user base. Let's collect some feedback on the current issue.
Ok, thanks! Let's wait for new requests.
I'd like to add a few more details here.
We have three different deployment types that we want to support:

- on-premise single node
- on-premise cluster
- ClickHouse Cloud
The DDLs for the tables created and specific required settings for the data insertion vary significantly based on the deployment types.
Here’s what needs to be done by the driver if we want to properly create a table and insert something there, supporting all three scenarios:
Introspect whether we have more than 1 node:

```sql
SELECT count(*) AS nodes_count
FROM system.clusters c
JOIN system.macros m ON c.cluster = m.substitution
```

Introspect whether this is ClickHouse Cloud (the query returns 1 when it is ClickHouse Cloud):

```sql
SELECT value AS is_cloud FROM system.settings WHERE name = 'cloud_mode'
```
With these two results, we have enough information to decide which path to take when dealing with the DDL and the inserts.
**On-premise single node**

DDL:

```sql
CREATE TABLE csv_upload (...) ENGINE MergeTree ORDER BY (...)
```

Insert: standard JDBC methods from MB will probably work.
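As a concrete (hypothetical) instance of the single-node path, with an illustrative two-column schema:

```sql
-- Hypothetical single-node upload table; the column list is illustrative.
CREATE TABLE csv_upload
(
    id   UInt64,
    name String
)
ENGINE = MergeTree
ORDER BY (id);

INSERT INTO csv_upload (id, name) VALUES (1, 'a'), (2, 'b');
```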
**On-premise cluster**

This is the most complicated scenario. First, introspect the macros defined on the nodes:

```sql
SELECT getMacro('cluster'), getMacro('replica'), getMacro('shard')
```
The statements should be executed with `wait_end_of_query=1`, so that we wait until the table is created on every node in the cluster, and with `insert_quorum=nodes_count`, which we got from one of the queries from the prerequisites.

```sql
CREATE TABLE csv_upload ON CLUSTER '{cluster}'
(...)
ENGINE ReplicatedMergeTree(
    '/clickhouse/{cluster}/tables/{database}/{table}/{shard}',
    '{replica}'
)
ORDER BY (...)
```
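With those settings, an insert in the cluster scenario might look like this (the column list and the quorum value are illustrative; `insert_quorum` should be the `nodes_count` obtained earlier, and `wait_end_of_query=1` would be passed as an HTTP interface parameter rather than in the statement itself):

```sql
-- Hypothetical cluster insert; 3 stands in for the introspected nodes_count.
INSERT INTO csv_upload (id, name)
SETTINGS insert_quorum = 3
VALUES (1, 'a'), (2, 'b');
```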
`insert_quorum` will ensure the data ends up on every node without relying on eventual consistency.

**ClickHouse Cloud**

This is mostly similar to the on-premise single node (CH Cloud takes care of the cluster macro, etc., in the DDL).
The DDL should be executed with `wait_end_of_query=1`, so that we wait until the table is created on every node in the cluster. We do not rely on quorum in the Cloud. However, to immediately get the data from any replica regardless of its sync status, it's better to also add `select_sequential_consistency=1`. This will guarantee that immediately after the insertion, we can query any node in the cluster and get the data back.

```sql
CREATE TABLE csv_upload (...) ENGINE MergeTree ORDER BY (...)
```
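For example, a read issued right after the upload could pass the setting explicitly (the table name and aggregate are illustrative):

```sql
-- Hypothetical read immediately after an insert in ClickHouse Cloud:
-- select_sequential_consistency=1 ensures the freshly inserted rows are
-- visible regardless of which replica serves the query.
SELECT count(*) FROM csv_upload
SETTINGS select_sequential_consistency = 1;
```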
I don't have time to implement all these scenarios, and I don't feel comfortable implementing support for only one deployment type.
However, if someone from the community wishes to take on this task, I can help with the setup, statements, etc. - you could always ping me here or directly in the community Slack.
(Edited by @slvrtrn to keep track of the progress of the feature implementation).
CSV uploads feature needs to support the following ClickHouse deployment types:

- on-premise single node
- on-premise cluster
- ClickHouse Cloud