ClickHouse / metabase-clickhouse-driver

ClickHouse database driver for the Metabase business intelligence front-end
Apache License 2.0
461 stars 84 forks source link

CSV uploads for ClickHouse Cloud #236

Closed calherries closed 1 month ago

calherries commented 1 month ago

Summary

Issue tracking CSV uploads: https://github.com/ClickHouse/metabase-clickhouse-driver/issues/230

This PR adds support for the uploads driver feature for ClickHouse. The feature is currently only supported for ClickHouse Cloud. It requires Metabase x.49.11 x.49.12 to work correctly.

Limitations:

Checklist

Delete items not relevant to your PR:

CLAassistant commented 1 month ago

CLA assistant check
All committers have signed the CLA.

calherries commented 1 month ago

@slvrtrn I have a few questions: (1) Is there a way to run tests with ClickHouse Cloud deployments? (2) Is there an alternative way to determine the number of replicas without using system.macros, as you suggested here? I can create a ClickHouse cloud user without sufficient privileges to select from the system.macros table.

Executing this query

SELECT count(*) AS nodes_count FROM system.clusters c JOIN system.macros m ON c.cluster = m.substitution

gives me this error

> Not enough privileges. To execute this query, it's necessary to have the grant SELECT(substitution) ON system.macros

(3) If there's no solution to (2), would you be willing to accept this contribution if it only worked for ClickHouse Cloud in the meantime? That's the deployment type we care about most. With this commit I've updated the PR so we only support ClickHouse Cloud.

slvrtrn commented 1 month ago

(1) — You can provide the hostname and port via the env variables; IIRC, it won't work correctly unless you add wait_end_of_query=1 to the JDBC connection string, cause it will not wait until the tables are created on all replicas, and the driver might immediately call the wrong one.

(2) - I am only aware of the mentioned method

I can create a ClickHouse cloud user without sufficient privileges to select from the system.macros table.

You don't need this in CH Cloud. It's necessary only for on-premise. But yes, these grants are one of the main pain points.

(3) It will work for single-node and ClickHouse Cloud, then. But we need a way to disable it for the on-premise cluster if it's detected. Can we do that dynamically? It's the feature that is enabled or disabled on the driver level, not on the connection level.

calherries commented 1 month ago

(1) Thanks, I can run the tests locally with ClickHouse Cloud. I should have been more specific with my question: Is there a way to run the tests that run with Github Actions with ClickHouse Cloud deployments?

(2) Ok thanks for clarifying

(3) How can we know whether it's an on-premise cluster vs on-premise single node without getting the node_count? You only suggested an approach using node_count in your comment here.

is_cloud = 0, nodes_count = 1 -> on-premise single node

calherries commented 1 month ago

@slvrtrn tagging you because you might have missed my last comment above, where I didn't tag you

slvrtrn commented 1 month ago

@calherries (1) - currently, no. We need to add this; I will do that as a follow-up (after both this and #232 are merged). (3) - this is the best option I am aware of. I will clarify if there is anything else available, but chances are that that's it. The grants issue is annoying, but it's necessary only for on-premise deployments (and not CH Cloud, as you don't need this query there); we can add a doc entry about the required grants (after we add on-prem clusters support as a follow-up PR).

I have another question, though. Is there a way to disable a particular feature for a specific connection? These are defined in the driver/database-supports? override, but a cleaner solution would've been to enable it based on the deployment type (i.e. after we have the connection details). This is useful in this PR atm and also in #232 cause the impersonation (another feature defined via an override) should be enabled only on the most recent CH version.

calherries commented 1 month ago

(3) Okay, for now I'll support CH Cloud only.

Is there a way to disable a particular feature for a specific connection?

Yes, we generally use driver/database-supports? for this, you can have the result depend on the version or the connection details. See this example with MongoDB. I've also done a similar thing for this PR in the last commit 6b8935b.

I think this PR is ready to be reviewed now, but we'll need to wait for the changes in this PR to be released in the 49 branch before there is a released Metabase version that is compatible with this.

calherries commented 1 month ago

For this:

A human-readable description of the changes was provided to include in CHANGELOG

What version should these changes under in the changelog?

slvrtrn commented 1 month ago

What version should these changes under in the changelog?

I will add this, don't worry. Otherwise, there will be conflicts with #232

you can have the result depend on the version or the connection details. See this example with MongoDB. I've also done a similar thing for this PR in the last commit 6b8935b.

Thanks, this is great.

slvrtrn commented 1 month ago

btw

Resolves https://github.com/ClickHouse/metabase-clickhouse-driver/issues/230

It doesn't; IIUC, the user that requested this feature, uses on-prem.

slvrtrn commented 1 month ago

@calherries, some tests are failing on the CI.

There is one more thing. In my issue comment, I mentioned

To immediately get the data from any replica regardless of its sync status, it’s also better to add select_sequential_consistency=1. This will guarantee that immediately after the insertion, we can query any node in the cluster and get the data back.

This is an absolute necessity if we add the Cloud test run cause we need to immediately get the test data back after the table is created; otherwise, we will experience weird flakiness during the runs. This also can happen in real-world usage (albeit more rarely).

calherries commented 1 month ago

@slvrtrn

select_sequential_consistency=1.

I struggled with this, because we're using a connection pool and don't have any way to set settings in a SQL query. I've made an attempt here but I'm concerned about the performance implications of having this set for every query, even if it's only needed for queries on uploaded tables with CH Cloud. set select_sequential_consistency in connection details

update: I've implemented a workaround with Avoid setting select_sequential_consistency for on-premise

slvrtrn commented 1 month ago

Let's bump the MB version to 0.49.11 in .github/workflows/check.yml#L20 and merge if the CI is green.