ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0

Roadmap 2024 (discussion) #58392

Open alexey-milovidov opened 6 months ago

alexey-milovidov commented 6 months ago

This is ClickHouse roadmap 2024. Descriptions and links are to be filled.

This roadmap does not cover the tasks related to infrastructure, orchestration, documentation, marketing, external integrations, drivers, etc.

See also:

Roadmap 2023: https://github.com/ClickHouse/ClickHouse/issues/44767
Roadmap 2022: https://github.com/ClickHouse/ClickHouse/issues/32513
Roadmap 2021: https://github.com/ClickHouse/ClickHouse/issues/17623
Roadmap 2020: link

SQL Compatibility

- ✔️ Enable Analyzer by default
- Non-constant CASE, non-constant IN
- Remove old predicate pushdown mechanics
- Correlated subqueries with decorrelation
- Transforming anti-join: LEFT JOIN ... WHERE ... IS NULL to NOT IN
- Deriving index condition from the right-hand side of INNER JOIN
- JOINs reordering and extended pushdown
- Time data type
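The anti-join item can be illustrated with a small sketch (the table and column names `left_table`, `right_table`, and `id` are hypothetical). The two forms below are equivalent for non-nullable join keys, which is the precondition such a rewrite would rely on:

```sql
-- Anti-join spelled as an outer join plus a NULL filter:
SELECT l.id
FROM left_table AS l
LEFT JOIN right_table AS r ON l.id = r.id
WHERE r.id IS NULL;

-- The form the rewrite would target (equivalent when id is not Nullable):
SELECT id
FROM left_table
WHERE id NOT IN (SELECT id FROM right_table);
```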

Data Storage

- ✔️ Userspace page cache
- ✔️ Adaptive mode for asynchronous inserts
- ✔️ Semistructured Data: Variant data type
- Semistructured Data: Sharded Maps
- Semistructured Data: JSON data type
- Transactions for Replicated tables
- Lightweight Updates v2
- Uniform treatment of LowCardinality, Sparse, and Const columns
- Settings to control the consistency of projections on updates
- Replicated Catalog :cloud:
- On-disk storage for Keeper
- Query cache on disk
- ✔️ Decoupling of object storages and metadata
- Full-text indices (production readiness)
- Vector search indices (production readiness)
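For context on the Variant item, a minimal sketch of how the experimental type looks in early 24.x releases (the setting name and the `variantType` helper are from the experimental feature; exact syntax may change before it is stabilized):

```sql
SET allow_experimental_variant_type = 1;

-- A column that can hold either a UInt64 or a String per row.
CREATE TABLE events (v Variant(UInt64, String)) ENGINE = Memory;
INSERT INTO events VALUES (1), ('hello');

-- variantType() reports which alternative each row holds.
SELECT v, variantType(v) FROM events;
```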

Security, access control, and isolation

- ✔️ Definers (encapsulation of access control) for views
- Warnings and limits on the number of database objects
- Dynamic configuration of query handlers
- JWT authentication :cloud:
- Data masking in row-level security :cloud:
- Secure storage for named collections :cloud:
- Cancellation points for long operations
- Resource scheduler (continuation)

Query Processing

- Parallel replicas with task callbacks (production readiness)
- Parallel replicas with parallel distributed INSERT SELECT
- Automatic usage of -Cluster table functions
- Adaptive thresholds for data spilling on disk
- Optimization with subcolumns by default

Interfaces & External Data

- Support for Iceberg Data Catalog
- Support for Hive-style partitioning
- Explicit queries in external tables
- Even simpler data upload
- HTTP API for simple query construction
- Unification of data lake and file-like functions

Testing & Hardening

- Revive coverage
- Fuzzer of data formats
- Fuzzer of network protocols
- Server-side AST query fuzzer
- Generic fuzzer for query text
- Randomization of DETACH/ATTACH in tests
- Integration with SQLSmith
- Embedded documentation

Experiments & Research

- Multi-RAFT for Keeper
- MaterializedPostgreSQL (production readiness)
- SSH protocol for the server
- Support for PromQL
- Streaming queries
- Freeform text format
- Key-value data marts
- Decoupling of columns and buffers
- Lazy reading of ranges
- Instant attaching tables from backups
- An object storage to borrow space from the filesystem cache
- COW disks
- ALTER PRIMARY KEY
- Autocompletion with language models
- Decentralized tables
- Unique Key Constraint


The roadmap covers the top focus items for both external contributors and full-time ClickHouse employees. The items marked with the :cloud: icon are meant for ClickHouse Cloud (proprietary). We expect 50..80% completion of the roadmap according to the results from previous years.

alanpaulkwan commented 6 months ago

"HTTP API for simple query construction" - would be really awesome if Python/R/DuckDB could read an arbitrary output / filtered table like S3 or just a file download. Amazing.

Also glad to see join reordering on the list still.

chenziliang commented 6 months ago

Would like to see streaming processing become mainstream in ClickHouse :)

ahmed-adly-khalil commented 6 months ago

Would be great to see deeper NATS integration, mainly using JWT auth.

ucasfl commented 6 months ago

Support for Iceberg Data Catalog

A catalog is a good concept, but if we want to introduce catalogs into ClickHouse, we need to refactor the current metadata structure from Database -> Table to Catalog -> Database -> Table. Then all tables created with built-in engines would live under an internal_catalog, and we could have an iceberg_catalog, hive_catalog, and so on. I'm not sure it's a good idea.

Besides, we still face the difficulty that we don't have an Iceberg C++ API.

olly-writes-code commented 6 months ago

Hi! Adding a request for more GIS features. One important one is the ability to transform a geometry to a specified SRID. Redshift docs here

alexey-milovidov commented 6 months ago

@ucasfl, we can support mapping a specific database from the Iceberg data catalog as a database in ClickHouse. In this way, we don't have to map a whole catalog at once but allow doing it database by database.

alexey-milovidov commented 6 months ago

@ahmed-adly-khalil, while NATS is not on the main list (it's nice to have), there are some items we have already started: https://github.com/ClickHouse/ClickHouse/issues/39459

alexey-milovidov commented 6 months ago

@alanpaulkwan, yes, in the simplest form it represents a table like a file: https://github.com/ClickHouse/ClickHouse/issues/46925 but it also allows customizing the result.

jrdi commented 6 months ago

Would be great to know if there are plans to keep working on improving zero-copy and Cloud Storage during 2024. I'm constantly seeing improvements and bug fixes, which is great. Will we see zero-copy ready for production this year?

Decoupling of object storages and metadata

Does this mean moving metadata from disks to Keeper or any other shared store?

alexey-milovidov commented 6 months ago

@jrdi

Would be great to know if there are plans to keep working on improving zero-copy and Cloud Storage during 2024. I'm constantly seeing improvements and bug fixes. Will we see zero-copy ready for production this year?

We have to fix the issues in zero-copy replication because it is still tested in CI, and used in production on older services in ClickHouse Cloud. For example, issues like this are found: https://github.com/ClickHouse/ClickHouse/pull/58333. But the track record of zero-copy replication is not good, and we expect to stop using it, then remove it from CI, and keep it on life support without further changes.

Does this mean moving metadata from disks to Keeper or any other shared store?

This is https://github.com/ClickHouse/ClickHouse/pull/58357

We currently have the following metadata options:

  1. Metadata on local filesystem (s3).
  2. No separate metadata (s3_plain).
  3. Metadata in .index files in directories (web).
  4. Metadata in a backup.
  5. Metadata in Keeper (proprietary).
  6. and more to come, e.g. https://github.com/ClickHouse/ClickHouse/issues/58347

And, we have the following object storage options:

  1. S3.
  2. HDFS.
  3. Azure.
  4. Web.
  5. Local filesystem.
  6. Borrowing space from the filesystem cache.

The task is to allow the cross-product of these options.
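As a concrete illustration of picking one point in that cross-product, ClickHouse already allows defining a disk inline at table creation; the sketch below combines the `s3_plain` metadata option with S3 object storage (the endpoint and credentials are placeholders):

```sql
CREATE TABLE t (x UInt64)
ENGINE = MergeTree
ORDER BY x
SETTINGS disk = disk(
    type = 's3_plain',
    endpoint = 'https://my-bucket.s3.amazonaws.com/data/',
    access_key_id = '...',
    secret_access_key = '...');
```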

chenziliang commented 6 months ago

Would be great to see deeper NATS integration, mainly using JWT auth.

@ahmed-adly-khalil may I ask if you'd like streaming processing/analytics against NATS via ClickHouse?

jrdi commented 6 months ago

Thanks, @alexey-milovidov!

But the track record of zero-copy replication is not good, and we expect to stop using it, then remove it from CI, and keep it on life support without further changes.

I can understand this decision, but it's a pity. It means the open-source version won't have a production-ready way to separate compute and storage. Do you think this could change in the short/mid term? Even a plan, with ClickHouse's help and guidance, for improvements that external contributors could maintain sounds better than keeping the feature out of CI.

alexey-milovidov commented 6 months ago

It is not guaranteed and not in the plans, but we might have an implementation in the future; the only thing for sure is that it will not be based on zero-copy replication.

bputt-e commented 6 months ago

Unique Key Constraint would be great, could remove our deduplication step in our processing pipeline

mbtolou commented 6 months ago

Unique Key Constraint is great idea.

alanpaulkwan commented 6 months ago

I really like the unique key idea. I hope it can follow ReplacingMergeTree and let the user decide which row to keep. The options seem to be (1) the incumbent entry, (2) the newest entry, or (3) an integer describing version priority. I've created arbitrary values to keep the "best value", which allows for non-standard logic like keeping the value that minimizes the difference between two timestamps, with some case-by-case logic.
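For reference, option (3) is what ReplacingMergeTree already supports via an optional version column (the table and column names below are illustrative):

```sql
CREATE TABLE dedup
(
    key UInt64,
    value String,
    ver UInt32
)
ENGINE = ReplacingMergeTree(ver)  -- among rows with equal sorting key, the one with max ver survives merges
ORDER BY key;

-- Merges are asynchronous; FINAL forces deduplication at query time.
SELECT * FROM dedup FINAL;
```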

mwarkentin commented 5 months ago

@jrdi there's a proposal from Altinity here: https://github.com/ClickHouse/ClickHouse/issues/54644

earlev4 commented 5 months ago

A huge thank you to ClickHouse team and all the contributors for the amazing work on ClickHouse! I sincerely appreciate it.

Just curious. The Roadmap 2023 mentioned a "Recursive CTE" task, but I do not see it mentioned in the Roadmap 2024. Are there plans to implement recursive CTEs in the future?

Thanks again!

alexey-milovidov commented 5 months ago

@earlev4, it was planned for last year after enabling the Analyzer, but we didn't manage to enable the Analyzer on that schedule, so I've added it as the major item for 2024, and I'm afraid to add recursive CTEs on top of that. We are considering them for implementation, but they are not on the list of main items.

earlev4 commented 5 months ago

Thanks so much, @alexey-milovidov! I sincerely appreciate the detailed response. It is very helpful. I am very grateful to you and the team for ClickHouse!

domainio commented 5 months ago

Hi, is there a plan to support writing to Apache Iceberg with the MERGE operation?

zheyu001 commented 5 months ago

Is there any chance of supporting Iceberg v2, or evolved schemas?

1392657590 commented 5 months ago

Do you have time to resolve the issue that, in high-concurrency scenarios, the performance of ClickHouse Keeper is lower than that of ZooKeeper? We found this issue when replacing ZooKeeper with Keeper; the replacement plan has been temporarily suspended.

JackyWoo commented 5 months ago

@1392657590 maybe you can try RaftKeeper

jiugem commented 5 months ago

I don't see MaterializedMySQL in the Roadmap

zhanglistar commented 5 months ago

@alexey-milovidov What about non-equi joins? Any plans? Thanks.

chrisgoddard commented 4 months ago

Would be super excited to see production support for vector search indices, especially on ClickHouse Cloud. Every week it seems like there's a different vector database, and I can't wait until I can just use ClickHouse for everything.

xevix commented 4 months ago

Major performance bottlenecks on my side would be alleviated by materialized CTEs, which avoid DB round trips to create many intermediate results in tables. DuckDB added this last year, which was great to see; I was wondering if ClickHouse was thinking about this as well: https://github.com/ClickHouse/ClickHouse/issues/53449.

Fantastic work on recent releases, with usability features like ORDER BY ALL making it faster to query things ad hoc, and tight S3 integration with system credentials just magically pulled in 🎉

guoxiaolongzte commented 4 months ago

Support for Hive style partitioning.

Does it support dynamic Hive partition writing?

guoxiaolongzte commented 4 months ago

Support for Iceberg Data Catalog

A catalog is a good concept, but if we want to introduce catalogs into ClickHouse, we need to refactor the current metadata structure from Database -> Table to Catalog -> Database -> Table. Then all tables created with built-in engines would live under an internal_catalog, and we could have an iceberg_catalog, hive_catalog, and so on. Not sure it's a good idea.

Besides, we still face the difficulty that we don't have an Iceberg C++ API.

@alexey-milovidov When do we plan to support the hive catalog and hudi catalog?

mingmwang commented 4 months ago

@1392657590 maybe you can try RaftKeeper

@JackyWoo

Do you have some benchmark data to share between the performance and throughput of ClickHouse Keeper and RaftKeeper ?

JackyWoo commented 4 months ago

@mingmwang we haven't compared them yet; we have only compared RaftKeeper with ZooKeeper.

softiger commented 4 months ago

@JackyWoo could you share the comparison result between Zookeeper and RaftKeeper? I'm really interested in it, thanks!

JackyWoo commented 4 months ago

@softiger You can find it here. Let's talk about RaftKeeper here.

immelnikoff commented 3 months ago

I would like to see binary search for pre-sorted arrays in the following functions: has(), hasAny(), hasAll(), arrayIntersect(). In general, I would like to see accelerated functions for pre-sorted arrays.

wordhardqi commented 3 months ago

ClickHouse could plan a query optimizer for complex queries.

wordhardqi commented 3 months ago

StarRocks, Snowflake, and ByConity have one.

Dileep-Dora commented 3 months ago

I see from the 2023 roadmap that the inverted indices implementation is not a priority: https://github.com/ClickHouse/ClickHouse/pull/38667.

Are we considering it for this year, or are there other plans to improve text-search performance?

alexey-milovidov commented 3 months ago

So far, there is a prototype implementation of inverted indices (you can unlock it with allow_experimental_inverted_index). It is not ready, should not be used in production, and has not been tested on realistic datasets.
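For anyone who wants to experiment with the prototype anyway, a sketch of the experimental syntax (the names `docs` and `body` are illustrative; the `inverted(0)` parameter selects plain tokenization rather than n-grams):

```sql
SET allow_experimental_inverted_index = 1;

CREATE TABLE docs
(
    id UInt64,
    body String,
    INDEX body_idx body TYPE inverted(0) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY id;

-- Token predicates such as hasToken() can use the index to skip granules.
SELECT count() FROM docs WHERE hasToken(body, 'clickhouse');
```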

Dileep-Dora commented 3 months ago

@alexey-milovidov thanks for the reply. Yes, we've tried this experimental feature, but performance was not up to the mark, hence checking whether there are any plans for this in 2024.

johnpyp commented 2 months ago

Curious about prioritizing support for ORDER BY optimizations on projections. This is the one thing holding my team back from using ClickHouse for WHERE-query use cases where we want to replicate the ease of use and flexibility of traditional database indices.

We'd love to be able to create potentially many projections on top of one table with varied combinations of different ORDER BY query optimizations and WHERE query optimizations.
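For reference, projections already allow an alternative ORDER BY inside one table; the open question in this comment is how well queries can exploit them. A sketch (the names `hits`, `user_id`, and `event_time` are hypothetical):

```sql
-- Store a second copy of the data sorted by a different key.
ALTER TABLE hits ADD PROJECTION by_user
(
    SELECT * ORDER BY user_id, event_time
);

-- Build the projection for existing parts; new inserts maintain it automatically.
ALTER TABLE hits MATERIALIZE PROJECTION by_user;
```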

anvaari commented 1 month ago

Is there any plan to support deserializing Protobuf through a schema registry? It's needed.

callicles commented 1 month ago

On full-text search indices: will that include things like TF-IDF, lemmatization, stop-word removal, etc.? (Mentioned in the blog post.)

ahmed-adly-khalil commented 4 weeks ago

Can we please keep the JSON type https://clickhouse.com/docs/en/sql-reference/data-types/json ? It's been useful to us in many ways, and it's sad that it's going to be removed. Ref: https://clickhousedb.slack.com/archives/CU478UEQZ/p1717653663312899

ahmed-adly-khalil commented 4 weeks ago


Correcting myself: there is amazing work being done to revamp the JSON type; details here: https://github.com/ClickHouse/ClickHouse/issues/54864

alexey-milovidov commented 5 days ago

@earlev4 Recursive CTEs have been added in 24.4.
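A minimal example of the 24.4 syntax, using the standard WITH RECURSIVE form:

```sql
WITH RECURSIVE t AS
(
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM t WHERE n < 10
)
SELECT sum(n) FROM t;  -- 1 + 2 + ... + 10 = 55
```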

dragon-wzl commented 20 hours ago

@alexey-milovidov Is Lightweight Updates v2 a feature of ClickHouse Cloud or the community version?

alexey-milovidov commented 9 hours ago

Lightweight updates are both for the Cloud and open-source.