Open alexey-milovidov opened 6 months ago
"HTTP API for simple query construction" - would be really awesome if Python/R/DuckDB could read an arbitrary output / filtered table like S3 or just a file download. Amazing.
Also glad to see join reordering on the list still.
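To make the HTTP-API idea above concrete, here is a minimal Python sketch of how a client (Python/R/DuckDB-style) could pull a filtered result set through ClickHouse's existing HTTP interface (port 8123) as a plain file-like download. The host name and query are hypothetical placeholders, and the response is parsed from a canned CSVWithNames payload rather than a live server call.

```python
import csv
import io
from urllib.parse import urlencode

def build_query_url(host: str, query: str) -> str:
    """Build a URL for ClickHouse's HTTP interface (default port 8123)."""
    return f"http://{host}:8123/?{urlencode({'query': query})}"

def parse_csv_with_names(payload: str) -> list[dict]:
    """Parse a response produced with FORMAT CSVWithNames into row dicts."""
    return list(csv.DictReader(io.StringIO(payload)))

# Hypothetical usage: fetch build_query_url(...) with any HTTP client,
# then feed the body to parse_csv_with_names().
url = build_query_url("example-host", "SELECT id, value FROM t FORMAT CSVWithNames")
rows = parse_csv_with_names('"id","value"\n"1","a"\n"2","b"\n')
```

Any tool that can read a URL as a file (pandas, R, DuckDB's httpfs) can consume such an endpoint directly.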
Would like to see streaming processing become mainstream in ClickHouse :)
Would be great to see deeper NATS integration, mainly using JWT auth.
Support for Iceberg Data Catalog
Catalog is a good concept, but if we want to introduce catalogs into ClickHouse, we need to refactor the current metadata structure from Database -> Table to Catalog -> Database -> Table. Then all tables created with built-in engines will go under an internal_catalog, and we can have iceberg_catalog, hive_catalog, and so on. Not sure it's a good idea.
Besides, we still face the difficulty that we don't have an Iceberg C++ API.
Hi! Adding a request for more GIS features. One important one is the ability to transform a geometry to a specified SRID. Redshift docs here
@ucasfl, we can support mapping a specific database from the Iceberg data catalog as a database in ClickHouse. In this way, we don't have to map a whole catalog at once but allow doing it database by database.
@ahmed-adly-khalil, while NATS is not on the main list (but nice to have), there are some items that we already started to do: https://github.com/ClickHouse/ClickHouse/issues/39459
@alanpaulkwan, yes, in the simplest form it represents a table like a file: https://github.com/ClickHouse/ClickHouse/issues/46925 but it also allows customizing the result.
Would be great to know if there are plans to keep working on improving zero-copy and Cloud Storage during 2024. I'm constantly seeing improvements and bug fixes which is super good. Will we see zero-copy ready for production this year?
Decoupling of object storages and metadata
Does this mean moving metadata from disks to Keeper or any other shared store?
@jrdi

> Would be great to know if there are plans to keep working on improving zero-copy and Cloud Storage during 2024. I'm constantly seeing improvements and bug fixes. Will we see zero-copy ready for production this year?

We have to fix the issues in zero-copy replication because it is still tested in CI and used in production on older services in ClickHouse Cloud. For example, issues like this are found: https://github.com/ClickHouse/ClickHouse/pull/58333. But the track record of zero-copy replication is not good, and we expect to stop using it, then remove it from CI, and keep it on life support without further changes.

> Does this mean moving metadata from disks to Keeper or any other shared store?

This is https://github.com/ClickHouse/ClickHouse/pull/58357
We currently have the following metadata options:
And, we have the following object storage options:
The task is to allow the cross-product of these options.
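A toy sketch of what that cross-product could look like: metadata and object storage sit behind independent interfaces, so any metadata backend (local disk, Keeper, ...) can be composed with any object storage backend (S3, Azure, local, ...). All class and method names here are illustrative, not the actual ClickHouse abstractions.

```python
from dataclasses import dataclass

class MetadataStore:
    """Maps logical file paths to object keys (could live on disk, in Keeper, ...)."""
    def __init__(self):
        self._map: dict[str, str] = {}
    def put(self, path: str, key: str): self._map[path] = key
    def get(self, path: str) -> str: return self._map[path]

class ObjectStorage:
    """Stores immutable blobs by opaque key (could be S3, Azure, local, ...)."""
    def __init__(self):
        self._blobs: dict[str, bytes] = {}
    def write(self, key: str, data: bytes): self._blobs[key] = data
    def read(self, key: str) -> bytes: return self._blobs[key]

@dataclass
class Disk:
    """Any metadata backend composed with any object storage backend."""
    metadata: MetadataStore
    storage: ObjectStorage
    def write_file(self, path: str, data: bytes):
        key = f"blob-{len(self.storage._blobs)}"  # generate an opaque object key
        self.storage.write(key, data)
        self.metadata.put(path, key)
    def read_file(self, path: str) -> bytes:
        return self.storage.read(self.metadata.get(path))
```

Because the two interfaces don't know about each other, adding one new metadata backend makes it usable with every object storage, and vice versa.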
> Would be great to see deeper NATS integration, mainly using JWT auth.

@ahmed-adly-khalil may I ask if you'd like streaming processing / analytics against NATS via ClickHouse?
Thanks, @alexey-milovidov!
> But the track record of zero-copy replication is not good, and we expect to stop using it, then remove it from CI, and keep it on life support without further changes.

I can understand this decision, but it's a pity. It means that the open-source version won't have a production-ready way to separate compute and storage. Do you think this could change in the short/mid term? Even a plan, with ClickHouse's help and guidance, for improvements that external contributors could make to keep the feature supported sounds better than keeping it out of CI.
It is not guaranteed and not in the plans, but we might have an implementation in the future; the only thing for sure is that it will not be based on zero-copy replication.
Unique Key Constraint
would be great; it could remove the deduplication step from our processing pipeline
Unique Key Constraint
is a great idea.
I really like the unique key idea - hope it can follow ReplacingMergeTree and allow user to decide which row entry to keep. For me the options seem to be (1) incumbent data entry, (2) newest data entry, (3) an integer describing version priority. I've created arbitrary values to keep the "best value", which allows for non-standard logic like keeping the value that minimizes the difference in two timestamps with some case-by-case logic.
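The keep-which-row options above can be sketched as a small deduplication routine. This is only an illustration of the three policies (keep the incumbent, keep the newest, keep the highest version), in the spirit of ReplacingMergeTree's optional version column, not ClickHouse's implementation.

```python
def dedup(rows, key, keep="last", version=None):
    """Collapse rows sharing the same key according to a keep policy:
    "first" keeps the incumbent entry, "last" keeps the newest entry,
    "version" keeps the entry with the highest value in the version column."""
    out = {}
    for row in rows:
        k = row[key]
        if k not in out:
            out[k] = row
        elif keep == "last":
            out[k] = row
        elif keep == "version" and row[version] >= out[k][version]:
            out[k] = row
        # keep == "first": leave the incumbent row in place
    return list(out.values())
```

The "best value" logic described above would be a fourth policy: replace the incumbent whenever a user-supplied comparison function prefers the new row.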
@jrdi there's a proposal from Altinity here: https://github.com/ClickHouse/ClickHouse/issues/54644
A huge thank you to ClickHouse team and all the contributors for the amazing work on ClickHouse! I sincerely appreciate it.
Just curious. The Roadmap 2023 mentioned a "Recursive CTE" task, but I do not see it in the Roadmap 2024. Are there plans to implement recursive CTEs in the future?
Thanks again!
@earlev4, it was planned for the previous year, after enabling the Analyzer, but we didn't manage to enable the Analyzer on that schedule, so I've added it as the major item for 2024. I'm afraid to add recursive CTEs to the list: we are considering them for implementation, but they are not among the main items.
Thanks so much, @alexey-milovidov! I sincerely appreciate the detailed response. It is very helpful. I am very grateful to you and the team for ClickHouse!
Hi, is there a plan to support writing to Apache Iceberg with the MERGE operation?
Is there any chance to support Iceberg v2, or schema evolution?
Do you have time to resolve "In high-concurrency scenarios, the performance of ClickHouse Keeper is lower than that of ZooKeeper"? We found this issue when replacing ZooKeeper with Keeper; the replacement plan has been temporarily suspended.
@1392657590 maybe you can try RaftKeeper
I don't see MaterializedMySQL in the Roadmap
@alexey-milovidov What about non-equi joins? Any plans? Thanks.
Would be super excited to see production support for vector search indices, especially on ClickHouse Cloud. Every week it seems like there's a different vector database, and I can't wait until I can just use ClickHouse for everything.
I would have major performance bottlenecks alleviated by materialized CTEs, which would avoid DB round trips to create many intermediate results in tables. DuckDB added this last year, which was great to see; I was wondering whether ClickHouse is thinking about this as well: https://github.com/ClickHouse/ClickHouse/issues/53449.
Fantastic work on recent releases, with usability features like ORDER BY ALL making it faster to query things ad hoc, and tight S3 integration with system credentials just magically pulled in 🎉
Support for Hive style partitioning.
Does it support dynamic Hive partition writing?
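For context, Hive-style partitioning encodes partition column values into key=value path segments, so writers can fan rows out dynamically and readers can prune files from the path alone. A minimal sketch (the bucket and table names are illustrative):

```python
from urllib.parse import quote, unquote

def partition_path(base: str, partition: dict) -> str:
    """Build a Hive-style path: base/key1=value1/key2=value2."""
    segments = "/".join(f"{k}={quote(str(v))}" for k, v in partition.items())
    return f"{base}/{segments}"

def parse_partition(path: str) -> dict:
    """Recover partition column values from a Hive-style path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            k, v = segment.split("=", 1)
            parts[k] = unquote(v)
    return parts
```

Dynamic partition writing then means routing each inserted row to partition_path(base, row's partition columns) at write time.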
@alexey-milovidov When do we plan to support the Hive catalog and the Hudi catalog?
> @1392657590 maybe you can try RaftKeeper

@JackyWoo
Do you have any benchmark data to share comparing the performance and throughput of ClickHouse Keeper and RaftKeeper?
@mingmwang we haven't compared them yet; we have only compared RaftKeeper with ZooKeeper.
@JackyWoo could you share the comparison results between ZooKeeper and RaftKeeper? I'm really interested in it, thanks!
I would like to see binary search for pre-sorted arrays in the following functions: has(), hasAny(), hasAll(), arrayIntersect(). In general, I would like to see accelerated functions for pre-sorted arrays.
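The idea can be sketched with Python's bisect as a stand-in: once an array is known to be pre-sorted, each membership test drops from O(n) to O(log n), and intersection becomes a linear merge. This is only an illustration of the requested behavior, not the ClickHouse functions themselves.

```python
from bisect import bisect_left

def sorted_has(arr, x):
    """Binary-search membership test on a pre-sorted array."""
    i = bisect_left(arr, x)
    return i < len(arr) and arr[i] == x

def sorted_has_any(arr, xs):
    return any(sorted_has(arr, x) for x in xs)

def sorted_has_all(arr, xs):
    return all(sorted_has(arr, x) for x in xs)

def sorted_intersect(a, b):
    """Merge-style intersection of two pre-sorted arrays, O(len(a)+len(b))."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```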
ClickHouse could plan a query optimizer for complex queries. StarRocks, Snowflake, and ByConity have one.
I see from the 2023 roadmap that the inverted indices implementation was not a priority: https://github.com/ClickHouse/ClickHouse/pull/38667. Are we considering it for this year, or are there other plans to improve text search performance?
So far, there is a prototype implementation of inverted indices (you can unlock it with allow_experimental_inverted_index). It is not ready and should not be used in production; it was not tested on realistic datasets.
@alexey-milovidov thanks for the reply. Yes, we've tried this experimental feature, but the performance was not up to the mark, hence checking whether we have any plans for this in 2024.
Curious about prioritizing ORDER BY optimizations for projections. This is the one thing holding my team back from using ClickHouse for WHERE-query use cases where we want to replicate the ease of use and flexibility of traditional database indices.
We'd love to be able to create potentially many projections on top of one table, with varied combinations of ORDER BY and WHERE query optimizations.
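The intuition behind multiple projections can be sketched as keeping redundant copies of a table in different sort orders and routing each lookup to the copy whose order matches the query, so it can binary-search instead of scanning. This is an illustration of the concept, not ClickHouse internals.

```python
from bisect import bisect_left, bisect_right

class ProjectedTable:
    def __init__(self, rows):
        self.rows = rows
        self.projections = {}  # sort column -> rows sorted by that column

    def add_projection(self, key):
        """Materialize a redundant copy of the table sorted by `key`."""
        self.projections[key] = sorted(self.rows, key=lambda r: r[key])

    def lookup(self, key, value):
        rows = self.projections.get(key)
        if rows is None:  # no matching projection: fall back to a full scan
            return [r for r in self.rows if r[key] == value]
        keys = [r[key] for r in rows]
        lo, hi = bisect_left(keys, value), bisect_right(keys, value)
        return rows[lo:hi]
```

Each added projection trades storage and insert cost for fast equality/range lookups on one more column, which is the same trade-off as a traditional secondary index.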
Is there any plan to support deserializing Protobuf through a schema registry? It's needed.
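For context, Confluent-style schema registries frame each message with a magic byte (0) and a 4-byte big-endian schema ID before the serialized payload; the consumer fetches the schema by ID and only then decodes the body. (Confluent's Protobuf serializer additionally inserts message-index varints after the ID, omitted in this sketch.) Splitting that framing looks like:

```python
import struct

def split_wire_format(message: bytes):
    """Return (schema_id, payload) from a Confluent wire-format message.
    Byte 0 is the magic byte (0); bytes 1-4 are the big-endian schema ID."""
    if len(message) < 5 or message[0] != 0:
        raise ValueError("not a schema-registry framed message")
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id, message[5:]
```

Supporting this in a consumer means looking up schema_id in the registry, caching the fetched schema, and then deserializing the payload with it.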
On full-text search indices: will that include things like TF-IDF, lemmatization, stop-word removal, etc.? (mentioned in the blog post)
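As an illustration of two of those pieces, here is a tiny sketch of stop-word removal plus TF-IDF scoring over an in-memory corpus (lemmatization omitted, and the stop-word list is an arbitrary example). It shows the scoring idea only, not the proposed index implementation.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}  # illustrative list

def tokenize(text: str):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def tf_idf(docs: list[str]):
    """Return, per document, a dict of token -> TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores
```

A real inverted index would precompute the document frequencies and store per-token posting lists, so queries score only matching documents.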
can we please keep the JSON type https://clickhouse.com/docs/en/sql-reference/data-types/json ? it's been useful to us in many ways and it's sad it's going to be removed, ref: https://clickhousedb.slack.com/archives/CU478UEQZ/p1717653663312899
correcting myself: there is amazing work being done to revamp the JSON type, details here: https://github.com/ClickHouse/ClickHouse/issues/54864
@earlev4 Recursive CTEs have been added in 24.4.
@alexey-milovidov Is Lightweight Updates v2 a feature of ClickHouse Cloud or the community version?
Lightweight updates are both for the Cloud and open-source.
This is the ClickHouse roadmap for 2024. Descriptions and links are to be filled in.
This roadmap does not cover the tasks related to infrastructure, orchestration, documentation, marketing, external integrations, drivers, etc.
See also:
Roadmap 2023: https://github.com/ClickHouse/ClickHouse/issues/44767
Roadmap 2022: https://github.com/ClickHouse/ClickHouse/issues/32513
Roadmap 2021: https://github.com/ClickHouse/ClickHouse/issues/17623
Roadmap 2020: link
SQL Compatibility
✔️ Enable Analyzer by default
Non-constant CASE, non-constant IN
Remove old predicate pushdown mechanics
Correlated subqueries with decorrelation
Transforming anti-join: LEFT JOIN ... WHERE ... IS NULL to NOT IN
Deriving index condition from the right-hand side of INNER JOIN
JOINs reordering and extended pushdown
Time data type
Data Storage
✔️ Userspace page cache
✔️ Adaptive mode for asynchronous inserts
✔️ Semistructured Data: Variant data type
Semistructured Data: Sharded Maps
Semistructured Data: JSON data type
Transactions for Replicated tables
Lightweight Updates v2
Uniform treatment of LowCardinality, Sparse, and Const columns
Settings to control the consistency of projections on updates
Replicated Catalog :cloud:
On-disk storage for Keeper
Query cache on disk
✔️ Decoupling of object storages and metadata
Full-text indices (production readiness)
Vector search indices (production readiness)
Security, access control, and isolation
✔️ Definers (encapsulation of access control) for views
Warnings and limits on the number of database objects
Dynamic configuration of query handlers
JWT authentication :cloud:
Data masking in row-level security :cloud:
Secure storage for named collections :cloud:
Cancellation points for long operations
Resource scheduler (continuation)
Query Processing
Parallel replicas with task callbacks (production readiness)
Parallel replicas with parallel distributed INSERT SELECT
Automatic usage of -Cluster table functions
Adaptive thresholds for data spilling on disk
Optimization with subcolumns by default
Interfaces & External Data
Support for Iceberg Data Catalog
Support for Hive-style partitioning
Explicit queries in external tables
Even simpler data upload
HTTP API for simple query construction
Unification of data lake and file-like functions
Testing & Hardening
Revive coverage
Fuzzer of data formats
Fuzzer of network protocols
Server-side AST query fuzzer
Generic fuzzer for query text
Randomization of DETACH/ATTACH in tests
Integration with SQLSmith
Embedded documentation
Experiments & Research
Multi-RAFT for Keeper
MaterializedPostgreSQL (production readiness)
SSH protocol for the server
Support for PromQL
Streaming queries
Freeform text format
Key-value data marts
Decouple of columns and buffers
Lazy reading of ranges
Instant attaching tables from backups
An object storage to borrow space from the filesystem cache
COW disks
ALTER PRIMARY KEY
Autocompletion with language models
Decentralized tables
Unique Key Constraint
The roadmap covers the top focus items for both external contributors and full-time ClickHouse employees. The items marked with the :cloud: icon are meant for ClickHouse Cloud (proprietary). We expect 50..80% completion of the roadmap according to the results from previous years.