
MergeTree over S3 improvements (RFC) #54644

Open alex-zaitsev opened 1 year ago

alex-zaitsev commented 1 year ago

Background

Object storage support for MergeTree tables was added to ClickHouse in 2020 and has evolved since then. The current implementation is described in the Double.Cloud article “How S3-based ClickHouse hybrid storage works under the hood”. We will use S3 as a synonym for object storage, but the discussion also applies to GCS and Azure Blob Storage.
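
For reference, a minimal sketch of how the current s3 disk is typically configured (endpoint, credentials and paths are placeholders): data files are uploaded as objects with generated names, while small local metadata files under `metadata_path` map each table file to its object keys.

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <s3_disk>
                <type>s3</type>
                <!-- placeholder endpoint and credentials -->
                <endpoint>https://my-bucket.s3.amazonaws.com/clickhouse/</endpoint>
                <access_key_id>...</access_key_id>
                <secret_access_key>...</secret_access_key>
                <!-- local metadata files that map table files to object keys -->
                <metadata_path>/var/lib/clickhouse/disks/s3_disk/</metadata_path>
            </s3_disk>
        </disks>
        <policies>
            <s3_main>
                <volumes>
                    <main>
                        <disk>s3_disk</disk>
                    </main>
                </volumes>
            </s3_main>
        </policies>
    </storage_configuration>
</clickhouse>
```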

While S3 support has improved substantially in recent years, there are still a number of problems with the current implementation (see also [3] and [4]):

ClickHouse Inc. built its own solution to these problems, the SharedMergeTree storage engine, which is not going to be released as open source.

The purpose of this document is to propose a solution that substantially improves object storage support for MergeTree and that can be implemented by the open source ClickHouse community under the Apache 2.0 license.

Requirements

We consider that MergeTree over S3 should meet the following high-level requirements:

Additional requirements to consider:

Proposal

We propose to focus on two different tracks that can be executed in parallel:

  1. Improving zero-copy replication, which builds on the existing S3 disk design
  2. Improving the storage model for S3 data

We do not need to address dynamic sharding, which is also a feature of ClickHouse’s SharedMergeTree.

1. Improving zero-copy replication

The problem with the current zero-copy replication is that it has to manage both replication of local metadata files in the traditional way and zero-copy replication of data on object storage. Mixing those two in one solution is error-prone. In order to make zero-copy replication more robust, S3 metadata needs to be moved from local storage to Keeper. Here is how it can be done:

Since this may increase the amount of data stored in Keeper, compact metadata also needs to be implemented. That will also reduce S3 overheads (see [5]).

We can keep the local storage metadata option for compatibility and for single-node operation. Alternatively, since Keeper can be used in embedded mode, it could be used in single-node deployments as well, though ClickHouse would then require more complex configuration.
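
For context, zero-copy replication over object storage is currently toggled by a MergeTree-level setting; a minimal server-config sketch (placement under `merge_tree` applies it to all tables, the value shown is illustrative):

```xml
<clickhouse>
    <merge_tree>
        <!-- replicas share the same S3 objects instead of fetching full copies;
             today only the local metadata files are replicated between nodes,
             which is what this proposal suggests moving into Keeper -->
        <allow_remote_fs_zero_copy_replication>1</allow_remote_fs_zero_copy_replication>
    </merge_tree>
</clickhouse>
```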

2. Improving MergeTree over S3 storage model

The current MergeTree over S3 implementation is backed by the S3 disk. We cannot change it without breaking changes. There is another, undocumented S3_plain disk that is better in the long term. The S3_plain disk differs from the S3 disk in that it stores data in exactly the same structure as on a local file system: the file path matches the object path, so no local metadata files are needed. This has the following implications:

We propose using the S3_plain disk instead of S3 in order to address the S3 local metadata problem (a configuration sketch is given at the end of this section). The current implementation of the S3_plain disk is very limited and needs to be improved. The following changes are needed:

Storage level:

Replication:

The functionality should be generic and applicable to other object storage types using their corresponding APIs or s3proxy [8].
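
For illustration, a minimal s3_plain disk definition under the current configuration scheme (endpoint and credentials are placeholders). Because the object key mirrors the local file path (for example, something like `store/<table uuid>/<part name>/columns.txt`), the bucket contents can be interpreted without any local metadata files:

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <s3_plain_disk>
                <!-- same endpoint/credential options as the s3 disk; only the type differs -->
                <type>s3_plain</type>
                <endpoint>https://my-bucket.s3.amazonaws.com/clickhouse/</endpoint>
                <access_key_id>...</access_key_id>
                <secret_access_key>...</secret_access_key>
            </s3_plain_disk>
        </disks>
    </storage_configuration>
</clickhouse>
```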

References

  1. https://clickhouse.com/blog/concept-cloud-merge-tree-tables – Cloud MergeTree tables concept
  2. https://github.com/ClickHouse/ClickHouse/issues/13978 – Simplify ReplicatedMergeTree (RFC) (Yandex)
  3. https://gist.github.com/filimonov/75360ce79c4a73e6adfab76a3a5705d1 – S3 discussion (Altinity)
  4. https://docs.google.com/document/d/1sltWM2UJnAvtmYK_KMPvrKO9xB7PcHPfWsiOa7MbA14/edit – S3 Zero Copy replication RFC (Yandex Cloud)
  5. https://github.com/ClickHouse/ClickHouse/issues/46813 – Compact Metadata
  6. https://github.com/ClickHouse/ClickHouse/issues/48620 – SharedMetadataStorage community request
  7. https://github.com/ClickHouse/ClickHouse/issues/45766 – trivial support for re-sharding (RFC), in progress by Azat.
  8. https://github.com/gaul/s3proxy – S3 API proxy for other clouds; can be used for Azure and GCP
  9. https://github.com/ClickHouse/ClickHouse/pull/54567 – SharedMetadataStorage community PR
  10. https://github.com/ClickHouse/ClickHouse/issues/53620 – replica groups proposal

Appendix A. Feature compatibility for different MergeTree over S3 implementations

| Feature | S3 | S3_plain |
| --- | --- | --- |
| Metadata | separate | combined |
| Can be restored from S3 only | no | yes |
| SELECT | yes | yes |
| INSERT | yes | yes |
| Merges | yes | yes |
| ALTER TABLE DELETE | yes | yes |
| ALTER TABLE UPDATE | yes | may require full data rewrite |
| Moves | yes | yes |
| Adding/removing column | yes | yes, w/o mutation |
| Adding/removing index and TTL | yes | yes, w/o mutation |
| Rename table | yes | yes, table is referenced by UUID |
| Rename column | yes | no, may require add/remove |
| Lightweight delete | yes | ? |

danthegoodman1 commented 1 year ago

To chime in with some of the things that led me to build IceDB (and the respective S3 proxy for it):

  1. Decoupled read/write. I should be able to use a 1 CPU / 4 GB RAM writer, but thousands of cores dynamically for reading
  2. True multitenancy: 1 table with 1M rows or 1M tables with 1 row each shouldn't matter much in terms of how quickly I can read from or write to a given table. In addition, I should be able to execute user queries without thinking about security, because I know they can only access certain tables
  3. Open format: if there were a package to read the ClickHouse metadata of a web table that I could bind to other languages, that would be amazing, but right now Parquet is the winner here. Not only does everything understand it, it's also easier for data export, because I don't have to convert rows to another format to allow bulk exports

I've recently come to the realization that what I am trying to build with ClickHouse as my query engine is sort of like BigQuery, but from Google's perspective, not a GCP user's. I want to use ClickHouse in a way where users can share tables, can scale reads to any number of users, and the only "idle" cost for a table is the storage.

The security mechanism has to be delivered through my S3 proxy, which works for both web tables and IceDB (my custom Parquet merge engine). Even for s3_plain tables the S3 proxy can handle it, but right now that doesn't work well because it's not updatable. I've observed ludicrous performance from web tables in S3, and I'm trying to get to that speed with Parquet (100M+ rows/s).

Something clickhouse native to hit the above requirements would be amazing.

I certainly have a more esoteric use case than others, happy to expand on it in more detail if those targets are in alignment with others

danthegoodman1 commented 1 year ago

I will also add that S3-backed tables are egregious in their S3 API usage: I have tables doing 11M API calls per day, with the API calls costing 100x more than the stored data (and these are large tables).

danthegoodman1 commented 1 year ago

I think it would be ideal not to require Keeper for read replicas if we can choose to store data in S3 itself. I'm not sure what the query performance changes would be (if any). This would make running single-node "read replicas" easier, because all they need to do is access S3.

danthegoodman1 commented 1 year ago

Also, we need docs on s3_plain

lazywei commented 1 year ago

I'm curious how part merging works in the s3_plain case. Since we have transparent data structures, is it true that we never need to update metadata, because all we're doing is creating new parts and deleting old ones? Or do we still incur some metadata operations that should be cached for better performance?

Another thing we noticed in the current zero-copy replication approach is that we often witness huge and slow part merges, and not all replicas participate in them. Only a few replicas have high CPU/RAM utilization while the rest are simply waiting to fetch.

hodgesrm commented 1 year ago

@danthegoodman1 I agree on the need for documentation for s3_plain. It's hard to reason about things you don't understand. The RFC has the clearest statement of its purpose that I've seen so far.

alifirat commented 1 year ago

Hey @alex-zaitsev

One benefit of using the s3 disk is the performance that comes from using prefixes. In your RFC, you never mention what performance we should expect from the s3_plain disk.

zheyu001 commented 1 year ago

This RFC makes sense to me, great writing! But just out of curiosity, why use blob storage? I'm seeing people in the community talking about using a clustered file system, which seems very interesting. It could potentially provide better performance compared to S3 and also achieve the goal of splitting compute and storage (at least to some extent).

danthegoodman1 commented 1 year ago

Blob is far more ubiquitous, easier to use, and cheaper

danthegoodman1 commented 1 year ago

There’s also already a PR up for zero copy CFS metadata support

zheyu001 commented 1 year ago

> Blob is far more ubiquitous, easier to use, and cheaper

Makes sense. Actually, what first attracted me to CH was its speed. Blob storage (or S3, to be more specific) naturally introduces higher latency, which makes CH lose its advantage, though throwing a bunch of machines at it can ease the problem to some extent. And it reminds me of Trino.

But I'm not trying to say which solution is better. I'm pretty sure both have their own pros and cons. Really looking forward to seeing how both solutions unfold!

danthegoodman1 commented 1 year ago

CH with S3 is orders of magnitude quicker than Trino to start a query. ClickHouse Cloud runs entirely on S3-backed MergeTrees, so it definitely doesn’t lose its advantage there. Maybe in the worst case you add 100 ms and the query is 2-3x slower, but with disk caching you can still keep your MV queries in the single-digit-millisecond range.

mga-chka commented 1 year ago

From the tests we did at Contentsquare, S3 plus a cache on high-throughput disks (like NVMe SSDs) gives a much better performance/cost ratio than ClickHouse on HDD. If you're interested, @alifirat presented some of these benchmarks at last week's ClickHouse meetup. The slides should soon be uploaded to https://github.com/ClickHouse/clickhouse-presentations.

zheyu001 commented 1 year ago

> S3 plus a cache on high-throughput disks (like NVMe SSDs) gives a much better performance/cost ratio than ClickHouse on HDD

This is very interesting! Looking forward to seeing the presentation slides! Meanwhile, I'm really curious about the caching strategy and the data granularity (a few hundred MB per object?).

alifirat commented 1 year ago

Hey @zheyu001

I don't have all the implementation details, but I guess it caches some information related to the parts (compressed or uncompressed), so I think it depends on your table and partition size.

zheyu001 commented 1 year ago

> it caches some information related to the parts (compressed or uncompressed)

This can definitely help CH decide which parts to download to local disk, which reduces the overhead of S3 API calls. But for a data part on S3 (say around 200 MB), the download latency varies from ~50 ms to a few seconds depending on network conditions, whereas CH takes only < 20 ms (or even less) to process such a part. If the caching strategy is bad enough, CH needs to download S3 data parts for every query, which makes CH's data processing speed seem less important, since most of the time is spent waiting for data to download.

Even in an average case where every query has only a 1% chance of being delayed by data downloads, this latency jitter probably makes users doubt the stability of the system, especially after they were so excited about the performance with HDD/SSD.

zheyu001 commented 1 year ago

Actually, I just realized this doc is mentioned in the RFC. Let me dive deeper there, since the current conversation has started to drift away from the RFC itself.

hodgesrm commented 1 year ago

> Makes sense. Actually, what first attracted me to CH was its speed. Blob storage (or S3, to be more specific) naturally introduces higher latency, which makes CH lose its advantage.

A lot of users share your concern.

  1. S3 is not necessarily slower. Network transfer rates compare favorably with EBS in many cases; the big issue is that the S3 API doesn't natively use the OS buffer cache. That's partly solved by the disk cache in its current state (a sketch of that layering follows this list). It could get a lot better, for example by distributing queries so that each server reads disjoint parts, avoiding repeatedly loading the same data across all caches.
  2. In high-end use cases S3 is vastly more cost-efficient since (a) it's 4-5x less expensive than EBS, (b) it's unlimited, unlike attached SSD, and (c) it's internally replicated, so you don't need extra full copies.
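
A minimal sketch of the cache layering mentioned in point 1, using the documented `cache` disk type wrapped around an s3 disk (disk names, path and size are illustrative):

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <s3_cache>
                <type>cache</type>
                <!-- wraps an object storage disk defined elsewhere in the config -->
                <disk>s3_disk</disk>
                <path>/var/lib/clickhouse/disks/s3_cache/</path>
                <max_size>100Gi</max_size>
            </s3_cache>
        </disks>
        <policies>
            <s3_cached>
                <volumes>
                    <main>
                        <disk>s3_cache</disk>
                    </main>
                </volumes>
            </s3_cached>
        </policies>
    </storage_configuration>
</clickhouse>
```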

I like the s3_plain direction because it promises a single format that works across all storage types, which means users can pick what they want.

mattnthat commented 11 months ago

This feature feels critical to the continued success of ClickHouse.

S3 is the standard for big data storage; it's broadly supported not just by hyperscale providers but by independent cloud/network infrastructure providers (RackCorp, Cloudflare, including open source options like MinIO) and on-premise data storage vendors (e.g. NetApp, Hitachi).

Having this separation of compute and storage is the ideal world in terms of infrastructure architecture.

Happy to help with beta testing or anything related to progressing this feature. How do we progress this work item?

makeavish commented 8 months ago

> Reduce the number of S3 operations, which can drive costs up unnecessarily.

@alex-zaitsev How will s3_disk solve this?

zheyu001 commented 8 months ago

For EBS users, it seems that with the S3 CSI driver, s3_plain_disk is no longer a blocker? Even better, with S3 Express One Zone, users can even get decent performance.

alex-zaitsev commented 4 months ago

Phase 1 continuation: https://github.com/ClickHouse/ClickHouse/issues/62936

alex-zaitsev commented 4 months ago

> Reduce the number of S3 operations, which can drive costs up unnecessarily.

> @alex-zaitsev How will s3_disk solve this?

The disk itself cannot solve it. One approach is a compact metadata format, which would reduce the number of objects on S3 and therefore the number of S3 requests. It was started in https://github.com/ClickHouse/ClickHouse/pull/54997 but has not been completed yet.

alex-zaitsev commented 4 months ago

s3_plain_rewritable (https://github.com/ClickHouse/ClickHouse/pull/61116) addresses some of the Phase 2 issues. See https://clickhouse.com/docs/en/operations/storing-data#s3-plain-rewritable-storage
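
A minimal disk definition sketch based on the linked docs (endpoint and credentials are placeholders); structurally it is the s3_plain example above with only the disk type changed:

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <disk_s3_plain_rewritable>
                <type>s3_plain_rewritable</type>
                <endpoint>https://my-bucket.s3.amazonaws.com/clickhouse/</endpoint>
                <access_key_id>...</access_key_id>
                <secret_access_key>...</secret_access_key>
            </disk_s3_plain_rewritable>
        </disks>
    </storage_configuration>
</clickhouse>
```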

jirislav commented 2 months ago

> For EBS users, it seems that with the S3 CSI driver, s3_plain_disk is no longer a blocker?

This is not a good solution if you want to support high availability by running multiple replicas. You would need to either share the disk among multiple replicas (I don't think that's supported) or replicate the data on S3 itself, which just multiplies the costs of S3 & operations.

jirislav commented 2 months ago

@alex-zaitsev What do you think about creating separate feature requests for each of the 21 proposed steps in your proposal? I don't see any relevant comments as to why we shouldn't move forward with this RFC. I also believe it would be easier for new contributors to join this "proper S3 support" journey by focusing on only a subset of the problem.