
MergeTree over S3 improvements (RFC) #54644

Open alex-zaitsev opened 1 year ago

alex-zaitsev commented 1 year ago

Background

Object storage support for MergeTree tables was added to ClickHouse in 2020 and has evolved since then. The current implementation is described in the Double.Cloud article “How S3-based ClickHouse hybrid storage works under the hood”. We will use S3 as a synonym for object storage, but the discussion also applies to GCS and Azure Blob Storage.
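
For reference, a minimal sketch of how the current s3 disk is typically configured (endpoint, credentials and paths are placeholders): data files are uploaded as objects with generated names, while small local metadata files under `metadata_path` map each table file to its object keys.

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <s3_disk>
                <type>s3</type>
                <!-- placeholder endpoint and credentials -->
                <endpoint>https://my-bucket.s3.amazonaws.com/clickhouse/</endpoint>
                <access_key_id>...</access_key_id>
                <secret_access_key>...</secret_access_key>
                <!-- local metadata files that map table files to object keys -->
                <metadata_path>/var/lib/clickhouse/disks/s3_disk/</metadata_path>
            </s3_disk>
        </disks>
        <policies>
            <s3_main>
                <volumes>
                    <main>
                        <disk>s3_disk</disk>
                    </main>
                </volumes>
            </s3_main>
        </policies>
    </storage_configuration>
</clickhouse>
```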

While S3 support has improved substantially in recent years, there are still a number of problems with the current implementation (see also [3] and [4]):

ClickHouse Inc. built its own solution to these problems, the SharedMergeTree storage engine, which is not going to be released as open source.

The purpose of this document is to propose a solution that substantially improves object storage support for MergeTree and that can be implemented by the open source ClickHouse community under the Apache 2.0 license.

Requirements

We consider that MergeTree over S3 should meet the following high-level requirements:

Additional requirements to consider:

Proposal

We propose to focus on two different tracks that can be executed in parallel:

  1. Improving zero-copy replication, which builds on the existing S3 disk design
  2. Improving the storage model for S3 data

We do not need to address dynamic sharding, which is also a feature of ClickHouse’s SharedMergeTree.

1. Improving zero-copy replication

The problem with the current zero-copy replication is that it has to manage both replication of local metadata files in the traditional way and zero-copy replication of data on object storage. Mixing those two in one solution is error-prone. In order to make zero-copy replication more robust, S3 metadata needs to be moved from local storage to Keeper. Here is how it can be done:

Since this may increase the amount of data stored in Keeper, compact metadata also needs to be implemented. That will also reduce S3 overheads (see [5]).

We can keep the local storage metadata option for compatibility and for single-node operation. Alternatively, since Keeper can be used in embedded mode, it could be used in single-node deployments as well, though ClickHouse would then require more complex configuration.
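
For context, zero-copy replication over object storage is currently toggled by a MergeTree-level setting; a minimal server-config sketch (placement under `merge_tree` applies it to all tables, the value shown is illustrative):

```xml
<clickhouse>
    <merge_tree>
        <!-- replicas share the same S3 objects instead of fetching full copies;
             today only the local metadata files are replicated between nodes,
             which is what this proposal suggests moving into Keeper -->
        <allow_remote_fs_zero_copy_replication>1</allow_remote_fs_zero_copy_replication>
    </merge_tree>
</clickhouse>
```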

2. Improving MergeTree over S3 storage model

The current MergeTree over S3 implementation is backed by the S3 disk. We cannot change it without breaking changes. There is another, undocumented S3_plain disk that is better in the long term. The S3_plain disk differs from the S3 disk in that it stores data in exactly the same structure as on a local file system: the file path matches the object path, so no local metadata files are needed. This has the following implications:

We propose using the S3_plain disk instead of S3 in order to address the S3 local metadata problem (a configuration sketch is given at the end of this section). The current implementation of the S3_plain disk is very limited and needs to be improved. The following changes are needed:

Storage level:

Replication:

The functionality should be generic and applicable to other object storage types using their corresponding APIs or s3proxy [8].
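
For illustration, a minimal s3_plain disk definition under the current configuration scheme (endpoint and credentials are placeholders). Because the object key mirrors the local file path (for example, something like `store/<table uuid>/<part name>/columns.txt`), the bucket contents can be interpreted without any local metadata files:

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <s3_plain_disk>
                <!-- same endpoint/credential options as the s3 disk; only the type differs -->
                <type>s3_plain</type>
                <endpoint>https://my-bucket.s3.amazonaws.com/clickhouse/</endpoint>
                <access_key_id>...</access_key_id>
                <secret_access_key>...</secret_access_key>
            </s3_plain_disk>
        </disks>
    </storage_configuration>
</clickhouse>
```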

References

  1. https://clickhouse.com/blog/concept-cloud-merge-tree-tables – Cloud MergeTree tables concept
  2. https://github.com/ClickHouse/ClickHouse/issues/13978 – Simplify ReplicatedMergeTree (RFC) (Yandex)
  3. https://gist.github.com/filimonov/75360ce79c4a73e6adfab76a3a5705d1 – S3 discussion (Altinity)
  4. https://docs.google.com/document/d/1sltWM2UJnAvtmYK_KMPvrKO9xB7PcHPfWsiOa7MbA14/edit – S3 Zero Copy replication RFC (Yandex Cloud)
  5. https://github.com/ClickHouse/ClickHouse/issues/46813 – Compact Metadata
  6. https://github.com/ClickHouse/ClickHouse/issues/48620 – SharedMetadataStorage community request
  7. https://github.com/ClickHouse/ClickHouse/issues/45766 – trivial support for re-sharding (RFC), in progress by Azat.
  8. https://github.com/gaul/s3proxy – S3 API proxy for other clouds; can be used for Azure and GCP
  9. https://github.com/ClickHouse/ClickHouse/pull/54567 – SharedMetadataStorage community PR
  10. https://github.com/ClickHouse/ClickHouse/issues/53620 – replica groups proposal

Appendix A. Feature compatibility for different MergeTree over S3 implementations

| Feature | S3 | S3_plain |
| --- | --- | --- |
| Metadata | separate | combined |
| Can be restored from S3 only | no | yes |
| SELECT | yes | yes |
| INSERT | yes | yes |
| Merges | yes | yes |
| ALTER TABLE DELETE | yes | yes |
| ALTER TABLE UPDATE | yes | may require full data rewrite |
| Moves | yes | yes |
| Adding/removing column | yes | yes, w/o mutation |
| Adding/removing index and TTL | yes | yes, w/o mutation |
| Rename table | yes | yes, table is referenced by UUID |
| Rename column | yes | no, may require add/remove |
| Lightweight delete | yes | ? |

danthegoodman1 commented 1 year ago

To chime in with some of the things that led me to build IceDB (and the respective S3 proxy for it):

  1. Decoupled read/write. I should be able to use a 1 CPU / 4 GB RAM writer, but thousands of cores dynamically for reading
  2. True multitenancy: 1 table with 1M rows or 1M tables with 1 row each shouldn't matter much in terms of how quickly I can read from or write to a given table. In addition, I should be able to execute user queries without thinking about security, because I know they can only access certain tables
  3. Open format: if there were a package to read the ClickHouse metadata of a web table that I could bind to other languages, that would be amazing, but right now Parquet is the winner here. Not only does everything understand it, it's also easier for data export, because I don't have to convert rows to another format to allow bulk exports

I've recently come to the realization that what I am trying to build with ClickHouse as my query engine is sort of like BigQuery, but from Google's perspective, not a GCP user's. I want to use ClickHouse in a way where users can share tables, can scale reads to any number of users, and the only "idle" cost for a table is the storage.

The security mechanism has to be delivered through my S3 proxy, which works for both web tables and IceDB (my custom Parquet merge engine). Even for s3_plain tables the S3 proxy can handle it, but right now that doesn't work well because it's not updatable. I've observed ludicrous performance from web tables in S3, and I'm trying to get to that speed with Parquet (100M+ rows/s).

Something clickhouse native to hit the above requirements would be amazing.

I certainly have a more esoteric use case than others, happy to expand on it in more detail if those targets are in alignment with others

danthegoodman1 commented 1 year ago

I will also add that S3-backed tables are egregious in their S3 API usage: I have tables doing 11M API calls per day, with the API calls costing 100x more than the stored data (and these are large tables).

danthegoodman1 commented 1 year ago

I think it would be ideal not to require Keeper for read replicas if we can choose to store data in S3 itself. I'm not sure what the query performance changes would be (if any). This would make running single-node "read replicas" easier, because all they need to do is access S3.

danthegoodman1 commented 1 year ago

Also, we need docs on s3_plain

lazywei commented 1 year ago

I'm curious how part merging works in the s3_plain case. Since we have transparent data structures, is it true that we never need to update metadata, because all we're doing is creating new parts and deleting old ones? Or do we still incur some metadata operations that should be cached for better performance?

Another thing we noticed in the current zero-copy replication approach is that we often witness huge and slow part merges, and not all replicas participate in them. Only a few replicas have high CPU/RAM utilization while the rest are simply waiting to fetch.

hodgesrm commented 1 year ago

@danthegoodman1 I agree on the need for documentation for s3_plain. It's hard to reason about things you don't understand. The RFC has the clearest statement of its purpose that I've seen so far.

alifirat commented 1 year ago

Hey @alex-zaitsev

One benefit of using the s3 disk is the performance that comes from using prefixes. In your RFC, you never mention what performance we should expect from the s3_plain disk.

zheyu001 commented 1 year ago

This RFC makes sense to me, great writing! But just out of curiosity, why use blob storage? I'm seeing people in the community talking about using a clustered file system, which seems very interesting. It could potentially provide better performance compared to S3 and also achieve the goal of splitting compute and storage (at least to some extent).

danthegoodman1 commented 1 year ago

Blob is far more ubiquitous, easier to use, and cheaper

danthegoodman1 commented 1 year ago

There’s also already a PR up for zero copy CFS metadata support

zheyu001 commented 1 year ago

> Blob is far more ubiquitous, easier to use, and cheaper

Makes sense. Actually, what first attracted me to CH was its speed. Blob storage (or S3, to be more specific) naturally introduces higher latency, which makes CH lose its advantage, though throwing a bunch of machines at it can ease the problem to some extent. And it reminds me of Trino.

But I'm not trying to say which solution is better. I'm pretty sure both have their own pros and cons. Really looking forward to seeing how both solutions unfold!

danthegoodman1 commented 1 year ago

CH with S3 is orders of magnitude quicker than Trino to start a query. ClickHouse Cloud runs entirely on S3-backed MergeTrees, so it definitely doesn’t lose its advantage there. Maybe in the worst case you add 100 ms and the query is 2-3x slower, but with disk caching you can still keep your MV queries in the single-digit-millisecond range.

mga-chka commented 1 year ago

From the tests we did at Contentsquare, S3 plus a cache on high-throughput disks (like NVMe SSDs) gives a much better performance/cost ratio than ClickHouse on HDD. If you're interested, @alifirat presented some of these benchmarks at last week's ClickHouse meetup. The slides should soon be uploaded to https://github.com/ClickHouse/clickhouse-presentations.

zheyu001 commented 1 year ago

> S3 plus a cache on high-throughput disks (like NVMe SSDs) gives a much better performance/cost ratio than ClickHouse on HDD

This is very interesting! Looking forward to seeing the presentation slides! Meanwhile, I'm really curious about the caching strategy and the data granularity (a few hundred MB per object?).

alifirat commented 1 year ago

Hey @zheyu001

I don't have all the implementation details, but I guess it caches some information related to the parts (compressed or uncompressed), so I think it depends on your table and partition size.

zheyu001 commented 1 year ago

> it caches some information related to the parts (compressed or uncompressed)

This can definitely help CH decide which parts to download to local disk, which reduces the overhead of S3 API calls. But for a data part on S3 (say around 200 MB), the download latency varies from ~50 ms to a few seconds depending on network conditions, whereas CH takes only < 20 ms (or even less) to process such a part. If the caching strategy is bad enough, CH needs to download S3 data parts for every query, which makes CH's data processing speed seem less important, since most of the time is spent waiting for data to download.

Even in an average case where every query has only a 1% chance of being delayed by data downloads, this latency jitter probably makes users doubt the stability of the system, especially after they were so excited about the performance with HDD/SSD.

zheyu001 commented 1 year ago

Actually, I just realized this doc is mentioned in the RFC. Let me dive deeper there, since the current conversation has started to drift away from the RFC itself.

hodgesrm commented 1 year ago

> Makes sense. Actually, what first attracted me to CH was its speed. Blob storage (or S3, to be more specific) naturally introduces higher latency, which makes CH lose its advantage.

A lot of users share your concern.

  1. S3 is not necessarily slower. Network transfer rates compare favorably with EBS in many cases; the big issue is that the S3 API doesn't natively use the OS buffer cache. That's partly solved by the disk cache in its current state (a sketch of that layering follows this list). It could get a lot better, for example by distributing queries so that each server reads disjoint parts, avoiding repeatedly loading the same data across all caches.
  2. In high-end use cases S3 is vastly more cost-efficient since (a) it's 4-5x less expensive than EBS, (b) it's unlimited, unlike attached SSD, and (c) it's internally replicated, so you don't need extra full copies.
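
A minimal sketch of the cache layering mentioned in point 1, using the documented `cache` disk type wrapped around an s3 disk (disk names, path and size are illustrative):

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <s3_cache>
                <type>cache</type>
                <!-- wraps an object storage disk defined elsewhere in the config -->
                <disk>s3_disk</disk>
                <path>/var/lib/clickhouse/disks/s3_cache/</path>
                <max_size>100Gi</max_size>
            </s3_cache>
        </disks>
        <policies>
            <s3_cached>
                <volumes>
                    <main>
                        <disk>s3_cache</disk>
                    </main>
                </volumes>
            </s3_cached>
        </policies>
    </storage_configuration>
</clickhouse>
```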

I like the s3_plain direction because it promises a single format that works across all storage types, which means users can pick what they want.

mattnthat commented 11 months ago

This feature feels critical to the continued success of ClickHouse.

S3 is the standard for big data storage; it's broadly supported not just by hyperscale providers but by independent cloud/network infrastructure providers (RackCorp, Cloudflare, including open source options like MinIO) and on-premise data storage vendors (e.g. NetApp, Hitachi).

Having this separation of compute and storage is the ideal world in terms of infrastructure architecture.

Happy to help with beta testing or anything related to progressing this feature. How do we progress this work item?

makeavish commented 8 months ago

> Reduce the number of S3 operations, which can drive costs up unnecessarily.

@alex-zaitsev How will s3_disk solve this?

zheyu001 commented 8 months ago

For EBS users, it seems that with the S3 CSI driver, s3_plain_disk is no longer a blocker? Even better, with S3 Express One Zone, users can even get decent performance.

alex-zaitsev commented 4 months ago

Phase 1 continuation: https://github.com/ClickHouse/ClickHouse/issues/62936

alex-zaitsev commented 4 months ago

> Reduce the number of S3 operations, which can drive costs up unnecessarily.

> @alex-zaitsev How will s3_disk solve this?

The disk itself cannot solve it. One approach is a compact metadata format, which would reduce the number of objects on S3 and therefore the number of S3 requests. It was started in https://github.com/ClickHouse/ClickHouse/pull/54997 but has not been completed yet.

alex-zaitsev commented 4 months ago

s3_plain_rewritable (https://github.com/ClickHouse/ClickHouse/pull/61116) addresses some of the Phase 2 issues. See https://clickhouse.com/docs/en/operations/storing-data#s3-plain-rewritable-storage
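
A minimal disk definition sketch based on the linked docs (endpoint and credentials are placeholders); structurally it is the s3_plain example above with only the disk type changed:

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <disk_s3_plain_rewritable>
                <type>s3_plain_rewritable</type>
                <endpoint>https://my-bucket.s3.amazonaws.com/clickhouse/</endpoint>
                <access_key_id>...</access_key_id>
                <secret_access_key>...</secret_access_key>
            </disk_s3_plain_rewritable>
        </disks>
    </storage_configuration>
</clickhouse>
```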

jirislav commented 2 months ago

> For EBS users, it seems that with the S3 CSI driver, s3_plain_disk is no longer a blocker?

This is not a good solution if you want to support high availability by running multiple replicas. You would need to either share the disk among multiple replicas (I don't think that's supported) or replicate the data on S3 itself, which just multiplies the costs of S3 & operations.

jirislav commented 2 months ago

@alex-zaitsev What do you think about creating separate feature requests for each of the 21 proposed steps in your proposal? I don't see any relevant comments as to why we shouldn't move forward with this RFC. I also believe it would be easier for new contributors to join this "proper S3 support" journey by focusing on only a subset of the problem.