crate / crate

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Lucene.
https://cratedb.com/product
Apache License 2.0
4k stars 550 forks source link

Improve primary shards balancing/reduce primary shard write overhead #15919

Open hlcianfagna opened 2 months ago

hlcianfagna commented 2 months ago

Problem Statement

With the current shard allocation and balancing mechanisms, it is possible to have a situation where, for instance, given a 2-nodes cluster and a table with 4 shards and 1 replica, 3 primaries and 1 replica go to one node and 3 replicas and 1 primary to the other, instead of 2 primaries and 2 replicas on each. In the large majority of cases this is not a problem, but in very busy systems, ingestion degradation of up to 25% can be seen as nodes with more primary shards will get fully utilized on the CPU while nodes with less primary but mostly replica shards aren't.

The main reason that primary shards aren't evenly balanced relates to that all of the current balancing logic (and related settings like cluster.routing.allocation.balance.index/shard) does not distinguish between a primary and a replica shard. Additionally, the available settings to control the cluster/index.total_shards_per_node will also not distinguish between a primary and a replica, but using these settings for shard balancing would be a kind of a workaround anyhow as the intention for these settings is more of a protection than a control mechanism and also can lead to a situation where no shards can be allocated at all.

Related to https://github.com/elastic/elasticsearch/issues/41543, https://github.com/elastic/elasticsearch/issues/17213, https://github.com/crate/crate/issues/14594.

Possible Solutions

  1. Reduce primary write load so it will be almost the same as a replica write
  2. Introduce primary-only related balancing, e.g. backport related changes of OpenSearch (as they do segment-based replication, their primaries will have more load in general)
  3. Improve balancing logic to take write load into account similar like Elasticsearch did: https://github.com/elastic/elasticsearch/pull/91603 although this seems to target a bit of a different problem of hot nodes in general and not related to primary vs. replica shard distribution.

Considered Alternatives

hlcianfagna commented 2 months ago

The following may also be relevant: