Zone-aware ring checks can fail when an entire zone is unavailable

cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

https://cortexmetrics.io/

Apache License 2.0

5.47k stars 799 forks source link

Zone-aware ring checks can fail when an entire zone is unavailable #4291

Open jtlisi opened 3 years ago

jtlisi commented 3 years ago

Describe the bug

If all the instances in a zone are in the LEAVING state but all of the other zones are fully available, ring checks will return errors.

To Reproduce

This can be replicated but updating the following test:

diff --git a/pkg/ring/ring_test.go b/pkg/ring/ring_test.go
index f1d038d29..8341b5096 100644
--- a/pkg/ring/ring_test.go
+++ b/pkg/ring/ring_test.go
@@ -161,7 +161,7 @@ func TestRing_Get_ZoneAwarenessWithIngesterLeaving(t *testing.T) {
                                "instance-3": {Addr: "127.0.0.3", Zone: "zone-b", State: ACTIVE},
                                "instance-4": {Addr: "127.0.0.4", Zone: "zone-b", State: ACTIVE},
                                "instance-5": {Addr: "127.0.0.5", Zone: "zone-c", State: LEAVING},
-                               "instance-6": {Addr: "127.0.0.6", Zone: "zone-c", State: ACTIVE},
+                               "instance-6": {Addr: "127.0.0.6", Zone: "zone-c", State: LEAVING},
                        }
                        var prevTokens []uint32
                        for id, instance := range instances {

Expected behavior

Cortex should be resilient to unavailable zones as long as > RF / 2 zones are available

pracucci commented 3 years ago

In the PR https://github.com/cortexproject/cortex/pull/3414 we've handled a similar case for GetReplicationSetForOperation().

I'm wondering for which service and setup you're experiencing the issue. Assuming it's related to the ingesters ring (read/write samples), I guess you're fixing it for the case "shard by all labels" is disabled, because when that is enabled we use GetReplicationSetForOperation() instead of Get() and GetReplicationSetForOperation() should handle it.

Could you share more details about the setup?

jtlisi commented 3 years ago

In the PR #3414 we've handled a similar case for GetReplicationSetForOperation().

I'm wondering for which service and setup you're experiencing the issue. Assuming it's related to the ingesters ring (read/write samples), I guess you're fixing it for the case "shard by all labels" is disabled, because when that is enabled we use GetReplicationSetForOperation() instead of Get() and GetReplicationSetForOperation() should handle it.

Could you share more details about the setup?

This is for remote-write operations to the Distributor with shard by all labels enabled. I may be a bit confused. I thought and it looks to me that Push is called for all writes and that functions uses ring.DoBatch which calls Get. Do we swap out implementations if ShardByAllLabels is enabled?