cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/

Cortex seems to miscalculate quorum when one ingester is unhealthy #4654

Open kubicgruenfeld opened 2 years ago

kubicgruenfeld commented 2 years ago

When deploying two different sets of ingesters with different names, the quorum calculation does not work as expected.

We had an incident last night with our Cortex production deployment, and the circumstances are quite interesting; we may have found a bug. We are running Cortex in microservices mode on k8s. For ingesters we use two different StatefulSets: one uses network-distributed storage, the other uses local storage. We run only 2 instances on local storage, and the replication factor is configured to 1. Last night we lost one of the local-storage instances, and Cortex refused to accept writes with "at least 2 live replicas required, could only find 1", which is unexpected since we had plenty of running instances using network storage in the same ring. We did not configure any availability-zone features, and my expectation was that all ingesters are treated as the same deployment since only the names differ. It seems to me that Cortex makes some assumptions about the instance IDs and that the logic around the replication factor is built on those assumptions. I could not find it in the code, but it is definitely not documented anywhere.

Steps to reproduce the behavior:

  1. Start Cortex with two sets of ingesters named differently (see screenshots below).
  2. Set replication_factor to 1.
  3. Try to remote_write metrics.

Expected behavior: Writes are accepted.

Actual behavior: Writes are refused with "at least 2 live replicas required, could only find 1".

This deployment works with replication_factor set to 1, as expected. (Screenshot from 2022-02-24 12-39-41)

This deployment does not work with replication_factor set to 1; the only difference from the one above is the naming of the ingesters. (Screenshot from 2022-02-24 12-42-03)

Environment:

  * Storage Engine:

Additional Context: Here is a thread about the issue in Slack: https://cloud-native.slack.com/archives/CCYDASBLP/p1645700926882399

kubicgruenfeld commented 2 years ago

Could not reproduce with replication_factor set to 3.

kubicgruenfeld commented 2 years ago

Ok, this is really weird.

With replication_factor set to 2, everything works fine with 5 healthy ingesters. (Screenshot from 2022-02-24 16-09-00)

With replication_factor set to 2, an unclean shutdown of one ingester (leaving 4 healthy ingesters) causes remote_write failures with the error "at least 2 live replicas required, could only find 1". (Screenshot from 2022-02-24 16-06-58)

bboreham commented 2 years ago

Due to the way quorums are calculated, with replication factor 2 a single process being down will cause errors. Now that I write that down, it strikes me that there is no reason for this; there's no kind of voting on responses, but it is coded that way.

I don't see much evidence of what you said in the title about IDs.
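
(For context, here is a simplified sketch in Go, with illustrative names rather than the actual Cortex source, of how the ring's default replication strategy filters the replica set chosen for a series. The quorum is `(RF/2)+1`, which for a replication factor of 2 is 2, so a single unhealthy replica already fails the write.)

```go
package main

import "fmt"

// Ingester is a stand-in for a ring instance (illustrative only).
type Ingester struct {
	Addr    string
	Healthy bool
}

// filterReplicas sketches the quorum check applied to the replicas selected
// for a series: at least (RF/2)+1 of them must be healthy, otherwise the
// write is rejected.
func filterReplicas(replicas []Ingester, replicationFactor int) ([]Ingester, error) {
	minSuccess := (replicationFactor / 2) + 1 // RF=1 -> 1, RF=2 -> 2, RF=3 -> 2

	var healthy []Ingester
	for _, r := range replicas {
		if r.Healthy {
			healthy = append(healthy, r)
		}
	}

	if len(healthy) < minSuccess {
		return nil, fmt.Errorf("at least %d live replicas required, could only find %d",
			minSuccess, len(healthy))
	}
	return healthy, nil
}

func main() {
	// RF=2 and one of the two selected replicas is unhealthy:
	_, err := filterReplicas([]Ingester{
		{Addr: "ingester-0", Healthy: true},
		{Addr: "ingester-1", Healthy: false},
	}, 2)
	fmt.Println(err) // at least 2 live replicas required, could only find 1
}
```

(Under this rule, the many other healthy ingesters in the ring would not help: only the replicas selected for that particular series count toward the quorum, which would explain the behaviour reported above.)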

kubicgruenfeld commented 2 years ago

Ah, I was misled. Indeed, it also breaks with 3 ingesters named the same, one unhealthy, and replication_factor set to 2.

Still unexpected to me, since I would think 2 ingesters are enough to fulfill a replication_factor of 2. The error message also seems incorrect in this case: "at least 2 live replicas required, could only find 1".

bboreham commented 2 years ago

Would you like to change the title of this issue so it's easier to understand what you are asking for?

I'll reference #4293 which is about a similar error message.

kubicgruenfeld commented 2 years ago

Checked with the 1.12 release candidate and could still reproduce the behaviour by forcing one of three ingesters to become unhealthy. It still errors with `at least 2 live replicas required, could only find 1 - unhealthy instances: 172.25.8.40:9095`.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stefanandres commented 2 years ago

/unstale

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

KrisBuytaert commented 5 months ago

This is actually still a problem on 1.15.

friedrichg commented 5 months ago

@KrisBuytaert Which problem? What Bryan acknowledged was that the message was misleading.

Using a replication factor of 1 or 2 is not recommended for most production setups, as losing one ingester causes outages due to the way quorums are calculated. That is working as designed; nothing to be changed there.
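
(To make that concrete, a small illustrative calculation under the same `(RF/2)+1` quorum rule sketched earlier: replication factors 1 and 2 tolerate zero unhealthy replicas, while 3 tolerates one, which matches the reporter's observation that the problem disappears with replication_factor set to 3.)

```go
package main

import "fmt"

func main() {
	// Tolerated failures implied by the (RF/2)+1 quorum rule (illustrative).
	for _, rf := range []int{1, 2, 3, 5} {
		minSuccess := (rf / 2) + 1
		fmt.Printf("RF=%d: need %d healthy replicas, tolerate %d failure(s)\n",
			rf, minSuccess, rf-minSuccess)
	}
	// Output:
	// RF=1: need 1 healthy replicas, tolerate 0 failure(s)
	// RF=2: need 2 healthy replicas, tolerate 0 failure(s)
	// RF=3: need 2 healthy replicas, tolerate 1 failure(s)
	// RF=5: need 3 healthy replicas, tolerate 2 failure(s)
}
```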