cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com

kvcoord: fail-fast when all replicas of a range are unavailable #74503

Open tbg opened 2 years ago

tbg commented 2 years ago

With #33007, when a range loses quorum, SQL clients will generally experience fail-fast behavior: access to the unavailable range immediately results in an error instead of hanging indefinitely (as it does in 21.2 and earlier). However, when a range has lost all of its replicas (or all of its replicas are unreachable), I believe that DistSender will keep retrying forever.

While we do try to be resilient to network blips, there is probably value in a heuristic where if a request has been attempted twice for each possible replica, it's time to give up.

We will want to return a RangeUnavailableError in this case (similar to #74500) and have similar SQL UX (#74502).

Jira issue: CRDB-12121

blathers-crl[bot] commented 2 years ago

Hi @tbg, please add a C-ategory label to your issue. Check out the label system docs.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.