cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: dead nodes' liveness not gossiped reliably on liveness lease acquisition #99652

Open tbg opened 1 year ago

tbg commented 1 year ago

Describe the problem

In https://github.com/cockroachdb/cockroach/pull/98150, we removed all but the first liveness gossip on liveness lease acquisition. This was thought to be equivalent to gossiping "everything" on extensions as well. However, it turns out that it is not, and we don't understand why:

https://github.com/cockroachdb/cockroach/issues/99268#issuecomment-1483064544

To Reproduce

See https://github.com/cockroachdb/cockroach/issues/99268#issuecomment-1483064544, which lists the SHAs on which this reproduced in the replicate/wide roachtest.

Expected behavior

Just works? The liveness records of n7-n9 in replicate/wide ought to be reliably available via gossip on n1-n6 when those nodes come back up.
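To make the expectation concrete, here is a minimal sketch of the invariant as a standalone check. It is not CockroachDB code: `gossipView` and `missingLiveness` are hypothetical names invented for illustration, and the view is modeled as a plain set of node IDs rather than real gossip infos. It says nothing about why the records go missing, only what "reliably available" would mean.

```go
package main

import (
	"fmt"
	"sort"
)

// gossipView is a hypothetical stand-in for what one node currently sees in
// gossip: the set of node IDs for which it holds a liveness record.
type gossipView map[int]bool

// missingLiveness returns, for each observer, the sorted target node IDs whose
// liveness record is absent from that observer's view. Observers that see
// every target are omitted from the result.
func missingLiveness(views map[int]gossipView, observers, targets []int) map[int][]int {
	missing := make(map[int][]int)
	for _, o := range observers {
		for _, t := range targets {
			// A missing observer yields a nil view, which reads as "not seen".
			if !views[o][t] {
				missing[o] = append(missing[o], t)
			}
		}
		sort.Ints(missing[o])
	}
	return missing
}

func main() {
	// Assumed example data: n1-n6 observe, n7-n9 are the restarted nodes.
	observers := []int{1, 2, 3, 4, 5, 6}
	targets := []int{7, 8, 9}

	views := make(map[int]gossipView)
	for _, o := range observers {
		views[o] = gossipView{7: true, 8: true, 9: true}
	}
	// Simulate the symptom: n3 never learns about n8's liveness record.
	delete(views[3], 8)

	fmt.Println(missingLiveness(views, observers, targets)) // prints map[3:[8]]
}
```

An empty result corresponds to the expected behavior; any entry names an observer and the liveness records it never received.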

Jira issue: CRDB-26066

blathers-crl[bot] commented 1 year ago

cc @cockroachdb/replication

tbg commented 1 year ago

As an additional sanity check, just to make absolutely sure that we reverted the actual culprit, I ran 27 iterations of replicate/wide on b4f3ae3346b (the revert's merge SHA). They all passed.

erikgrinaker commented 1 year ago

Reassigning to KV -- the fix will probably be a bit more involved, and we're considering moving liveness gossip to each individual node anyway.

CC @andrewbaptist