cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

server: freshly created nodes appear as dead in UI; also abort during node startup causes unused+dead node IDs #70015

Open knz opened 3 years ago

knz commented 3 years ago

Summary

The following two situations occur and share the same root cause:

Desired resolution

  1. A newly created node status record should be annotated with a special status "newly created", and subsequently ignored when computing node liveness, UI node reports, etc.

    The status "newly created" should then be removed (and replaced by "live") the first time the node reports liveness successfully.

  2. (lower priority) we should try to find a way to persist a node ID that's been allocated during node startup before the store directory has been initialized, and reuse it when starting up again after a crash.
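
A minimal sketch of item 1, assuming a simplified in-memory status record. The `Status`, `Record`, `OnHeartbeat` and `Reportable` names are hypothetical and are not taken from the CockroachDB code base; the sketch only illustrates how a "newly created" marker could be skipped by reports until the first successful heartbeat.

```go
package main

import "fmt"

// Status is a hypothetical node status; the names are illustrative only.
type Status int

const (
	NewlyCreated Status = iota // record exists, node has never reported liveness
	Live
	Dead
)

// Record is a simplified stand-in for a node status record.
type Record struct {
	NodeID int
	Status Status
}

// OnHeartbeat marks the node live; on the first successful heartbeat this
// replaces NewlyCreated with Live.
func (r *Record) OnHeartbeat() {
	r.Status = Live
}

// Reportable returns the node IDs that liveness computations and UI reports
// should consider, skipping records that are still NewlyCreated.
func Reportable(records []*Record) []int {
	var ids []int
	for _, r := range records {
		if r.Status != NewlyCreated {
			ids = append(ids, r.NodeID)
		}
	}
	return ids
}

func main() {
	records := []*Record{{NodeID: 1, Status: Live}, {NodeID: 2}}
	fmt.Println(Reportable(records)) // node 2 is still NewlyCreated and is skipped
	records[1].OnHeartbeat()
	fmt.Println(Reportable(records)) // node 2 is now included
}
```

With this shape, a node that crashes before its first heartbeat stays "newly created" and never shows up as dead.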

Detail of how the situation occurs

For context, when a new node is added to a cluster, the following happens:

  1. the new node sends a "join" RPC to another pre-existing node
  2. the pre-existing node allocates a node ID for the new node and creates a node status record for it.
  3. the pre-existing node sends the node ID back to the new node
  4. the new node then finalizes its startup, then starts heartbeating its liveness to its status record.
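
The four steps above can be modelled roughly as follows. This is a toy simulation, not the actual join RPC: `Cluster`, `NodeStatus`, `Join`, `Heartbeat` and `NeverHeartbeated` are hypothetical names, and timestamps are plain integers. It shows that a node which aborts between steps 3 and 4 leaves behind an allocated ID whose record was never heartbeated.

```go
package main

import "fmt"

// NodeStatus is a hypothetical stand-in for the node status record created
// in step 2; it is not CockroachDB's actual type.
type NodeStatus struct {
	NodeID        int
	LastHeartbeat int64 // 0 means the node has never heartbeated
}

// Cluster models the pre-existing node's view: an ID allocator plus records.
type Cluster struct {
	nextID   int
	statuses map[int]*NodeStatus
}

func NewCluster() *Cluster {
	return &Cluster{nextID: 1, statuses: make(map[int]*NodeStatus)}
}

// Join models steps 2 and 3: allocate a node ID, create the status record,
// and return the ID to the joining node.
func (c *Cluster) Join() int {
	id := c.nextID
	c.nextID++
	c.statuses[id] = &NodeStatus{NodeID: id}
	return id
}

// Heartbeat models step 4: the new node reports liveness after startup.
func (c *Cluster) Heartbeat(id int, now int64) {
	c.statuses[id].LastHeartbeat = now
}

// NeverHeartbeated lists allocated node IDs whose records were never updated.
func (c *Cluster) NeverHeartbeated() []int {
	var ids []int
	for id := 1; id < c.nextID; id++ {
		if c.statuses[id].LastHeartbeat == 0 {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	c := NewCluster()
	n1 := c.Join()
	c.Heartbeat(n1, 1000) // n1 completes startup
	_ = c.Join()          // a second node aborts before step 4
	fmt.Println(c.NeverHeartbeated()) // prints [2]
}
```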

There are two problems with this:

gz#9577

Jira issue: CRDB-9900

irfansharif commented 3 years ago

In our UI we could stop consulting the node status keys and look at liveness records instead. To distinguish between nodes that were able to get their liveness records installed and nodes that were properly booted up after installing liveness records, we could look at the last heartbeat timestamp. The liveness records are created with an empty timestamp -- we rely on the joining node to heartbeat itself once it's fully loaded. https://github.com/cockroachdb/cockroach/issues/50707

knz commented 3 years ago

ok so "zero timestamp" would explain why they appear as dead: the difference between the zero timestamp and the current time is going to always be greater than the "time until store dead".

at least that checks out.

tbg commented 2 years ago

This just occurred again on a customer deployment.

@erikgrinaker points out: if we didn't show these entries as prominently there may not be an issue.