server: freshly created nodes appear as dead in UI; also abort during node startup causes unused+dead node IDs

knz commented 3 years ago

Summary

The following two situations occur and share the same root cause:

Freshly created nodes that have not yet heartbeaten their liveness show up immediately as "dead" in the UI and other places that report liveness, before they are marked as "live" a while later.

This is a UX surprise because a newly added node should either not show up yet in the UI, or show up as live. Being reported as "dead" is not expected.
Additionally, under certain circumstances (details below), a freshly added node can fail to initialize and crash, yet still acquire a node ID and cause a node descriptor to exist. When this happens, the node allocates a new node ID on its next start. After that, the first node ID that had been allocated appears as a dead node and needs to be decommissioned manually.

This is an operational inconvenience because if there is a crash loop during initialization, it's possible for dozens of node IDs to be allocated and immediately appear as dead, and they all need to be cleaned up manually afterwards.

Desired resolution

A newly created node status record should be annotated with a special status "newly created", and subsequently ignored when computing node liveness, UI node reports, etc.

The status "newly created" should then be removed (and replaced by "live") the first time the node reports liveness successfully.
(lower priority) we should try to find a way to persist a node ID that's been allocated during node startup before the store directory has been initialized, and reuse it when starting up again after a crash.
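
To make the first part of the proposal concrete, here is a minimal sketch of the "newly created" status. All names in it (`NodeStatus`, `StatusRecord`, `ReportableNodes`, etc.) are hypothetical and do not come from the CockroachDB code base; it only illustrates the intended transitions and the filtering, not an actual implementation.

```go
package main

import "fmt"

// NodeStatus is a hypothetical status enum used only for this sketch.
type NodeStatus int

const (
	StatusNewlyCreated NodeStatus = iota // node ID allocated, no successful heartbeat yet
	StatusLive
	StatusDead
)

func (s NodeStatus) String() string {
	switch s {
	case StatusNewlyCreated:
		return "newly created"
	case StatusLive:
		return "live"
	case StatusDead:
		return "dead"
	}
	return "unknown"
}

// StatusRecord is a hypothetical stand-in for a node status record.
type StatusRecord struct {
	NodeID int
	Status NodeStatus
}

// NewStatusRecord models what the pre-existing node would create when it
// allocates a node ID: the record starts out as "newly created".
func NewStatusRecord(id int) *StatusRecord {
	return &StatusRecord{NodeID: id, Status: StatusNewlyCreated}
}

// RecordHeartbeat marks the record as live. The first successful heartbeat
// is what replaces "newly created".
func (r *StatusRecord) RecordHeartbeat() { r.Status = StatusLive }

// ReportableNodes returns the records that liveness computations and UI
// reports should consider, skipping nodes that never heartbeated.
func ReportableNodes(records []*StatusRecord) []*StatusRecord {
	var out []*StatusRecord
	for _, r := range records {
		if r.Status != StatusNewlyCreated {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	n1 := NewStatusRecord(1)
	n1.RecordHeartbeat() // n1 finished startup
	n2 := NewStatusRecord(2) // n2 has only been allocated an ID so far

	for _, r := range ReportableNodes([]*StatusRecord{n1, n2}) {
		fmt.Printf("n%d: %s\n", r.NodeID, r.Status)
	}
}
```

In this sketch n2 is simply omitted from the report until it heartbeats, instead of being listed as dead.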

Detail of how the situation occurs

For context, when a new node is added to a cluster, the following happens:

1. the new node sends a "join" RPC to another pre-existing node
2. the pre-existing node allocates a node ID for the new node and creates a node status record for it.
3. the pre-existing node sends the node ID back to the new node
4. the new node then finalizes its startup, then starts heartbeating its liveness to its status record.

There are two problems with this:

steps 3-4 can last for multiple seconds. During that time, the newly added node will show up as "dead" in the web UI and other places where operators can inspect liveness.

This is surprising.
additionally, a node can crash in step 4, before it finishes initializing. This is possible e.g. when there is clock skew: the clock skew detection kicks in when the node reconnects to the cluster after it gets its node ID, and causes a crash, and this crash occurs before the node has finished writing its initial data files in the store directory (and persisting its newly allocated node ID).

Because the data directory is not ready, when the node starts again, it appears as if the node has not initialized yet, so it starts again at step 1. This results in an unused node ID which will be forever-dead.
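
The crash-loop scenario can be reproduced with a toy model. This is only a sketch under simplifying assumptions: `Cluster`, `Store`, and `Start` are hypothetical names, the "join" RPC is a plain function call, and a crash is simulated by returning early before the node ID is persisted.

```go
package main

import "fmt"

// Cluster is a toy stand-in for the pre-existing node that serves "join" RPCs.
type Cluster struct {
	nextID  int
	records map[int]bool // node ID -> has this node ever heartbeated?
}

func NewCluster() *Cluster {
	return &Cluster{nextID: 1, records: map[int]bool{}}
}

// Join allocates a node ID, creates its status record, and returns the ID.
func (c *Cluster) Join() int {
	id := c.nextID
	c.nextID++
	c.records[id] = false
	return id
}

// Heartbeat records that the node with the given ID reported liveness.
func (c *Cluster) Heartbeat(id int) { c.records[id] = true }

// OrphanedIDs counts node IDs that were allocated but never heartbeated.
func (c *Cluster) OrphanedIDs() int {
	n := 0
	for _, heartbeated := range c.records {
		if !heartbeated {
			n++
		}
	}
	return n
}

// Store is a toy store directory. NodeID stays 0 until initialization completes.
type Store struct{ NodeID int }

// Start runs one startup attempt. When crash is true, the process dies after
// receiving its node ID but before persisting it (e.g. due to clock skew).
func Start(c *Cluster, s *Store, crash bool) {
	id := s.NodeID
	if id == 0 {
		id = c.Join() // store looks uninitialized, so start over at step 1
	}
	if crash {
		return // nothing was written to the store directory
	}
	s.NodeID = id
	c.Heartbeat(id)
}

func main() {
	c, s := NewCluster(), &Store{}
	for i := 0; i < 3; i++ {
		Start(c, s, true) // three crashes during initialization
	}
	Start(c, s, false) // finally a clean start

	fmt.Println("node ID in use:", s.NodeID)
	fmt.Println("orphaned node IDs:", c.OrphanedIDs())
}
```

With three crashed attempts followed by one clean start, the node ends up with ID 4 and IDs 1 through 3 are left behind without ever having heartbeated.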

gz#9577

Jira issue: CRDB-9900

irfansharif commented 3 years ago

In our UI we could stop consulting the node status keys and look at liveness records instead. To distinguish between nodes that were able to get their liveness records installed and nodes that were properly booted up after installing liveness records, we could look at the last heartbeat timestamp. The liveness records are created with an empty timestamp -- we rely on the joining node to heartbeat itself once it's fully loaded. https://github.com/cockroachdb/cockroach/issues/50707

knz commented 3 years ago

ok so "zero timestamp" would explain why they appear as dead: the difference between the zero timestamp and the current time is going to always be greater than the "time until store dead".

at least that checks out.

tbg commented 2 years ago

This just occurred again on a customer deployment.

@erikgrinaker points out: if we didn't show these entries as prominently there may not be an issue.
