Open erikgrinaker opened 10 months ago
An alternative to consider is delaying the gossiping of the store descriptor until the node believes itself to be healthy. The node would be online and receive Raft traffic, but until the store descriptor is gossiped, no other nodes would send it leases.
You could keep the node in suspected state until it's "healthy". Today it's just a timeout, but it could be driven off the actual health metrics of the node.
Nodes that have been offline for extended periods of time may have a large backlog of work to do when rejoining the cluster. Currently, this work happens after the node is back online and receives leases, which can severely impact foreground traffic.
We should consider adding another server mode, e.g. `modeCatchup`, during startup which e.g.:

It should also be possible to bypass this phase, e.g. via an environment variable, to get a node online asap.
https://github.com/cockroachdb/cockroach/blob/47f40bc27c9b7c76f3ed390601f33355eeb7c744/pkg/server/grpc_server.go#L51-L63
Jira issue: CRDB-34528