cockroachdb / cockroach

server: node catchup phase during startup #116229

Open erikgrinaker opened 8 months ago

erikgrinaker commented 8 months ago

Nodes that have been offline for extended periods of time may have a large backlog of work to do when rejoining the cluster. Currently, this work happens after the node is back online and receives leases, which can severely impact foreground traffic.

We should consider adding another server mode, e.g. modeCatchup, during startup which e.g.:

It should also be possible to bypass this phase, e.g. via an environment variable, to get a node online ASAP.

https://github.com/cockroachdb/cockroach/blob/47f40bc27c9b7c76f3ed390601f33355eeb7c744/pkg/server/grpc_server.go#L51-L63

Jira issue: CRDB-34528

erikgrinaker commented 8 months ago

Related to https://github.com/cockroachlabs/support/issues/2758.

andrewbaptist commented 8 months ago

An alternative to consider is delaying the gossiping of the store descriptor until the node believes itself to be healthy. The node would be online and receive Raft traffic but without the store descriptor sent, no other nodes would send it leases.
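A minimal sketch of that alternative, assuming a health predicate and a publish callback exist. `storeDescriptor`, `gossipWhenHealthy`, and both callbacks are hypothetical names for illustration, not the real gossip API.

```go
package main

import "time"

// storeDescriptor is a hypothetical, stripped-down descriptor.
type storeDescriptor struct {
	StoreID int
}

// gossipWhenHealthy polls healthy up to maxAttempts times, sleeping
// interval after each failed check. It calls publish exactly once, as
// soon as the node reports healthy, and returns true. It returns false
// if the node never became healthy, leaving the fallback to the caller.
func gossipWhenHealthy(
	desc storeDescriptor,
	healthy func() bool,
	publish func(storeDescriptor),
	interval time.Duration,
	maxAttempts int,
) bool {
	for i := 0; i < maxAttempts; i++ {
		if healthy() {
			publish(desc)
			return true
		}
		time.Sleep(interval)
	}
	return false
}
```

The node keeps participating in Raft while this loop runs. Only the descriptor that makes it a lease target is held back.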

lunevalex commented 8 months ago

You could keep the node in suspected state until it's "healthy". Today it's just a timeout, but it could be driven off the actual health metrics of the node.