cockroachdb / cockroach

server: node catchup phase during startup #116229

Open erikgrinaker opened 10 months ago

erikgrinaker commented 10 months ago

Nodes that have been offline for extended periods of time may have a large backlog of work to do when rejoining the cluster. Currently, this work happens after the node is back online and receives leases, which can severely impact foreground traffic.

We should consider adding another server mode, e.g. modeCatchup, during startup which e.g.:

It should also be possible to bypass this phase, e.g. via an environment variable, to get a node online ASAP.

https://github.com/cockroachdb/cockroach/blob/47f40bc27c9b7c76f3ed390601f33355eeb7c744/pkg/server/grpc_server.go#L51-L63

Jira issue: CRDB-34528

erikgrinaker commented 10 months ago

Related to https://github.com/cockroachlabs/support/issues/2758.

andrewbaptist commented 10 months ago

An alternative to consider is delaying the gossiping of the store descriptor until the node believes itself to be healthy. The node would be online and receive Raft traffic, but without the store descriptor being gossiped, no other nodes would send it leases.
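
To make the idea concrete, here is a rough sketch of a gate that withholds the descriptor until a health check passes. Everything in it is hypothetical: `storeDescriptor` is a heavily reduced placeholder, and the `healthy` and `publish` hooks stand in for whatever health signal and gossip call the real system would use.

```go
package main

// storeDescriptor is a hypothetical, heavily reduced stand-in for what a
// node advertises about one of its stores.
type storeDescriptor struct {
	StoreID int
}

// descriptorGate withholds the store descriptor until the node first
// reports itself healthy. Once open, the gate stays open.
type descriptorGate struct {
	healthy func() bool           // hypothetical self-health check
	publish func(storeDescriptor) // hypothetical gossip hook
	open    bool
}

// maybePublish publishes d if the gate is open, opening it first if the
// node is now healthy. It reports whether d was published.
func (g *descriptorGate) maybePublish(d storeDescriptor) bool {
	if !g.open {
		if !g.healthy() {
			return false // still catching up: stay invisible to lease transfers
		}
		g.open = true
	}
	g.publish(d)
	return true
}
```

The gate is deliberately one-way in this sketch: a later dip in health does not retract the descriptor, since that is a separate policy question from delaying the first announcement.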

lunevalex commented 10 months ago

You could keep the node in the suspected state until it's "healthy". Today it's just a timeout, but it could be driven off the actual health metrics of the node.