SocketCluster / socketcluster

Highly scalable realtime pub/sub and RPC framework
https://socketcluster.io
MIT License

RFE: allow running multiple scc-state servers that vote the leader automatically #596

Open mikkorantalainen opened 2 months ago

mikkorantalainen commented 2 months ago

As documented at https://github.com/SocketCluster/socketcluster/blob/master/scc-guide.md#notes the current design requires that exactly one scc-state server is running for the cluster.

I'm asking for an improvement where I could run one scc-state server per machine in the cluster, and SocketCluster would automatically select one of them as the leader; the other scc-state servers would wait as hot standbys, ready to take over if the hardware running the leader scc-state failed.

If I've understood its behavior correctly, this would remove the only remaining single point of failure in SocketCluster.

I don't think there would need to be any fancy way to vote for the leader. Just make the first server to start the leader, and every subsequent server would join a queue to become the next leader. Maybe also have a config option specifying the minimum number of servers in the queue before one is designated the leader, to avoid a split-brain situation? (For example, for a 3-server cluster, one would specify a minimum of 2 servers before an scc-state leader is elected, and in that case the first server in the queue would become the leader.)
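A rough sketch of what I have in mind (none of these names exist in scc-state today; they're purely illustrative):

```ts
// Illustrative sketch of queue-based leader election with a minimum quorum.
// The first instance to register becomes leader, later ones queue up as
// standbys, and no leader is designated until enough instances have joined.

interface StateServer {
  id: string;
  joinedAt: number; // registration timestamp; earliest wins
}

class LeaderQueue {
  private queue: StateServer[] = [];

  constructor(private minQuorum: number) {}

  join(server: StateServer): void {
    this.queue.push(server);
    this.queue.sort((a, b) => a.joinedAt - b.joinedAt);
  }

  leave(serverId: string): void {
    // When the leader leaves, the next server in the queue takes over.
    this.queue = this.queue.filter((s) => s.id !== serverId);
  }

  // Returns null until the quorum is reached, to avoid a split-brain
  // situation where two partitions each elect their own leader.
  leader(): StateServer | null {
    return this.queue.length >= this.minQuorum ? this.queue[0] : null;
  }
}

// Example: a 3-server cluster requiring at least 2 registered instances.
const registry = new LeaderQueue(2);
registry.join({ id: 'state-a', joinedAt: 1 });
console.log(registry.leader()); // null: quorum not reached yet
registry.join({ id: 'state-b', joinedAt: 2 });
console.log(registry.leader()); // { id: 'state-a', ... }: first to start leads
```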

jondubois commented 2 months ago

Yes, that could be achieved. So long as all the scc-worker and scc-broker instances in the cluster agree about which scc-state instance is the leader, it would work.

That said, the last time I tested SCC with Kubernetes (K8s) and intentionally (and repeatedly) crashed the scc-state instance, the single scc-state instance didn't seem to be a concern as a single point of failure, because K8s would quickly (within seconds) respawn a new scc-state instance on a different host in the cluster. The new scc-state instance would quickly regain full awareness of the cluster state once all the scc-broker and scc-worker instances had (re)connected to it. This works in K8s because the scc-broker and scc-worker instances connect to the scc-state instance using its service name, not its direct host IP address. So wherever the scc-state instance happens to be respawned (which can be on any host in the cluster), the cluster can recover automatically within seconds.
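To make the mechanics concrete, here is a minimal reconnect sketch (not the actual scc-worker code; the env var names follow the SCC guide's conventions, and the defaults are assumptions). Because the target is a stable service name rather than a pod IP, a plain retry loop finds the respawned instance wherever it lands:

```ts
// Minimal sketch of service-name-based reconnection in K8s.
// 'scc-state' and 7777 are assumed defaults, not verified config.
import { connect } from 'net';

const host = process.env.SCC_STATE_SERVER_HOST ?? 'scc-state'; // K8s service name, not a pod IP
const port = Number(process.env.SCC_STATE_SERVER_PORT ?? 7777);

function connectToStateServer(): void {
  const socket = connect(port, host);
  socket.on('connect', () => {
    // On (re)connect the instance would re-announce itself, letting the
    // freshly respawned scc-state rebuild its view of the cluster.
    console.log(`connected to scc-state at ${host}:${port}`);
  });
  socket.on('error', () => socket.destroy());
  socket.on('close', () => {
    // The service name stays valid even when the pod moves to another
    // host, so retrying the same address is enough to recover.
    setTimeout(connectToStateServer, 1000);
  });
}

connectToStateServer();
```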

The current approach already works a bit like switching leaders; the difference is that the new scc-state instance is spawned on demand after the previous one fails (as opposed to being on standby), so the delay is slightly longer. Note, however, that when an scc-state instance goes down, there is an unavoidable delay while all the scc-broker and scc-worker instances in the cluster detect the crash/disconnection, so a leader-switching mechanism would not fully remove that delay either.

Anyway, SCC can continue to operate normally (without dropping any messages) even while scc-state is down. The only role of scc-state is cluster discovery, so it only needs to be running during the brief moments when the cluster is being launched or resized. The main scenario of concern in terms of a single point of failure involves multiple simultaneous failures: scc-state crashes and, within the few seconds before it respawns on a new host, an scc-broker instance also crashes. That's the only scenario I can think of which could result in some messages being lost for a brief period. Even in this case, the cluster will heal itself within seconds without intervention.

It's important to note that this is typically acceptable for SC, as the focus is cheap/low-overhead publish and subscribe; it's not a message queue, so it does not provide delivery guarantees out of the box. If you want those, you need to implement them yourself at the application layer: add IDs/UUIDs to your message objects, publish receipt acknowledgments when messages reach their target, and re-publish/retry messages which were not acknowledged. For this use case, so long as the cluster can heal within a few seconds before the message is retried/re-published (or if you retry enough times), it should be possible to achieve whatever delivery guarantees you need without worrying about rare network-failure scenarios.
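As a rough sketch of that application-layer pattern (assuming the socketcluster-client async-iterator API; the 'orders' and 'acks' channel names and the retry interval are made up for the example):

```ts
// Sketch of app-level delivery guarantees on top of SC pub/sub.
import * as socketClusterClient from 'socketcluster-client';
import { randomUUID } from 'crypto';

const socket = socketClusterClient.create({ hostname: 'localhost', port: 8000 });
const pending = new Map<string, NodeJS.Timeout>();
const RETRY_MS = 5000; // longer than the few seconds the cluster needs to heal

// Sender: attach a UUID and keep re-publishing until acknowledged.
function publishWithRetry(channel: string, payload: unknown): void {
  const id = randomUUID();
  const send = () => socket.transmitPublish(channel, { id, payload });
  pending.set(id, setInterval(send, RETRY_MS));
  send();
}

// Sender: stop retrying once the target acknowledges receipt.
(async () => {
  for await (const ack of socket.subscribe('acks')) {
    const timer = pending.get(ack.id);
    if (timer !== undefined) {
      clearInterval(timer);
      pending.delete(ack.id);
    }
  }
})();

// Receiver: process each message idempotently (retries can duplicate it)
// and publish a receipt acknowledgment carrying the message ID.
(async () => {
  const seen = new Set<string>();
  for await (const msg of socket.subscribe('orders')) {
    if (!seen.has(msg.id)) {
      seen.add(msg.id);
      // ...process msg.payload...
    }
    socket.transmitPublish('acks', { id: msg.id });
  }
})();

publishWithRetry('orders', { text: 'hello' });
```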

mikkorantalainen commented 2 months ago

I was thinking about running scc-state on bare metal without Kubernetes. I see that adding Kubernetes to the mix can provide pretty similar behavior even without SocketCluster being able to handle a crashing scc-state by itself.

jondubois commented 2 months ago

@mikkorantalainen Feel free to fork scc-state (https://github.com/SocketCluster/scc-state) and make the changes you suggested. You're welcome to customize it as you like. The approach you suggested has clear advantages when running on bare metal; the trade-off is just added complexity. If you want to share it as open source, I'd be happy to mention it as a possible alternative to the default scc-state.