MultiPaper / MultiPaper

Multi-server, single-world papermc implementation
https://multipaper.io/
GNU General Public License v3.0

[FR] Redundant master #288

Open ChipWolf opened 1 year ago

ChipWolf commented 1 year ago

Using existing solutions such as the gossip protocol, it'd be possible to make the Master redundant. I'd be keen on this being implemented before considering MultiPaper in production.

ham1255 commented 1 year ago

~~I think this will do more harm than good and could potentially cause a complicated setup. Redundant masters would have to keep the world files in sync, which means more bandwidth as more masters are added, which can slow things down.~~

I have never heard of the Master crashing for anybody since I've known this project.

ChipWolf commented 1 year ago

> I think this will do more harm than good and could potentially cause a complicated setup. Redundant masters would have to keep the world files in sync, which means more bandwidth as more masters are added, which can slow things down.
>
> I have never heard of the Master crashing for anybody since I've known this project.

Multiple redundant master replicas that have the ability to elect a leader via the gossip protocol would allow for fault tolerance and failover to a functioning master.

The overheads are minimal, additional replicas would be optional, and the file sync can be managed by existing solutions (e.g. replica sets in k8s and volume replication, which are already present in modern environments and which MultiPaper doesn't need to worry about). The only thing that would need to be implemented in MultiPaper is the capability for masters to communicate with each other and elect a leader.

Only the leader replica would be in use; if it dies, a re-election fails over to a 'hot spare'.

This is essential in a clustered environment, e.g. Minecraft on Kubernetes, which many high-end networks that would need a solution like MultiPaper employ in production.
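To make the hot-spare idea above concrete, here is a minimal sketch of the failover step only. It is not a gossip or Raft implementation: the `Replica` class, the `elect_leader` function, and the "lowest id wins" rule are all hypothetical stand-ins, and a real election would also need failure detection and agreement between replicas so that two of them never act as leader at the same time.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One master replica. Hypothetical type, for illustration only."""
    replica_id: str
    alive: bool = True


def elect_leader(replicas):
    """Return the live replica with the lowest id, or None if none are alive.

    Deterministic stand-in for a real election protocol.
    """
    live = [r for r in replicas if r.alive]
    return min(live, key=lambda r: r.replica_id, default=None)


replicas = [Replica("master-1"), Replica("master-2"), Replica("master-3")]

leader = elect_leader(replicas)      # one replica leads, the rest are hot spares
leader.alive = False                 # simulate the leader dying
new_leader = elect_leader(replicas)  # re-election promotes a hot spare
```

Only the elected replica would serve traffic; the others stay idle until a re-election picks one of them.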

xymb-endcrystalme commented 1 year ago

> Multiple redundant master replicas that have the ability to elect a leader via the gossip protocol would allow for fault tolerance and failover to a functioning master.

The problem is that you still have a single point of failure - the world files. In 99% of configurations the world files will be on the same machine as Master. So if Master crashes, you can't just switch to a second Master, unless you're doing a RAID 1 over the network.

IMO it's best to just keep the Master stable. As @ham1255 said - the master is only 3k lines of code, there isn't much space to introduce a bug. I've never seen it crash.

ham1255 commented 1 year ago

> > Multiple redundant master replicas that have the ability to elect a leader via the gossip protocol would allow for fault tolerance and failover to a functioning master.
>
> The problem is that you still have a single point of failure - the world files. In 99% of configurations the world files will be on the same machine as Master. So if Master crashes, you can't just switch to a second Master, unless you're doing a RAID 1 over the network.
>
> IMO it's best to just keep the Master stable. As @ham1255 said - the master is only 3k lines of code, there isn't much space to introduce a bug. I've never seen it crash.

If the Master crashes, recovery is fast: just launch a new instance and things will be okay.

xymb-endcrystalme commented 1 year ago

You can just put Master in a while loop so it restarts on crash. MultiPaper servers will wait those 5 seconds patiently.

Not that Master crashes often... Or ever.
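A minimal sketch of the "restart on crash" loop described above, written as a small Python supervisor. The function name, the one-second default delay, and the choice to stop after a clean exit (exit code 0) are assumptions made for illustration; the launch command is left to the caller because the actual Master jar name and flags are not given in this thread.

```python
import subprocess
import time


def run_with_restart(cmd, delay_seconds=1.0, max_restarts=None):
    """Run cmd and start it again whenever it exits with a non-zero code.

    Returns the number of restarts performed. With max_restarts=None the
    loop keeps restarting for as long as the process keeps crashing.
    """
    restarts = 0
    while True:
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0:
            return restarts  # clean shutdown: do not restart
        if max_restarts is not None and restarts >= max_restarts:
            return restarts  # give up after the configured number of restarts
        restarts += 1
        time.sleep(delay_seconds)  # brief pause before relaunching
```

Calling `run_with_restart([...])` with whatever command normally launches the Master would keep relaunching it after a crash, while the MultiPaper servers wait for it to come back.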

ChipWolf commented 1 year ago

> > Multiple redundant master replicas that have the ability to elect a leader via the gossip protocol would allow for fault tolerance and failover to a functioning master.
>
> The problem is that you still have a single point of failure - the world files. In 99% of configurations the world files will be on the same machine as Master. So if Master crashes, you can't just switch to a second Master, unless you're doing a RAID 1 over the network.
>
> IMO it's best to just keep the Master stable. As @ham1255 said - the master is only 3k lines of code, there isn't much space to introduce a bug. I've never seen it crash.

Kubernetes has replicated volume providers; our networks use Rook Ceph in some environments and Longhorn in others - these providers handle multiple node failures & catastrophic cluster failures when combined with Kasten or other DR tooling.

The data issue you're discussing is not a problem in modern production-ready environments and is outside the scope of MultiPaper and this feature request.

Outside of Kubernetes environments, even basic scripts that run backups and pull data from NFS would still be totally operable if MultiPaper added this feature.

ChipWolf commented 1 year ago

> You can just put Master in a while loop so it restarts on crash. MultiPaper servers will wait those 5 seconds patiently.
>
> Not that Master crashes often... Or ever.

The Master itself may not ever crash, but you always have to expect failures in any system, especially when there are external influences.

In the most basic case, nodes might roll for updates, or you may be using "spot instances" in cloud environments, which are [unpredictably] recycled for cost reduction.

Assuming the software is immune to failure isn't the path to building a resilient solution. As I say, the primary consumers of MultiPaper are high load/high player count networks that have resiliency and redundancy in mind.

ChipWolf commented 1 year ago

> > > Multiple redundant master replicas that have the ability to elect a leader via the gossip protocol would allow for fault tolerance and failover to a functioning master.
> >
> > The problem is that you still have a single point of failure - the world files. In 99% of configurations the world files will be on the same machine as Master. So if Master crashes, you can't just switch to a second Master, unless you're doing a RAID 1 over the network. IMO it's best to just keep the Master stable. As @ham1255 said - the master is only 3k lines of code, there isn't much space to introduce a bug. I've never seen it crash.
>
> If the Master crashes, recovery is fast: just launch a new instance and things will be okay.

Agreed, to a point; however, if we could have "hot spares", the recovery time would be reduced.

Also, while the implementation would admittedly be more complex, if MultiPaper allowed for multiple "active" masters that share responsibility, the blast radius of affected users in the event of a failure could be significantly reduced.
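As a rough illustration of the blast-radius argument, the sketch below spreads world regions across several active masters and measures how much of the world one failed master would take with it. Everything here is an assumption: MultiPaper has no such sharding scheme in this thread, the CRC32-based assignment is just one simple way to split ownership, and regions are used as a stand-in for affected users.

```python
import zlib


def owner_of(region, masters):
    """Map a region coordinate (x, z) to one master by hashing it.

    Hypothetical assignment scheme, chosen only because it is deterministic.
    """
    key = f"{region[0]},{region[1]}".encode()
    return masters[zlib.crc32(key) % len(masters)]


def affected_fraction(regions, masters, failed):
    """Fraction of regions whose owning master is the failed one."""
    hit = sum(1 for r in regions if owner_of(r, masters) == failed)
    return hit / len(regions)


# A 32 x 32 grid of regions, used as sample data.
regions = [(x, z) for x in range(32) for z in range(32)]

single = ["master-1"]
several = ["master-1", "master-2", "master-3", "master-4"]

# With one master, its failure touches every region.
all_affected = affected_fraction(regions, single, "master-1")

# With several active masters, each failure touches only that master's share.
shares = {m: affected_fraction(regions, several, m) for m in several}
```

With a single master the affected fraction is always 1.0; with several, the shares add up to 1.0, so any one failure can only affect the part owned by that master.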

ham1255 commented 1 year ago

I currently have a WIP branch for this.

faqs

Raft protocol

servers stuff / config

# example config
master-replication:
  enabled: false
  masters:
    - id: master-1
      host: "localhost:1234"
    - id: master-2
      host: "localhost:1236"
    - id: master-3
      host: "localhost:1232"

This branch was dropped.
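For context on why the example config lists three masters: Raft only makes progress while a majority of the configured members can reach each other, so three is the smallest group that survives the loss of one member. The helper names below are hypothetical and the sketch only does the arithmetic; it is not taken from the dropped branch.

```python
def quorum_size(num_masters: int) -> int:
    """Smallest majority of num_masters voting members."""
    return num_masters // 2 + 1


def tolerated_failures(num_masters: int) -> int:
    """How many masters can be down while a majority can still form."""
    return num_masters - quorum_size(num_masters)


# Master ids as they appear in the example config.
masters = ["master-1", "master-2", "master-3"]

needed = quorum_size(len(masters))            # members required to agree
can_lose = tolerated_failures(len(masters))   # members that may fail
```

Note that two masters tolerate no failures at all under this rule, which is why an odd count such as three or five is the usual choice.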

xymb-endcrystalme commented 1 year ago

Does it make any sense to work on it right now?

Seems to me like additional redundant code that needs to be maintained and debugged. While there are a lot more pressing matters.

In a year sure - I agree. But now? While the software is still in Alpha?

ChipWolf commented 1 year ago

> Does it make any sense to work on it right now?
>
> Seems to me like additional redundant code that needs to be maintained and debugged. While there are a lot more pressing matters.
>
> In a year sure - I agree. But now? While the software is still in Alpha?

Probably worth keeping a vision/roadmap RFC docs folder that can be PR'd into