Open ChipWolf opened 1 year ago
~~I think this will do more harm than good and could complicate the setup. For example, redundant masters would have to keep the world files in sync, which means more bandwidth as more masters are added, which can slow things down.~~
I have never heard of anybody's master crashing in all the time I have known this project.
> I think this will do more harm than good and could complicate the setup. For example, redundant masters would have to keep the world files in sync, which means more bandwidth as more masters are added, which can slow things down.
> I have never heard of anybody's master crashing in all the time I have known this project.
Multiple redundant master replicas that have the ability to elect a leader via the gossip protocol would allow for fault tolerance and failover to a functioning master.
The overheads are minimal and additional replicas would be optional. The file sync can be managed by existing solutions (e.g. replica sets in k8s and volume replication, which multi-paper doesn't need to worry about and which are already present in modern environments). The only thing that would need to be implemented in multi-paper is the capability for masters to communicate with each other and elect a leader.
Only the leader replica would be in use; if it dies, a re-election fails over to a 'hot spare'.
This is essential in a clustered environment such as Minecraft on Kubernetes, which many of the high-end networks that would need a solution like Multipaper employ in production.
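To make the hot-spare idea concrete, here is a minimal Python sketch of the failover behaviour described above. It is a toy model, not the gossip protocol and not anything from the Multipaper codebase: the `Replica` and `Cluster` names are hypothetical, liveness is a flag flipped by hand, and a deterministic "lowest id among live replicas" rule stands in for a real election.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    """One master replica; `alive` is set by hand here, not by a real health check."""
    replica_id: str
    alive: bool = True


@dataclass
class Cluster:
    replicas: list
    leader_id: Optional[str] = None

    def elect(self) -> Optional[str]:
        """Pick the live replica with the smallest id (stand-in for a real election)."""
        live = sorted(r.replica_id for r in self.replicas if r.alive)
        self.leader_id = live[0] if live else None
        return self.leader_id

    def route(self) -> str:
        """Return the replica that should serve traffic, re-electing if the leader died."""
        current = next((r for r in self.replicas if r.replica_id == self.leader_id), None)
        if current is None or not current.alive:
            self.elect()
        if self.leader_id is None:
            raise RuntimeError("no live master replica")
        return self.leader_id


cluster = Cluster([Replica("master-1"), Replica("master-2"), Replica("master-3")])
print(cluster.route())             # master-1 serves while it is alive
cluster.replicas[0].alive = False  # simulate the leader dying
print(cluster.route())             # master-2 takes over as the hot spare
```

Only one replica is ever returned by `route()`, which mirrors the "only the leader replica would be in use" constraint; everything about how replicas actually detect each other's death is left out.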
> Multiple redundant master replicas that have the ability to elect a leader via the gossip protocol would allow for fault tolerance and failover to a functioning master.
The problem is that you still have a single point of failure - the world files. In 99% of configurations the world files will be on the same machine as Master. So if Master crashes, you can't just switch to a second Master, unless you're doing a RAID 1 over the network.
IMO it's best to just keep the Master stable. As @ham1255 said - the master is only 3k lines of code, there isn't much space to introduce a bug. I've never seen it crash.
> Multiple redundant master replicas that have the ability to elect a leader via the gossip protocol would allow for fault tolerance and failover to a functioning master.
> The problem is that you still have a single point of failure - the world files. In 99% of configurations the world files will be on the same machine as Master. So if Master crashes, you can't just switch to a second Master, unless you're doing a RAID 1 over the network.
> IMO it's best to just keep the Master stable. As @ham1255 said - the master is only 3k lines of code, there isn't much space to introduce a bug. I've never seen it crash.
If the master just crashed, recovery is fast: just launch a new instance and things will be okay.
You can just put Master in a while loop so it restarts on crash. MultiPaper servers will wait those 5 seconds patiently.
Not that Master crashes often... Or ever.
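A rough Python equivalent of that restart loop, for illustration. The `supervise` function is a hypothetical helper, not part of MultiPaper; the 5-second delay is taken from the comment above, and treating exit code 0 as an intentional shutdown is my assumption. The `max_restarts` parameter exists only so the loop can end in a demo.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=None, delay=5.0):
    """Run cmd and relaunch it whenever it exits with a non-zero code.

    Returns the total number of launches. max_restarts=None means
    restart forever, which is what a plain while loop would do.
    """
    launches = 0
    while True:
        launches += 1
        exit_code = subprocess.call(cmd)
        if exit_code == 0:
            return launches  # assumed to be a clean, intentional shutdown
        if max_restarts is not None and launches > max_restarts:
            return launches  # give up after the allowed number of restarts
        time.sleep(delay)    # brief pause before relaunching


# Example (the jar name is a placeholder, adjust to your setup):
# supervise(["java", "-jar", "multipaper-master.jar"])
```

This only covers the process crashing on a machine that stays up; it does nothing if the host itself goes away.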
> Multiple redundant master replicas that have the ability to elect a leader via the gossip protocol would allow for fault tolerance and failover to a functioning master.
> The problem is that you still have a single point of failure - the world files. In 99% of configurations the world files will be on the same machine as Master. So if Master crashes, you can't just switch to a second Master, unless you're doing a RAID 1 over the network.
> IMO it's best to just keep the Master stable. As @ham1255 said - the master is only 3k lines of code, there isn't much space to introduce a bug. I've never seen it crash.
Kubernetes has replicated volume providers; our networks use Rook Ceph in some environments and Longhorn in others - these providers handle multiple node failures & catastrophic cluster failures when combined with Kasten or other DR tooling.
The data issue you're discussing is not a problem in modern production-ready environments and is outside the scope of multipaper and this feature request.
Outside of Kubernetes environments, even basic scripts that run backups and pull data from NFS would still be totally operable if Multipaper added this feature.
> You can just put Master in a while loop so it restarts on crash. MultiPaper servers will wait those 5 seconds patiently.
> Not that Master crashes often... Or ever.
The Master itself may never crash, but you always have to expect failures in any system, especially when there are external influences.
In the most basic case, nodes might roll for updates, or you may be using "spot instances" in cloud environments, which are [unpredictably] recycled for cost reduction.
Assuming the software is immune to failure isn't the path to building a resilient solution. As I say, the primary consumers of multipaper are high load/high player count networks that have resiliency and redundancy in mind.
> Multiple redundant master replicas that have the ability to elect a leader via the gossip protocol would allow for fault tolerance and failover to a functioning master.
> The problem is that you still have a single point of failure - the world files. In 99% of configurations the world files will be on the same machine as Master. So if Master crashes, you can't just switch to a second Master, unless you're doing a RAID 1 over the network. IMO it's best to just keep the Master stable. As @ham1255 said - the master is only 3k lines of code, there isn't much space to introduce a bug. I've never seen it crash.
> If the master just crashed, recovery is fast: just launch a new instance and things will be okay.
Agree to a point; however if we could have "hot spares" the recovery time would be reduced.
Also, while I acknowledge this option would be more complex to implement, if Multipaper allowed multiple "active" masters to share responsibility, the blast radius of affected users could be significantly reduced in the event of failure.
I currently have a WIP branch for this.
I have chosen Raft, as it seems to do what we want.
Apache has already made a working implementation for us: Apache Ratis.
When a master starts, it checks whether its peers are online; if none are, it marks itself the leader and then starts the master.
But when a new master is started, it stays in a waiting state where it won't start the master functions until it is promoted to leader, so servers won't connect to split masters.
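The startup rule described above can be sketched as a tiny state machine. This is a plain-Python illustration under my own naming (`MasterNode`, `State`, `accept_connection` are all hypothetical); it does not use or imitate the Apache Ratis API, and peer discovery is reduced to a list passed in by the caller.

```python
from enum import Enum


class State(Enum):
    WAITING = "waiting"
    LEADER = "leader"


class MasterNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.state = State.WAITING
        self.serving = False  # master functions stay off until promoted

    def start(self, online_peers):
        """Become leader only if no peer is reachable; otherwise keep waiting."""
        if not online_peers:
            self.promote()

    def promote(self):
        """Called on startup with no peers, or later when elected leader."""
        self.state = State.LEADER
        self.serving = True

    def accept_connection(self):
        """Servers are only accepted by the node currently serving as leader."""
        return self.serving


first = MasterNode("master-1")
first.start(online_peers=[])             # nobody else is up: becomes leader
second = MasterNode("master-2")
second.start(online_peers=["master-1"])  # a peer exists: stays in waiting state
print(first.state, second.state)         # State.LEADER State.WAITING
```

Because a waiting node refuses connections, servers can only ever attach to the one promoted master, which is the point of the waiting state.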
multipaper.yml
# example config
master-replication:
  enabled: false
  masters:
    - id: master-1
      host: "localhost:1234"
    - id: master-2
      host: "localhost:1236"
    - id: master-3
      host: "localhost:1232"
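As a side note on why the example lists three masters: Raft needs a majority of the configured peers to agree, so the number of masters determines how many failures the group can survive. A short Python sketch of that arithmetic, with the config mirrored as a plain list (the helper names are mine, not from the branch):

```python
# The masters from the example config, copied by hand into Python.
masters = [
    {"id": "master-1", "host": "localhost:1234"},
    {"id": "master-2", "host": "localhost:1236"},
    {"id": "master-3", "host": "localhost:1232"},
]


def quorum_size(n):
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many nodes can be lost while a majority is still reachable."""
    return n - quorum_size(n)


def check_unique(entries):
    """Reject configs that reuse an id or a host:port pair."""
    ids = [e["id"] for e in entries]
    hosts = [e["host"] for e in entries]
    return len(set(ids)) == len(ids) and len(set(hosts)) == len(hosts)


n = len(masters)
print(quorum_size(n), tolerated_failures(n), check_unique(masters))  # 2 1 True
```

With three masters one can fail; going to four still tolerates only one failure, which is why odd group sizes are the usual choice.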
This branch was dropped.
Does it make any sense to work on it right now?
Seems to me like additional redundant code that needs to be maintained and debugged. While there are a lot more pressing matters.
In a year sure - I agree. But now? While the software is still in Alpha?
> Does it make any sense to work on it right now?
> Seems to me like additional redundant code that needs to be maintained and debugged. While there are a lot more pressing matters.
> In a year sure - I agree. But now? While the software is still in Alpha?
Probably worth keeping a vision/roadmap RFC docs folder that can be PR'd into
Using existing solutions such as the gossip protocol, it'd be possible to make the Master redundant. I'd be keen on this being implemented before considering multipaper in production.