Closed metcalf closed 8 years ago
Having gone and written this, I now think the right approach would be to have a single "command" goroutine managing the booting/killing of all child processes. Separate SlaveNode goroutines would exist only to wait on sockets and publish messages onto a channel for the command goroutine. Given that this diff seems to solve and immediate problem we're having, I'm still inclined to land it and save the larger refactor for another day (month?).
r? @ptarjan
The meat is in 3ad3817. The last two commits (7d7504b and b9afd75) are somewhat optional, tell me if you think they should stay in.
lgtm Yes, you should keep the optional parts, they make it better.
After testing this for about a week with a few developers at Stripe, this appears to be much more stable. I'm going to land this, cut an internal release and distribute to the whole company. I'll follow shortly with a public release when we resolve https://github.com/burke/zeus/issues/570.
We've experienced frequent problems with Zeus nodes getting hung in an Unbooted state. I believe I traced it to a race condition that occurs when a parent reboots while a child is locked booting a command. Before tracking that down, I noticed that there was a decent bit of simplification that could be applied to the state machine that should result in better behavior overall and get rid of this particular bug. See individual commit messages for more details.
Given the subtlety of the concurrent operations here, I intend to roll this out to a handful of developers this week and get them to smoke test it before merging anything.