burke / zeus

Boot any rails app in under a second.
MIT License
3.33k stars 231 forks source link

Simplify and remove some races from SlaveNode state machine #565

Closed metcalf closed 8 years ago

metcalf commented 8 years ago

We've experienced frequent problems with Zeus nodes getting hung in an Unbooted state. I believe I traced it to a race condition that occurs when a parent reboots while a child is locked booting a command. Before tracking that down, I noticed that there was a decent bit of simplification that could be applied to the state machine that should result in better behavior overall and get rid of this particular bug. See individual commit messages for more details.

Given the subtlety of the concurrent operations here, I intend to roll this out to a handful of developers this week and get them to smoke test it before merging anything.

metcalf commented 8 years ago

Having gone and written this, I now think the right approach would be to have a single "command" goroutine managing the booting/killing of all child processes. Separate SlaveNode goroutines would exist only to wait on sockets and publish messages onto a channel for the command goroutine. Given that this diff seems to solve and immediate problem we're having, I'm still inclined to land it and save the larger refactor for another day (month?).

metcalf commented 8 years ago

r? @ptarjan

The meat is in 3ad3817. The last two commits (7d7504b and b9afd75) are somewhat optional, tell me if you think they should stay in.

ptarjan commented 8 years ago

lgtm Yes, you should keep the optional parts, they make it better.

metcalf commented 8 years ago

After testing this for about a week with a few developers at Stripe, this appears to be much more stable. I'm going to land this, cut an internal release and distribute to the whole company. I'll follow shortly with a public release when we resolve https://github.com/burke/zeus/issues/570.