BW2 DR Failure Crashes Spawnpoint

SoftwareDefinedBuildings / spawnpoint

Deployment of Distributed, Managed Containers via Bosswave

GNU General Public License v2.0

BW2 DR Failure Crashes Spawnpoint #22

Open jhkolb opened 7 years ago

jhkolb commented 7 years ago

When the Bosswave DR for its namespace fails, a spawnpoint daemon also fails.

We need to ensure that spawnd prints some informative warnings when this occurs, but continues operation.

jhkolb commented 7 years ago

After looking into this in more detail, there are deeper problems. If the DR fails, then reestablishing spawnd's active subscriptions would be very tricky.

jhkolb commented 7 years ago

I've tried to make spawnd more resilient to Bosswave failures now, which also involved some changes to bw2bind.

DR failures should be handled better. We still have the issue of the (usually local) agent failing as well, but some of this should be handled by restarting through systemd and spawnd's efforts to cleanly recover old state.

Let's see how spawnd behaves in some real deployments for a while before closing this.

immesys commented 7 years ago

You should also try testing netsplits. I find the best way is to use iptables to drop all packets to/from the DR IP. It's distinct from a DR failure in that the software doesn't get the TCP RST, so it can behave quite differently.

jhkolb commented 7 years ago

Cool, thanks for the advice! Yeah, I'll add that to the queue of stuff to do.