Open jhkolb opened 7 years ago
After looking into this in more detail, there are deeper problems. If the DR fails, then reestablishing `spawnd`'s active subscriptions would be very tricky.
I've tried to make `spawnd` more resilient to Bosswave failures now, which also involved some changes to `bw2bind`.
DR failures should be handled better. We still have the issue of the (usually local) agent failing as well, but some of this should be handled by restarting through `systemd` and `spawnd`'s efforts to cleanly recover old state.
Let's see how `spawnd` behaves in some real deployments for a while before closing this.
You should also try testing netsplits. I find the best way is to use iptables to drop all packets to/from the DR IP. It's distinct from DR failure in that the software doesn't get the TCP RST, so it can behave quite differently.
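For reference, here is a rough sketch of what that netsplit test could look like when driven from Go. It only builds and runs plain `iptables` DROP rules on the `INPUT` and `OUTPUT` chains. The helper names `netsplitRules` and `applyRules` are made up for illustration and are not part of `spawnd` or `bw2bind`. It assumes an IPv4 DR address and root privileges.

```go
package main

import (
	"fmt"
	"net"
	"os/exec"
)

// netsplitRules is a hypothetical helper. It returns the iptables argument
// lists that drop all packets from and to ip. Pass "-A" to append the rules
// (start the netsplit) or "-D" to delete them (heal the netsplit).
func netsplitRules(action, ip string) ([][]string, error) {
	if action != "-A" && action != "-D" {
		return nil, fmt.Errorf("unsupported action %q", action)
	}
	parsed := net.ParseIP(ip)
	if parsed == nil || parsed.To4() == nil {
		// iptables only handles IPv4; IPv6 would need ip6tables.
		return nil, fmt.Errorf("not an IPv4 address: %q", ip)
	}
	return [][]string{
		{action, "INPUT", "-s", ip, "-j", "DROP"},  // packets from the DR
		{action, "OUTPUT", "-d", ip, "-j", "DROP"}, // packets to the DR
	}, nil
}

// applyRules runs iptables once per rule. It must run as root.
func applyRules(rules [][]string) error {
	for _, args := range rules {
		out, err := exec.Command("iptables", args...).CombinedOutput()
		if err != nil {
			return fmt.Errorf("iptables %v: %v: %s", args, err, out)
		}
	}
	return nil
}
```

Calling `netsplitRules("-A", drIP)` followed by `applyRules` starts the split, and repeating the call with `"-D"` removes the same rules. Because the packets are silently dropped rather than rejected, the client never sees a TCP RST, which is the case described above.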
Cool, thanks for the advice! Yeah, I'll add that to the queue of stuff to do.
When the Bosswave DR for its namespace fails, a spawnpoint daemon also fails.
We need to ensure that `spawnd` prints some informative warnings when this occurs, but continues operation.