Closed: kaufers closed this issue 6 years ago
Without any additional changes to the enrollment controller... can we run two instances of the enrollment controller, one wired as `SWARM-GRP-READY` => `SWARM-INST-ALL` and the other as `SWARM-GRP-READY` => `VM-INST-ALL`? Not worrying about excessive polling for a second, and supposing that we add a removal policy, wouldn't these two combinations deal with all the cases?
If this is workable, we can avoid excessive polling by adding a caching instance plugin similar to how TF was implemented... this can then only query at X second intervals and any queries between samples will just use cached values.
Then we need to handle timeouts. We can add a policy to the enrollment controller... something that says `RemoveAfterAdditionalAbsences=N`, where `N` defaults to 0, meaning that a `Destroy` is issued on the first absence (0 additional times). An `N > 0` implies a tolerance of `N * polling_interval` from the first absence of an entry. The implementation is a bit tricky since we need to reset a counter, keyed by the instance entry, as soon as the entry comes back in the group's entries... but it seems pretty mechanical and generic.
Am I missing anything?
The problem is that the different scenarios are not unique. For example, with `SWARM-GRP-READY` => `SWARM-INST-ALL`, scenarios 1 (orphan) and 3 (node down) present the same:

```
SWARM-GRP-READY   SWARM-INST-ALL
n1-link1          n1 (ready)
                  n2 (down)
n3-link3          n3 (ready)
```
With `SWARM-GRP-READY` => `VM-INST-ALL`, scenarios 2 (join failure) and 3 (node down) present the same:

```
SWARM-GRP-READY   VM-INST-ALL
n1-link1          n1-link1
                  n2-link2
n3-link3          n3-link3
```

In this case, whatever timeout value we assign will handle the processing for both scenarios, since they are indistinguishable.
It seems like we need the data from both `VM-INST-ALL` and `SWARM-INST-ALL` in order to uniquely identify the scenario.
I'm wondering how generic a problem this really is (in other words, do we need to use the generic enroller to handle this?). Couldn't we create a `SwarmConsistency` controller that has timeout values for the different scenarios?
The logic would be something like:
```
for node in union(VM-INST-ALL, SWARM-INST-ALL):
    if node in SWARM-GRP-READY:
        continue  // healthy node
    if node in SWARM-INST-ALL and node not in VM-INST-ALL:
        ...do orphan logic...
    else if node not in SWARM-INST-ALL and node in VM-INST-ALL:
        ...do join failure logic...
    else:
        ...do node down logic...
```
With this solution we don't need the group/instance plugins for the swarm nodes. The new controller could just get the nodes via the docker client directly. Thoughts?
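The classification above could be realized with plain set operations; here is a minimal Go sketch (the set-of-strings representation and the scenario labels are illustrative only, not actual controller types):

```go
package main

import "fmt"

// classify maps each node to one of the scenarios discussed above.
// The three inputs stand in for VM-INST-ALL, SWARM-INST-ALL and
// SWARM-GRP-READY, represented here as simple string sets.
func classify(vmAll, swarmAll, swarmReady map[string]bool) map[string]string {
	result := map[string]string{}
	for node := range union(vmAll, swarmAll) {
		switch {
		case swarmReady[node]:
			result[node] = "healthy"
		case swarmAll[node] && !vmAll[node]:
			result[node] = "orphan" // in the swarm, but no backing instance
		case !swarmAll[node] && vmAll[node]:
			result[node] = "join-failure" // instance exists, never joined
		default:
			result[node] = "node-down" // known to both, but not ready
		}
	}
	return result
}

func union(a, b map[string]bool) map[string]bool {
	u := map[string]bool{}
	for k := range a {
		u[k] = true
	}
	for k := range b {
		u[k] = true
	}
	return u
}

func main() {
	vm := map[string]bool{"n1": true, "n2": true, "n4": true}
	swarm := map[string]bool{"n1": true, "n3": true, "n4": true}
	ready := map[string]bool{"n1": true}
	// n1 healthy, n2 join-failure, n3 orphan, n4 node-down
	fmt.Println(classify(vm, swarm, ready))
}
```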
The *Rogue node* case is tricky... The most we can do is to force a `docker node rm`, but I don't think we can safely delete any instances, nor do we want to. This is because there is no way to tell that an instance returned by the instance plugin without a link tag can ever be safely or correctly correlated to a swarm node. If you have two instances, there's no way to know which one is really running the Docker engine... the only way is to connect to the instance and see if the engine returns an ID that matches any of the entries in the swarm.
If all we can do in this case is to force a `docker node rm`, then we are effectively reducing the capacity of the cluster, because as far as InfraKit is concerned, the group has the size specified at some point. So `docker node rm` will not cause additional nodes to be provisioned. I think if these rogue nodes do appear -- which is entirely possible because provisioning can always be done manually or through other means -- we should leave them alone. Or, at the very least, removing them as swarm nodes should be a matter of policy...
> The most we can do is to force `docker node rm`, but I don't think we can safely delete any instances, nor do we want to.
I agree, if anything we'd just remove them from the swarm. And you're right, if we have 2 nodes that have the same link ID then we have no way to know which is the "correct" node.
I was more thinking of the case where someone followed the UCP steps and ran the docker join command on another system that Infrakit is not managing.
> Or, at the very least, removing them as swarm nodes should be a matter of policy...
Yeah, I think that this is really a corner case and that we can address it later (if at all).
See "Garbage Collection" section in https://github.com/docker/infrakit/issues/838#issuecomment-360625772
There are 3 different scenarios that we want to handle. To illustrate these, assume that we have 1 group plugin (`SWARM-GRP-READY`, which returns all `ready` swarm nodes) and 2 instance plugins (`SWARM-INST-ALL`, which returns all swarm nodes, and `VM-INST-ALL`, which is the generic instance plugin that is wired to the group controller).

**Scenario 1: Orphaned swarm node**
In this case, assume that node `n2` was the old leader and it has been demoted but it could not be removed from the swarm. The `VM-INST-ALL` plugin does not have any entry for it since it was `Destroy`ed (the `flavor.Drain` was also completed in that flow).

In this case, we wire the enrollment controller to use `SWARM-GRP-READY` and `SWARM-INST-ALL`; since `n2` is missing from the group, `SWARM-INST-ALL.Destroy` is executed to remove the orphan from the swarm.

NOTE: this presents the same as scenario 3 (from the perspective of `SWARM-INST-ALL`).

**Scenario 2: Node fails to join**
In this case, assume that node `n2` was `VM-INST-ALL.Provision`ed but it never joined.

In this case, we wire the enrollment controller to use `SWARM-GRP-READY` and `VM-INST-ALL`; since `n2` is missing from the group, it is removed from `VM-INST-ALL`. The group controller will then detect that the group is not at the desired size and issue another `VM-INST-ALL.Provision`.

**Scenario 3: Node goes offline**
In this case, assume that node `n2` was `Provision`ed, joins the cluster, and then goes offline.

In this case, we have a few options. If we wire the enroller to use `VM-INST-ALL`, then `VM-INST-ALL.Destroy` will destroy and replace the node. If we wire the enroller to use `SWARM-INST-ALL`, then we really haven't solved the problem (this just turns into scenario 2).

NOTE: this presents the same as scenario 1 (from the perspective of `SWARM-INST-ALL`).

**Scenario 4: Rogue node**
In this case, assume that node `n2` joined but is not managed by InfraKit.

If we wire the enroller to use `VM-INST-ALL`, then the enroller will actually issue a `VM-INST-ALL.Provision` (in an effort to sync these up). However, in this case, we simply want to issue a `SWARM-INST-ALL.Destroy` to remove it from the swarm. The challenge with this case is that we need the context from `VM-INST-ALL` to even know that it is a rogue node.

**Timeouts**
Another problem that I see deals with handling timeouts. In all 3 scenarios, we do not want to issue the appropriate instance plugin's `Destroy` on the first detection of a removal delta; we need to add a timeout value (since a node could go `down` because of a short network outage and then come back up and be `ready` again).

IMO, the timeout for a node join failure should be longer (like an hour after the VM was successfully created) than the timeout for an offline node (maybe 30 minutes). Unfortunately, `VM-INST-ALL` cannot differentiate between these 2 scenarios (either way the `SWARM-GRP-READY` is missing the entry).

I think that we can add something like a removal policy into the enrollment controller; valid options would be:

- `Duration` after first removal delta detection

@chungers Thoughts on how to handle these 3 different scenarios?
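Such a removal policy might look something like this in the enrollment controller's properties. This is purely a hypothetical sketch; the `RemovalPolicy` key and its fields are not existing InfraKit options, just an illustration of per-scenario timeouts:

```json
{
  "RemovalPolicy": {
    "JoinFailureTimeout": "1h",
    "OfflineTimeout": "30m",
    "RemoveAfterAdditionalAbsences": 2
  }
}
```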