docker-archive / deploykit

A toolkit for creating and managing declarative, self-healing infrastructure.
Apache License 2.0

Swarm node garbage collection #846

Closed · kaufers closed 6 years ago

kaufers commented 6 years ago

See "Garbage Collection" section in https://github.com/docker/infrakit/issues/838#issuecomment-360625772

There are 4 different scenarios that we want to handle. To illustrate these, assume that we have 1 group plugin (SWARM-GRP-READY, which returns all ready swarm nodes) and 2 instance plugins (SWARM-INST-ALL, which returns all swarm nodes, and VM-INST-ALL, the generic instance plugin that is wired to the group controller).

Scenario 1: Orphaned swarm node

In this case, assume that node n2 was the old leader; it has been demoted but could not be removed from the swarm. The VM-INST-ALL plugin does not have any entry for it since the instance was already Destroyed (the flavor.Drain was also completed in that flow).

SWARM-GRP-READY   SWARM-INST-ALL   VM-INST-ALL
n1-link1          n1 (ready)       n1-link1
                  n2 (down)                   
n3-link3          n3 (ready)       n3-link3

In this case, we wire the enrollment controller to use SWARM-GRP-READY and SWARM-INST-ALL; since n2 is missing from the group, SWARM-INST-ALL.Destroy is executed to remove the orphan from the swarm.

NOTE: this presents the same as scenario 3 (from the perspective of SWARM-INST-ALL).
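
For illustration, the removal side of that reconciliation amounts to something like the following. This is only a minimal Go sketch with stand-in types, not the actual enrollment controller code:

package main

import "fmt"

// Destroyer stands in for the instance plugin's Destroy operation
// (e.g. SWARM-INST-ALL.Destroy).
type Destroyer interface {
	Destroy(id string) error
}

// reconcile destroys every enrolled entry that is absent from the source group.
func reconcile(source map[string]bool, enrolled []string, plugin Destroyer) error {
	for _, id := range enrolled {
		if !source[id] {
			// n2 in scenario 1: listed by SWARM-INST-ALL, missing from SWARM-GRP-READY.
			if err := plugin.Destroy(id); err != nil {
				return err
			}
		}
	}
	return nil
}

type logDestroyer struct{}

func (logDestroyer) Destroy(id string) error {
	fmt.Println("Destroy", id)
	return nil
}

func main() {
	source := map[string]bool{"n1": true, "n3": true} // ready swarm nodes
	enrolled := []string{"n1", "n2", "n3"}            // all swarm nodes
	_ = reconcile(source, enrolled, logDestroyer{})   // prints: Destroy n2
}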

Scenario 2: Node fails to join

In this case, assume that node n2 was provisioned via VM-INST-ALL.Provision but never joined the swarm.

SWARM-GRP-READY   SWARM-INST-ALL   VM-INST-ALL
n1-link1          n1 (ready)       n1-link1
                                   n2-link2
n3-link3          n3 (ready)       n3-link3

In this case, we wire the enrollment controller to use SWARM-GRP-READY and VM-INST-ALL; since n2 is missing from the group, it is removed via VM-INST-ALL.Destroy. The group controller will then detect that the group is not at the desired size and issue another VM-INST-ALL.Provision.

Scenario 3: Node goes offline

In this case, assume that node n2 was provisioned, joined the cluster, and then went offline.

SWARM-GRP-READY   SWARM-INST-ALL   VM-INST-ALL
n1-link1          n1 (ready)       n1-link1
                  n2 (down)        n2-link2
n3-link3          n3 (ready)       n3-link3

In this case, we have a few options. If we wire the enroller to use VM-INST-ALL, then VM-INST-ALL.Destroy will destroy and replace the node. If we wire the enroller to use SWARM-INST-ALL, then we really haven't solved the problem (this just turns into scenario 2).

NOTE: this presents the same as scenario 1 (from the perspective of SWARM-INST-ALL).

Scenario 4: Rogue node

In this case, assume that node n2 joined the swarm but is not managed by Infrakit.

SWARM-GRP-READY   SWARM-INST-ALL   VM-INST-ALL
n1-link1          n1 (ready)       n1-link1
n2-link2          n2 (ready)
n3-link3          n3 (ready)       n3-link3

If we wire the enroller to use VM-INST-ALL, then the enroller will actually issue a VM-INST-ALL.Provision (in an effort to sync these up). However, in this case, we simply want to issue a SWARM-INST-ALL.Destroy to remove it from the swarm. The challenge with this case is that we need the context from VM-INST-ALL to even know that it is a rogue node.

Timeouts

Another problem that I see is handling timeouts. In the first 3 scenarios, we do not want to issue the appropriate instance plugin's Destroy on the first detection of a removal delta; we need to add a timeout value (since a node could go down because of a short network outage and then come back up and be ready again).

IMO, the timeout for a node join failure should be longer (say, an hour after the VM was successfully created) than the timeout for an offline node (maybe 30 minutes). Unfortunately, VM-INST-ALL cannot differentiate between these 2 scenarios (either way, the SWARM-GRP-READY entry is missing).

I think that we can add something like a removal policy into the enrollment controller; valid options would be:

  1. Remove on detection of first removal delta (default and what we have today)
  2. Remove after exceeding some time Duration after the first removal delta detection (sketched below)
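
Roughly, option 2 could look like the following. This is only a sketch (hypothetical types and field names, not an existing enrollment controller option): track when an entry first went missing and only remove it once the absence has lasted longer than the configured Duration.

package main

import (
	"fmt"
	"time"
)

type removalPolicy struct {
	gracePeriod time.Duration
	firstAbsent map[string]time.Time // keyed by instance entry
}

// shouldRemove is called on every poll with the entries currently missing from
// the source group; it returns the entries whose absence exceeded the grace
// period. Entries that reappear are forgotten.
func (p *removalPolicy) shouldRemove(absent map[string]bool, now time.Time) []string {
	for id := range p.firstAbsent {
		if !absent[id] {
			delete(p.firstAbsent, id) // the entry came back; reset it
		}
	}
	var remove []string
	for id := range absent {
		first, seen := p.firstAbsent[id]
		if !seen {
			p.firstAbsent[id] = now // first removal delta detection
			continue
		}
		if now.Sub(first) >= p.gracePeriod {
			remove = append(remove, id)
		}
	}
	return remove
}

func main() {
	p := &removalPolicy{gracePeriod: 30 * time.Minute, firstAbsent: map[string]time.Time{}}
	now := time.Now()
	fmt.Println(p.shouldRemove(map[string]bool{"n2": true}, now))                     // []
	fmt.Println(p.shouldRemove(map[string]bool{"n2": true}, now.Add(45*time.Minute))) // [n2]
}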

@chungers Thoughts on how to handle these different scenarios?

chungers commented 6 years ago

Without any additional changes to the enrollment controller... can we run two instances of the enrollment controller, one wired as SWARM-GRP-READY => SWARM-INST-ALL and the other as SWARM-GRP-READY => VM-INST-ALL? Setting aside excessive polling for a second, and supposing that we add a removal policy, wouldn't these two combinations deal with all the cases?

If this is workable, we can avoid excessive polling by adding a caching instance plugin similar to how TF was implemented... this can then only query at X second intervals and any queries between samples will just use cached values.
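
The caching wrapper could be as simple as the following. This is a sketch against a made-up describer interface, not the real instance plugin SPI; the idea is just that calls within the sample interval reuse the last result.

package main

import (
	"fmt"
	"sync"
	"time"
)

// describer stands in for whatever "list all instances" call the real plugin exposes.
type describer interface {
	DescribeInstances() ([]string, error)
}

type cachingDescriber struct {
	mu       sync.Mutex
	backend  describer
	interval time.Duration
	last     []string
	lastTime time.Time
}

func (c *cachingDescriber) DescribeInstances() ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.last != nil && time.Since(c.lastTime) < c.interval {
		return c.last, nil // serve from cache between samples
	}
	out, err := c.backend.DescribeInstances()
	if err != nil {
		return nil, err
	}
	c.last, c.lastTime = out, time.Now()
	return out, nil
}

type fakeBackend struct{ calls int }

func (f *fakeBackend) DescribeInstances() ([]string, error) {
	f.calls++
	return []string{"n1", "n3"}, nil
}

func main() {
	b := &fakeBackend{}
	c := &cachingDescriber{backend: b, interval: 10 * time.Second}
	c.DescribeInstances()
	c.DescribeInstances()                  // within the interval: cached
	fmt.Println("backend calls:", b.calls) // 1
}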

Then we need to handle timeouts. We can add a policy to the enrollment controller... something that says RemoveAfterAdditionalAbsences=N -- where N defaults to 0, meaning that a Destroy is issued on the first absence (0 additional times). An N > 0 implies N*polling_interval of time tolerance from the first absence of an entry. The implementation is a bit tricky as we need to reset the counter (keyed by the instance entry) as soon as the entry comes back in the group's entries... but it seems pretty mechanical and generic.
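
A sketch of that counter (RemoveAfterAdditionalAbsences is only a proposed name here, and the types are made up):

package main

import "fmt"

type absenceTracker struct {
	n      int            // additional absences to tolerate; 0 = remove on first absence
	misses map[string]int // consecutive absences, keyed by instance entry
}

// observe is called once per poll with the set of entries currently absent from
// the group; it returns the entries that should be Destroyed now.
func (t *absenceTracker) observe(absent map[string]bool) []string {
	for id := range t.misses {
		if !absent[id] {
			delete(t.misses, id) // entry came back; reset its counter
		}
	}
	var remove []string
	for id := range absent {
		t.misses[id]++
		if t.misses[id] > t.n {
			remove = append(remove, id)
		}
	}
	return remove
}

func main() {
	t := &absenceTracker{n: 2, misses: map[string]int{}}
	for i := 1; i <= 3; i++ {
		fmt.Printf("poll %d: remove %v\n", i, t.observe(map[string]bool{"n2": true}))
	}
	// poll 1: remove []
	// poll 2: remove []
	// poll 3: remove [n2]
}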

Am I missing anything?

kaufers commented 6 years ago

The problem is that the different scenarios are not unique. For example:

SWARM-GRP-READY=>SWARM-INST-ALL

Scenarios 1 (orphan) and 3 (node down) present the same:

SWARM-GRP-READY   SWARM-INST-ALL
n1-link1          n1 (ready)
                  n2 (down)
n3-link3          n3 (ready)

SWARM-GRP-READY=>VM-INST-ALL

Scenarios 2 (join failure) and 3 (node down) present the same:

SWARM-GRP-READY   VM-INST-ALL
n1-link1          n1-link1
                  n2-link2
n3-link3          n3-link3

In each pairing, whatever timeout value we assign ends up being applied to both scenarios, since they cannot be told apart.

It seems like we need the data from both VM-INST-ALL and SWARM-INST-ALL in order to uniquely identify the scenario.

I'm wondering how generic a problem this really is (in other words, do we need to use the generic enroller to handle this?). Couldn't we create a SwarmConsistency controller that has timeout values for the different scenarios?

The logic would be something like:

for node in union(VM-INST-ALL, SWARM-INST-ALL):
  if node in SWARM-GRP-READY:
    continue // healthy node
  if node in SWARM-INST-ALL and node not in VM-INST-ALL:
    ...do orphan logic...
  else if node not in SWARM-INST-ALL and node in VM-INST-ALL:
    ...do join failure logic...
  else:
    ...do node down logic...
With this solution we don't need the group/instance plugins for the swarm nodes. The new controller could just get the nodes via the docker client directly. Thoughts?
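
Roughly, the classification with per-scenario grace periods could look like this. This is only a Go sketch of a hypothetical SwarmConsistency controller, not existing InfraKit code; the swarm views could come straight from the Docker client rather than from group/instance plugins.

package main

import (
	"fmt"
	"time"
)

type action string

const (
	healthy     action = "healthy"
	orphan      action = "orphan: remove from the swarm (docker node rm)"
	joinFailure action = "join failure: destroy the VM so the group re-provisions"
	nodeDown    action = "node down: destroy/replace the VM"
)

// gracePeriods lets each scenario have its own timeout before acting.
type gracePeriods struct {
	joinFailure time.Duration // e.g. 1 hour after the VM was created
	nodeDown    time.Duration // e.g. 30 minutes of not being ready
}

// classify mirrors the pseudocode above for a single node.
func classify(inReady, inSwarm, inVM bool) action {
	switch {
	case inReady:
		return healthy
	case inSwarm && !inVM:
		return orphan
	case !inSwarm && inVM:
		return joinFailure
	default:
		return nodeDown
	}
}

func main() {
	g := gracePeriods{joinFailure: time.Hour, nodeDown: 30 * time.Minute}
	fmt.Println(classify(false, true, false))                // scenario 1 (orphan)
	fmt.Println(classify(false, false, true), g.joinFailure) // scenario 2 (join failure)
	fmt.Println(classify(false, true, true), g.nodeDown)     // scenario 3 (node down)
}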

chungers commented 6 years ago

The Rogue node case is tricky... The most we can do is to force docker node rm, but I don't think we can safely delete any instances, nor do we want to. This is because there is no way to tell whether an instance returned by the Instance plugin without a link tag can be safely or correctly correlated to a swarm node. If you have two instances then there's no way to know which one is really running the docker engine... the only way is to connect to the instance and see if the engine returns an id that matches any of the entries in the swarm.

If all we can do in this case is to force docker node rm, then we are effectively reducing the capacity of the cluster, because as far as infrakit is concerned, the group is still at its specified size. So docker node rm will not cause additional nodes to be provisioned. I think if these rogue nodes do appear -- which is entirely possible because provisioning can always be done manually or through other means -- we should leave them alone. Or, at the very least, removing them as swarm nodes should be a matter of policy...

kaufers commented 6 years ago

The most we can do is to force docker node rm, but I don't think we can safely delete any instances, nor do we want to.

I agree, if anything we'd just remove them from the swarm. And you're right, if we have 2 nodes that have the same link ID then we have no way to know which is the "correct" node.

I was thinking more of the case where someone followed the UCP steps and ran the docker swarm join command on another system that Infrakit is not managing.

Or, at the very least, removing them as swarm nodes should be a matter of policy...

Yeah, I think that this is really a corner case and that we can address it later (if at all).