docker-archive / deploykit

A toolkit for creating and managing declarative, self-healing infrastructure.
Apache License 2.0

Use fsm to model the workflow of garbage collections #853

Closed chungers closed 6 years ago

chungers commented 6 years ago

This PR implements a simple POC for #846 using the pkg/fsm state machine models. The POC is implemented as additional tests for the fsm package.

Some terminology: the Index type defines the possible states, while a Signal is an event that an fsm reacts to. For the purpose of managing swarm nodes, we consider a swarm node to have two components that are joined via a link label. To link a vm instance to a running Docker engine, the vm instance must carry a tag of value K and the Docker engine a label of the same value K. Once this association is established, we no longer need to query the actual Docker engine (via TLS or via SSH) for engine identity. Our fsm models this using the states:

const (
    start                    Index = iota
    matched_instance               // has vm information, waiting to match to docker_node
    matched_docker_node            // has docker_node information, waiting to match to vm
    swarm_node                     // has matching docker_node and vm information
    swarm_node_ready               // ready as swarm node
    swarm_node_down                // unavailable as swarm node
    pending_instance_destroy       // vm needs to be removed (instance destroy)
    removed_instance               // instance is deleted
    done                           // terminal
)

with signals that we can implement:

const (
    docker_node_ready Signal = iota
    docker_node_down
    docker_node_gone
    instance_ok
    instance_gone
    timeout
    reap
)
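
The link-label join described above can be sketched roughly as follows. This is an illustration only, not code from the PR or from pkg/fsm: the `classify` function and its slice inputs are hypothetical, and `Index` here is a local stand-in for the fsm package's type. Only the state names are copied from the list above.

```go
package main

import "fmt"

// Index is a local stand-in for the fsm package's state type (assumption).
type Index int

// A subset of the states from the PR, names copied as-is.
const (
	start Index = iota
	matched_instance
	matched_docker_node
	swarm_node
)

// classify is a hypothetical helper: it joins one sample of vm instance tags
// with one sample of Docker engine labels on the shared link value K.
func classify(instanceTags, engineLabels []string) map[string]Index {
	vms := make(map[string]struct{}, len(instanceTags))
	states := make(map[string]Index)
	for _, k := range instanceTags {
		vms[k] = struct{}{}
		states[k] = matched_instance // vm seen, no engine matched yet
	}
	for _, k := range engineLabels {
		if _, hasVM := vms[k]; hasVM {
			states[k] = swarm_node // both halves share the link value
		} else {
			states[k] = matched_docker_node // engine seen, no vm yet
		}
	}
	return states
}

func main() {
	states := classify([]string{"k1", "k2"}, []string{"k2", "k3"})
	fmt.Println(states["k1"] == matched_instance)
	fmt.Println(states["k2"] == swarm_node)
	fmt.Println(states["k3"] == matched_docker_node)
}
```

Once a key reaches swarm_node, the association is known from the tag and label alone, which is why no further query to the engine is needed for identity.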

Note that we can determine when a Docker node or an instance is gone by computing the difference of the sampled sets of nodes/instances. Please see the TestSwarmEntities test for the four cases listed in #846:

  1. Orphaned node
  2. Node fails to join
  3. Node goes offline
  4. Rogue node
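
As a rough sketch of the set difference mentioned above (not taken from the PR): the hypothetical `gone` helper below compares two samples of link keys and reports the keys that disappeared. It assumes the two samples are a previous and a current poll, and that each missing key would be turned into an instance_gone or docker_node_gone signal.

```go
package main

import (
	"fmt"
	"sort"
)

// gone is a hypothetical helper: it returns the keys present in the previous
// sample but absent from the current one, sorted for stable output.
func gone(previous, current []string) []string {
	seen := make(map[string]struct{}, len(current))
	for _, k := range current {
		seen[k] = struct{}{}
	}
	missing := []string{}
	for _, k := range previous {
		if _, ok := seen[k]; !ok {
			missing = append(missing, k)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	previous := []string{"link-a", "link-b", "link-c"}
	current := []string{"link-a", "link-c"}
	for _, k := range gone(previous, current) {
		// In the fsm this is where a "gone" signal would be raised for k.
		fmt.Println("gone:", k)
	}
}
```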

Here is an example state definition:

        State{
            Index: matched_docker_node,
            TTL:   Expiry{wait_describe_instances, instance_gone},
            Transitions: map[Signal]Index{
                instance_ok:      swarm_node,
                instance_gone:    removed_instance,
                docker_node_gone: removed_instance, // could be docker rm'd out of band
            },
            Actions: map[Signal]Action{
                instance_gone: dockerNodeRm.do,
            },
        },

In this example, matched_docker_node is the state a particular "node" is in when there is a docker node ls entry for it but no matching vm yet. In the Transitions map, the signal instance_ok moves the state to swarm_node, while instance_gone moves it to removed_instance. The TTL says that a node may stay in this state for at most wait_describe_instances * ticks, where ticks is the polling interval the fsm engine is initialized with. When that deadline is exceeded, the signal instance_gone is raised and the action dockerNodeRm.do is performed, as specified in the Actions map.
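
To make the TTL behavior concrete, here is a minimal, self-contained sketch of an engine that honors a definition like this one. It is not the pkg/fsm implementation: the `machine` type, its `tick` and `signal` methods, the `newMachine` helper, the named `Expiry` fields, the `func()` action signature, and the TTL of 3 ticks are all assumptions made for illustration. The sketch also treats the deadline as reached once the node has spent TTL ticks in the state.

```go
package main

import "fmt"

type Index int
type Signal int
type Tick int
type Action func() // assumed signature, for illustration only

const (
	matched_docker_node Index = iota
	swarm_node
	removed_instance
)

const (
	instance_ok Signal = iota
	instance_gone
	docker_node_gone
)

// Expiry says: after TTL ticks in a state, raise the given signal.
type Expiry struct {
	TTL   Tick
	Raise Signal
}

type State struct {
	Index       Index
	TTL         Expiry
	Transitions map[Signal]Index
	Actions     map[Signal]Action
}

// machine tracks one node: its current state and the ticks spent in it.
type machine struct {
	spec    map[Index]State
	current Index
	age     Tick
}

// signal applies s if the current state defines a transition for it.
func (m *machine) signal(s Signal) bool {
	st := m.spec[m.current]
	next, ok := st.Transitions[s]
	if !ok {
		return false
	}
	if act, has := st.Actions[s]; has {
		act()
	}
	m.current, m.age = next, 0
	return true
}

// tick advances the clock and raises the expiry signal once the TTL is reached.
func (m *machine) tick() {
	m.age++
	st := m.spec[m.current]
	if st.TTL.TTL > 0 && m.age >= st.TTL.TTL {
		m.signal(st.TTL.Raise)
	}
}

func newMachine(onRemove Action) *machine {
	return &machine{
		current: matched_docker_node,
		spec: map[Index]State{
			matched_docker_node: {
				Index: matched_docker_node,
				TTL:   Expiry{TTL: 3, Raise: instance_gone},
				Transitions: map[Signal]Index{
					instance_ok:      swarm_node,
					instance_gone:    removed_instance,
					docker_node_gone: removed_instance,
				},
				Actions: map[Signal]Action{instance_gone: onRemove},
			},
		},
	}
}

func main() {
	removed := 0
	m := newMachine(func() { removed++ })
	for i := 0; i < 3; i++ {
		m.tick() // no instance_ok arrives, so the TTL runs out
	}
	fmt.Println(m.current == removed_instance, removed)
}
```

If instance_ok arrives before the third tick, the same machine moves to swarm_node instead and the removal action never runs.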

@kaufers

Signed-off-by: David Chung <david.chung@docker.com>

codecov[bot] commented 6 years ago

Codecov Report

Merging #853 into master will decrease coverage by <.01%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #853      +/-   ##
==========================================
- Coverage   48.46%   48.45%   -0.01%     
==========================================
  Files          89       89              
  Lines        8125     8015     -110     
==========================================
- Hits         3938     3884      -54     
+ Misses       3817     3774      -43     
+ Partials      370      357      -13
Impacted Files                           Coverage Δ
pkg/rpc/mux/server.go                    42.7% <0%> (-5.21%)
pkg/controller/group/scaled.go
pkg/controller/group/rollingupdate.go
pkg/controller/group/group.go
pkg/controller/group/state.go
pkg/controller/group/quorum.go
pkg/controller/group/lazy.go
pkg/controller/group/controller.go
pkg/controller/group/testplugin.go
pkg/controller/group/scaler.go
... and 9 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update f780dd0...a88635c.

GordonTheTurtle commented 6 years ago

Please sign your commits following these rules: https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work

The easiest way to do this is to amend the last commit:

$ git clone -b "gc-fsm" git@github.com:chungers/infrakit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842354013520
editor opens
change each 'pick' to 'edit'
save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

chungers commented 6 years ago

Merging this because, regardless of which implementation design we take, this PR adds at a minimum more tests to the fsm package.