docker-archive / deploykit

A toolkit for creating and managing declarative, self-healing infrastructure.
Apache License 2.0

Destroy current VM last in group destroy #864

Closed kaufers closed 6 years ago

kaufers commented 6 years ago

When destroying a group, if one of the members is "self" then we should destroy it last. We do not want to destroy the VM that is running the instance plugin before it has completed all of the VM terminations.
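The ordering described above can be sketched as a small Go helper (hypothetical names; the real logic lives in infrakit's group controller) that moves the "self" instance to the end of the destroy list:

```go
package main

import "fmt"

// Instance is a minimal stand-in for infrakit's instance descriptor
// (hypothetical; the real type lives in pkg/spi/instance).
type Instance struct {
	ID string
}

// orderForDestroy returns instances reordered so that the "self" instance
// (the VM running the plugins) is destroyed last, after all other
// terminations have completed.
func orderForDestroy(instances []Instance, selfID string) []Instance {
	ordered := make([]Instance, 0, len(instances))
	var self []Instance
	for _, inst := range instances {
		if inst.ID == selfID {
			self = append(self, inst)
			continue
		}
		ordered = append(ordered, inst)
	}
	return append(ordered, self...)
}

func main() {
	got := orderForDestroy([]Instance{{"a"}, {"self"}, {"b"}}, "self")
	fmt.Println(got) // self ends up last in the destroy order
}
```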

codecov[bot] commented 6 years ago

Codecov Report

Merging #864 into master will increase coverage by 0.17%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #864      +/-   ##
==========================================
+ Coverage   48.78%   48.95%   +0.17%     
==========================================
  Files          91       91              
  Lines        8210     8207       -3     
==========================================
+ Hits         4005     4018      +13     
+ Misses       3835     3821      -14     
+ Partials      370      368       -2
Impacted Files Coverage Δ
pkg/controller/group/group.go 50.25% <100%> (+0.25%) ↑
pkg/provider/terraform/instance/plugin.go 87.13% <0%> (+0.18%) ↑
pkg/provider/terraform/instance/apply.go 69.1% <0%> (+2.54%) ↑
pkg/rpc/mux/server.go 47.91% <0%> (+5.2%) ↑

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 2822331...d123fcf.

chungers commented 6 years ago

In addition to my comment on the sort order of the instances to destroy, the bigger question is whether the self VM should terminate itself.

If we have a swarm demote, then the self node should be terminated by the new leader, right? I have concerns about the on-disk state of the terraform plugins if the host goes down in the middle of a destroy.

kaufers commented 6 years ago

> the bigger question is whether the self vm should terminate itself.

Yeah, so this gets interesting. We need to allow the manager flavor's Drain in the destroy flow. But should we allow the instance plugin to actually Destroy itself, or do we prevent the invocation of the instance plugin's Destroy API?

> If we have a swarm demote then the self node should be terminated by the new leader, right? I have concerns about the state on disk of the terraform plugins when the host goes down in the middle of a destroy.

Actually, today, the instance plugin's Destroy is called. The tf implementation removes the associated tf.json file from disk. However, terraform apply is never invoked, because we only run it on the leader (and this node is no longer the leader).

Then, when the new leader is elected, the tf instance plugin's DescribeInstances API returns only 2 of the 3 managers (it is based on the tf.json data, and the "old" leader's tf.json file was removed). The group size constraint is therefore not met and another manager is spun up. So, in reality, we're not completing the rolling update; we're just ensuring that the size constraint is satisfied.

With this background, it seems like we should allow the flavor Drain but prevent the instance Destroy. Thoughts, @chungers?

chungers commented 6 years ago

> If we have a swarm demote then the self node should be terminated by the new leader, right? I have concerns about the state on disk of the terraform plugins when the host goes down in the middle of a destroy. Actually, today, the instance plugin's Destroy is called. The tf implementation removes the associated tf.json file from disk. However, terraform apply, is never invoked because we only invoke that on the leader (and it's no longer a leader).
>
> Then, when the new leader is elected, the tf instance's DescribeInstances API only returns 2 (out of 3) managers (since it's based on the tf.json data and the "old" leader's tf.json file was removed) so then the group size constraint is not met and another manager is spun up. So, in reality, it's not completing the rolling update and it's just ensuring that the size constraint is satisfied.
>
> With this background, seems like we should allow the flavor Drain but prevent the instance Destroy? Thoughts @chungers

That's what I am thinking... here's some logic to check for this: https://github.com/docker/infrakit/blob/master/pkg/controller/group/rollingupdate.go#L40

I think the extraneous policy flags obfuscate the intent here... Perhaps we should make it a rule to never call the instance Destroy on self, instead of all these confusing policy checks...

If the self VM never calls instance.Destroy, then we'd never modify the tf.json files on disk. Instead, we just let leadership transfer to a new node that was just updated... That new leader will have a clean view of the infrastructure: it will see that one instance in the manager group needs an update and go through a normal sequence of Destroy, reprovision, and re-join as manager. It can make this decision purely by looking at the hash tag of the instance, not based on any information from the swarm... then it removes a tf.json file and can call a terraform apply. There shouldn't be any races going on. What do you think?

kaufers commented 6 years ago

I've opened up https://github.com/docker/infrakit/issues/867 to handle the changes to the rolling update policies.

GordonTheTurtle commented 6 years ago

Please sign your commits following these rules: https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work The easiest way to do this is to amend the last commit:

$ git clone -b "group-destroy-order" git@github.com:kaufers/infrakit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842354330656
# editor opens
# change each 'pick' to 'edit'
# save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue  # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.