Closed: kaufers closed this 6 years ago
Merging #864 into master will increase coverage by 0.17%. The diff coverage is 100%.
@@            Coverage Diff             @@
##           master     #864      +/-   ##
==========================================
+ Coverage   48.78%   48.95%   +0.17%
==========================================
  Files          91       91
  Lines        8210     8207       -3
==========================================
+ Hits         4005     4018      +13
+ Misses      3835     3821      -14
+ Partials     370      368       -2
Impacted Files | Coverage Δ | |
---|---|---|
pkg/controller/group/group.go | 50.25% <100%> (+0.25%) | :arrow_up: |
pkg/provider/terraform/instance/plugin.go | 87.13% <0%> (+0.18%) | :arrow_up: |
pkg/provider/terraform/instance/apply.go | 69.1% <0%> (+2.54%) | :arrow_up: |
pkg/rpc/mux/server.go | 47.91% <0%> (+5.2%) | :arrow_up: |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2822331...d123fcf. Read the comment docs.
In addition to my comment on the sort order of the instances to destroy, the bigger question is whether the self VM should terminate itself.
If we have a swarm demote, then the self node should be terminated by the new leader, right? I have concerns about the on-disk state of the terraform plugins when the host goes down in the middle of a destroy.
> the bigger question is whether the self VM should terminate itself.
Yeah, so this gets interesting. We need to allow the manager flavor `Drain` in the destroy flow. But should we allow the instance plugin to actually `Destroy` itself? Or do we prevent the invocation of the instance plugin's `Destroy` API?
> If we have a swarm demote, then the self node should be terminated by the new leader, right? I have concerns about the on-disk state of the terraform plugins when the host goes down in the middle of a destroy.
Actually, today, the instance plugin's `Destroy` is called. The tf implementation removes the associated `tf.json` file from disk. However, `terraform apply` is never invoked because we only invoke that on the leader (and it's no longer a leader).

Then, when the new leader is elected, the tf instance's `DescribeInstances` API only returns 2 (out of 3) managers (since it's based on the `tf.json` data and the "old" leader's `tf.json` file was removed), so the group size constraint is not met and another manager is spun up. So, in reality, it's not completing the rolling update; it's just ensuring that the size constraint is satisfied.
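To make this concrete, here's a rough sketch of why the apply never happens on the demoted node. This is not the actual terraform plugin code; `tfPlugin`, `isLeader`, and `runTerraformApply` are hypothetical names for illustration only.

```go
package sketch

import "time"

// tfPlugin stands in for the terraform instance plugin; all names here are
// illustrative, not the real infrakit types.
type tfPlugin struct {
	isLeader          func() bool
	runTerraformApply func() error
}

// applyLoop shows why the demoted node never applies its own removal: its
// tf.json was already deleted by Destroy, but apply is gated on leadership.
func (p *tfPlugin) applyLoop() {
	for range time.Tick(30 * time.Second) {
		if !p.isLeader() {
			// Demoted node: the tf.json is gone from disk, but nothing
			// is ever applied to the infrastructure from here.
			continue
		}
		// Leader only: reconcile whatever tf.json files remain on disk.
		_ = p.runTerraformApply()
	}
}
```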
With this background, it seems like we should allow the flavor `Drain` but prevent the instance `Destroy`? Thoughts @chungers
> With this background, it seems like we should allow the flavor `Drain` but prevent the instance `Destroy`? Thoughts @chungers
That's what I am thinking... here's some logic to check for this: https://github.com/docker/infrakit/blob/master/pkg/controller/group/rollingupdate.go#L40

I think the extraneous policy flags obfuscate the intent here.... Perhaps we should make it a rule to never call instance Destroy on self, instead of all this confusing policy check...

If the self VM never calls instance.Destroy, then we'd never modify the `tf.json` files on disk. Instead, we just let leadership transfer to a new node that was just updated... That new leader will have a clean view of the infrastructure -- it will see that one instance needs an update in the manager group and go through a normal sequence of Destroy, reprovision, and re-join as manager. It can make this decision purely by looking at the hash tag of the instance and not based on any information from the swarm... then it removes a `tf.json` file and can call a `tf apply`. There shouldn't be any races going on. What do you think?
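A minimal sketch of that rule, using stand-in `ID` and `Destroyer` types and a hypothetical `destroyUnlessSelf` helper rather than the real infrakit SPI: skip the instance plugin's `Destroy` when the target is the node we are running on, and let the next leader reconcile it.

```go
package sketch

import "fmt"

// ID and Destroyer are stand-ins for infrakit's instance ID and the instance
// plugin's Destroy API; the real signatures may differ.
type ID string

type Destroyer interface {
	Destroy(id ID) error
}

// destroyUnlessSelf destroys every target except the local node, so the
// tf.json for "self" is never touched by this node.
func destroyUnlessSelf(plugin Destroyer, targets []ID, selfID ID) error {
	for _, id := range targets {
		if id == selfID {
			// Never Destroy self; the new leader will notice the stale
			// hash tag and replace this instance on its own.
			continue
		}
		if err := plugin.Destroy(id); err != nil {
			return fmt.Errorf("destroy %v: %v", id, err)
		}
	}
	return nil
}
```

The flavor `Drain` would still run before this point, so the node can leave the swarm cleanly even though its VM is left for the new leader to remove.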
I've opened up https://github.com/docker/infrakit/issues/867 to handle the changes to the rolling update policies.
Please sign your commits following these rules: https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:
$ git clone -b "group-destroy-order" git@github.com:kaufers/infrakit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842354330656
# editor opens: change each 'pick' to 'edit', then save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue   # and repeat the amend for each commit
$ git push -f
Amending updates the existing PR. You DO NOT need to open a new one.
When destroying a group, if one of the members is "self" then we should destroy it last. We do not want to destroy the VM that is running the instance plugin before it has completed all of the VM terminations.
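A minimal sketch of that ordering, assuming hypothetical names (`orderSelfLast`, `instanceID`) rather than the actual change in this PR: a stable sort that pushes the "self" member to the end of the destroy list.

```go
package main

import (
	"fmt"
	"sort"
)

type instanceID string

// orderSelfLast moves selfID to the end so every other VM is terminated
// before the one running the instance plugin.
func orderSelfLast(ids []instanceID, selfID instanceID) []instanceID {
	sort.SliceStable(ids, func(i, j int) bool {
		// Only "self" sorts after everything else; the relative order
		// of the remaining instances is preserved.
		return ids[i] != selfID && ids[j] == selfID
	})
	return ids
}

func main() {
	ids := []instanceID{"mgr-1", "mgr-2-self", "mgr-3"}
	fmt.Println(orderSelfLast(ids, "mgr-2-self")) // [mgr-1 mgr-3 mgr-2-self]
}
```

A stable sort keeps the relative order of the other members, so only the "self" instance moves to the end.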