docker-archive / deploykit

A toolkit for creating and managing declarative, self-healing infrastructure.
Apache License 2.0

Cascading deletes using the terraform plugin #840

Closed. kaufers closed this issue 6 years ago.

kaufers commented 6 years ago

The terraform plugin supports defining related resources (for example, an NFS volume for a group of instances and a block storage volume for a single instance). When the group is removed, we want to ensure that all of these related resources are also cleaned up.

We hit problems when the current leader is destroyed first: the VM running terraform is stopped before terraform can finish removing everything.

It seems we need to do two things:

  1. On group Destroy, ensure that the current leader is destroyed last.
  2. In the terraform plugin, ensure that the dependent resources are removed in the same terraform apply cycle as the VM.

Note that 2 is tricky: we cannot simply delete everything except the VM, since the VM will not function correctly if its backing storage is removed. A combined destroy is sketched below.
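
A minimal sketch of item 2, assuming the plugin stores each instance and each related resource as a `.tf.json` file in a shared directory (the file layout and the `destroyWithRelated` helper are illustrative, not the plugin's actual API):

```go
package sketch

import (
	"os"
	"os/exec"
	"path/filepath"
)

// destroyWithRelated removes the instance's .tf.json file together with
// the files for its related resources, then runs a single terraform
// apply so the VM and its dependencies are destroyed in one cycle.
func destroyWithRelated(dir, instanceID string, related []string) error {
	files := append([]string{instanceID + ".tf.json"}, related...)
	for _, f := range files {
		if err := os.Remove(filepath.Join(dir, f)); err != nil && !os.IsNotExist(err) {
			return err
		}
	}
	cmd := exec.Command("terraform", "apply") // assumes terraform is on PATH
	cmd.Dir = dir
	return cmd.Run()
}
```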

chungers commented 6 years ago

Per issue #838 and PR #839, the leader node will be terminated as the very last step: with export INFRAKIT_GROUP_POLICY_SELF_UPDATE=last (which is also the default behavior -- see https://github.com/docker/infrakit/blob/master/pkg/run/v0/group/group.go#L70), the leader is terminated as the very last node in the rolling update. Please verify this behavior.

This addresses item 1 above. If 1 is guaranteed, the next step is to ensure we can properly terminate the VM and all of its resources in a predictable way -- since the "self" node can shut down at any time due to the VM termination, a terraform apply could be mid-flight, potentially leaving Terraform files on disk in a corrupted state.
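
For illustration, the leader-last ordering reduces to something like the following (`destroyOrder` is a hypothetical helper, not the group plugin's actual code):

```go
package sketch

// destroyOrder returns the instance IDs with every non-self instance
// first and the self/leader node at the very end, mirroring the
// INFRAKIT_GROUP_POLICY_SELF_UPDATE=last behavior described above.
func destroyOrder(instances []string, self string) []string {
	ordered := make([]string, 0, len(instances))
	for _, id := range instances {
		if id != self {
			ordered = append(ordered, id)
		}
	}
	return append(ordered, self) // the leader is destroyed last
}
```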

chungers commented 6 years ago

How can we delete the VM and its associated resources in a way that is tolerant of terraform apply being interrupted mid-flight when the self node shuts down?

Thinking through how Terraform works... I wonder if this can be done at all. If the self node is terminated as part of terraform apply, that process will just die mid-flight. Will this leave the terraform state files on disk in a corrupted state? If we know that terraform at least guarantees file/state consistency at per-resource granularity, then we could do something with creating tombstones for the resources we need to delete (a sketch follows the list below):

  1. Determine the list of resources that need to be terminated per instance destroy (the VM instance, the volumes).
  2. Create a folder on disk for the 'delete' operation, for example delete-<timestamp>.
  3. In this directory, create symlinks to all the files to be deleted.
  4. At the top-level directory, change a symlink (e.g. delete-current) to point to this new directory.
  5. After the symlink is in place, start deleting every file that the symlinks in delete-current point to (the targets, not the symlinks themselves).
  6. Now call terraform apply. Terraform will start deleting resources and update its state file as it proceeds (or maybe wait for everything to be deleted and then 'commit').
  7. The node running the terraform apply is terminated. Everything goes out.
  8. At this point, the other running manager nodes detect that the current leader just went offline. A new round of leader election takes place and a new leader (an already-updated node) takes over.
  9. The new leader starts up.
  10. The new leader looks at the terraform state files on its disk (a shared/global mount amongst the managers). It makes sure that every symlink in the delete-current directory is dangling; if any symlink's target (read via os.Readlink()) still exists, it removes the linked file.
  11. The new leader (its terraform plugin) now calls terraform apply again.
  12. terraform apply now runs on the new leader node and reconciles the infra resources with the on-disk files.
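
A minimal sketch of steps 2-5 and step 10 in Go, assuming the files to delete are given as absolute paths on the shared mount (`prepareTombstones` and `reconcileTombstones` are hypothetical names, and the rename-over-symlink swap assumes a POSIX filesystem):

```go
package tombstone

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// prepareTombstones covers steps 2-5: record the files to delete as
// symlinks in a delete-<timestamp> directory, atomically repoint the
// delete-current symlink at it, and only then remove the real files.
// The entries in files are expected to be absolute paths.
func prepareTombstones(root string, files []string) error {
	dir := filepath.Join(root, fmt.Sprintf("delete-%d", time.Now().UnixNano()))
	if err := os.Mkdir(dir, 0o755); err != nil {
		return err
	}
	for _, f := range files {
		if err := os.Symlink(f, filepath.Join(dir, filepath.Base(f))); err != nil {
			return err
		}
	}
	// Swap delete-current atomically: build the link under a temporary
	// name, then rename over the old link (atomic on POSIX filesystems).
	tmp := filepath.Join(root, "delete-current.tmp")
	_ = os.Remove(tmp) // clear any leftover from a previous crash
	if err := os.Symlink(dir, tmp); err != nil {
		return err
	}
	if err := os.Rename(tmp, filepath.Join(root, "delete-current")); err != nil {
		return err
	}
	// The tombstones are durable; now delete the real files so the next
	// terraform apply destroys the corresponding resources.
	for _, f := range files {
		if err := os.Remove(f); err != nil && !os.IsNotExist(err) {
			return err
		}
	}
	return nil
}

// reconcileTombstones covers step 10 on the new leader: any symlink in
// delete-current whose target still exists points at a file the old
// leader failed to delete, so finish the deletion here.
func reconcileTombstones(root string) error {
	current := filepath.Join(root, "delete-current")
	entries, err := os.ReadDir(current) // follows the symlink to the directory
	if err != nil {
		return err
	}
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(current, e.Name()))
		if err != nil {
			continue // not a symlink; ignore
		}
		if _, err := os.Stat(target); err == nil {
			if err := os.Remove(target); err != nil {
				return err
			}
		}
	}
	return nil
}
```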

The big assumption here is that any files Terraform writes (its own state files -- not the ones we create/delete) do not get corrupted mid-flight. This is a pretty big assumption. Is there a way you can verify, @kaufers?

If we don't want to make this assumption, or don't trust what it says on the tin, then we would have to do something more coordinated. See my comments on #838.

kaufers commented 6 years ago

@chungers I think that what you have for #838 and #839 might actually solve this issue. Today, with the "resource" counting, we remove the "globally" scoped resource files when the last VM that references them is destroyed. In this case, that means that the terraform apply will include the destroy call for all of the resources (including the self VM).
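
For reference, the counting behaves roughly like this (`releaseGlobal` and the in-memory map are illustrative only, not the plugin's actual bookkeeping):

```go
package sketch

// releaseGlobal decrements the reference count for a globally scoped
// resource file and reports whether this was the last VM referencing
// it, in which case the file is removed in the same apply cycle.
func releaseGlobal(refs map[string]int, resourceFile string) (removeNow bool) {
	refs[resourceFile]--
	if refs[resourceFile] <= 0 {
		delete(refs, resourceFile)
		return true // last reference gone: destroy this resource too
	}
	return false
}
```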

In my testing on IBM Cloud, the resource destroy API call returns pretty quickly and there is a delay (up to a few minutes) before the actual VM is powered down. This provides plenty of time for all of the resources to be destroyed.

We hit issues when the manager group destroy deletes the current leader first. Once the updates that ensure destroy ordering are merged, I'll post an update to this issue (there may no longer be any problems).