VolumeAttachments are orphaned on hibernation

ialidzhikov commented 4 years ago

How to categorize this issue?

/area storage /kind bug /priority normal /platform gcp

What happened: VolumeAttachments are orphaned after hibernation and wake up. Such orphan VolumeAttachments block PV deletion afterwards.

What you expected to happen: PV to do not hang in Terminating because of orphan VolumeAttachments.

How to reproduce it (as minimally and precisely as possible):

Create a Shoot
Create sample PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: centos
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
        volumeMounts:
        - name: persistent-storage
          mountPath: /data
      volumes:
      - name: persistent-storage
        persistentVolumeClaim:
          claimName: ebs-claim

Ensure that there is one VolumeAttachment as expected

$ k get volumeattachments.storage.k8s.io
NAME                                                                   ATTACHER                PV                                         NODE                                                     ATTACHED   AGE
csi-428c30ed273c01d110380beb8bfac4052e7b3567300c9835421df82a0a99ddf4   pd.csi.storage.gke.io   pv--09fdeb10-ab27-441c-9daf-413238e2491b   shoot--foooooo--gcp-local-cpu-worker-z1-786584bb-gm8kb   true.      90s

Hibernate
Wake up
Ensure that the VolumeAttachment is not cleaned up after wake up

$ k get volumeattachments.storage.k8s.io
NAME                                                                   ATTACHER                PV                                         NODE                                                     ATTACHED   AGE
csi-3a2f98271fdce6a780cb93777e5c5042b64e0782855355fd1cd9a270231857d8   pd.csi.storage.gke.io   pv--09fdeb10-ab27-441c-9daf-413238e2491b   shoot--foooooo--gcp-local-cpu-worker-z1-786584bb-r4gls   true       89s
csi-428c30ed273c01d110380beb8bfac4052e7b3567300c9835421df82a0a99ddf4   pd.csi.storage.gke.io   pv--09fdeb10-ab27-441c-9daf-413238e2491b   shoot--foooooo--gcp-local-cpu-worker-z1-786584bb-gm8kb   true       9m25s

Delete the sample resources from step 2
Ensure that PV hangs in Terminating

$ k get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM               STORAGECLASS   REASON   AGE
pv--09fdeb10-ab27-441c-9daf-413238e2491b   4Gi        RWO            Delete           Terminating   default/ebs-claim   default                 23m

Anything else we need to know?:

Environment:

Gardener version (if relevant): v1.10
Extension version: v1.11.0
Kubernetes version (use kubectl version): v1.18.8
Cloud provider or hardware configuration:
Others:

ialidzhikov commented 4 years ago

I think the root cause is https://github.com/gardener/machine-controller-manager/issues/508

prashanth26 commented 4 years ago

Hi Ismail,

Another reason why this is probably occurring is that the force-delete label has been set by Gardener on machine termination for shoot hibernation, and this was something that was enabled with a recent change to avoid machine drains failing due to PDB violations. With setting of force deletion, the drain logic is triggered but with a much shorter timeout of 1m, and probably this operation never really completes - https://github.com/gardener/machine-controller-manager/blob/master/pkg/controller/machine.go#L582-L589. We could also think about increasing this timeout under force deletion.

cc: @amshuman-kr

amshuman-kr commented 4 years ago

@prashanth26 Would it be a good idea for MCM to wait (during drain) for volumeattachments deletion (instead of or in addition to checking the Node status?

ialidzhikov commented 4 years ago

Hi Ismail,

Another reason why this is probably occurring is that the force-delete label has been set by Gardener on machine termination for shoot hibernation, and this was something that was enabled with a recent change to avoid machine drains failing due to PDB violations. With setting of force deletion, the drain logic is triggered but with a much shorter timeout of 1m, and probably this operation never really completes - https://github.com/gardener/machine-controller-manager/blob/master/pkg/controller/machine.go#L582-L589. We could also think about increasing this timeout under force deletion.

force-delete was also my first guess. Last night I tried to reproduce this issue with a an mcm with fixed wait for CSI PVs. And the issue was again reproducible. I will first address gardener/machine-controller-manager#508, after that I will continue the investigation on this one.

ialidzhikov commented 4 years ago

And this issue is reproducible also for other providers using CSI.

prashanth26 commented 4 years ago

@prashanth26 Would it be a good idea for MCM to wait (during drain) for volumeattachments deletion (instead of or in addition to checking the Node status?

Yes, I think we can plan of having this call for disk detachment before deleting the machine. We already have such implementation in Azure - https://github.com/gardener/machine-controller-manager/blob/master/pkg/driver/driver_azure.go#L855-L858. Have you observed this issue with Azure @ialidzhikov ?

force-delete was also my first guess. Last night I tried to reproduce this issue with a an mcm with fixed wait for CSI PVs. And the issue was again reproducible. I will first address gardener/machine-controller-manager#508, after that I will continue the investigation on this one.

Okay, sounds good.

ialidzhikov commented 4 years ago

Yes, I think we can plan of having this call for disk detachment before deleting the machine. We already have such implementation in Azure - https://github.com/gardener/machine-controller-manager/blob/master/pkg/driver/driver_azure.go#L855-L858. Have you observed this issue with Azure @ialidzhikov ?

@prashanth26 , what you send is a wait until the disks are detached from the azure vm. I think what @amshuman-kr means is to have a wait until the corresponding volumeattachments.storage.k8s.io for the Node are deleted.

ialidzhikov commented 4 years ago

I can confirm that the issue is caused by force deletion of the machines during hibernation. As far as I verified and checked in the source of machine-controller-manager, when force-deletion=True is used mcm deletes the Pods and does not wait for Pod deletion to complete or PVs to be detached from the Node. During hibernation the Pod deletionTimestamp was deletionTimestamp: "2020-09-15T14:24:48Z", the ec2 instance termination timestamp on AWS side was 2020-09-15 14:24:48 GMT - they are both deleted/terminated in the same second. kube-controller-manager is unable to detect that the Pod is deleted, hence the attach-detach controller is unable to call Detach of the in-tree CSI plugin that is is responsible for setting deletionTimestamp on the corresponding VolumeAttachment obj. external-attacher is only watching for VolumeAttachment resources and calling the CSI driver. The whole flow of attach and detach is pretty well described in https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/container-storage-interface.md. And when the wake up happens, new VolumeAttachment is created for the new Node and Pod and the old one is orphaned.

Basically the same happens also when machine is terminated manually on cloud provider side. Or when Pod is deleted and kube-controller manager is down in the same time. The general issue here is that there is no garbage collection for VolumeAttachments in kube-controller-manager. The corresponding upstream issues are https://github.com/kubernetes/kubernetes/issues/77324 and https://github.com/kubernetes-csi/external-attacher/issues/215.

ialidzhikov commented 4 years ago

Any thoughts on how to proceed?

rfranzke commented 4 years ago

Can't we regularly terminate the machines (no force-delete but draining as usual) but without respecting PDBs/any accelerated timeouts (to make it fast)? cc @prashanth26

prashanth26 commented 4 years ago

Can't we regularly terminate the machines (no force-delete but draining as usual) but without respecting PDBs/any accelerated timeouts (to make it fast)? cc @prashanth26

Currently, the drain used in MCM is a slightly modified version (to incorporate serial drain) of the upstream drain. And this drain has logic to respect the PDBs. We can modify this logic to avoid doing that in case of force drain, but i am not sure if that would be a good idea.

But I do get the line of thinking, we probably need two versions of drain one for with force delete that only set's deletion timestamps on PVCs and pods. And then the normal one. I shall think along these lines and discuss with colleagues and get back to you.

@amshuman-kr @hardikdr - Any thoughts on this?

vlerenc commented 4 years ago

@amshuman-kr asked above a similar question I was asking in our sync today:

Would it be a good idea for MCM to wait (during drain) for volumeattachments deletion (instead of or in addition to checking the Node status?

Mine was more: Would it make sense to wait until all volumeattachments are gone, before we shutdown the control plane for good?

amshuman-kr commented 4 years ago

Can't we regularly terminate the machines (no force-delete but draining as usual) but without respecting PDBs/any accelerated timeouts (to make it fast)? cc @prashanth26

AFAIK, the force-delete still fires a delete on the pods (and thus ignoring PDBs) but doesn't wait for volumes to be detached.

https://github.com/gardener/machine-controller-manager/blob/b14447926e5c1172b1683dd25afddcaddc27824c/pkg/controller/drain.go#L447-L457

This is the part that we can change. I.e. wait for volumes to be detached (possibly we could also try the serial deletion (not eviction) during force-delete.

amshuman-kr commented 4 years ago

Mine was more: Would it make sense to wait until all volumeattachments are gone, before we shutdown the control plane for good?

Possibly. But there might be some edge cases here, we need to check if there could be volumes from other sources that don't get detached easily.

ialidzhikov commented 4 years ago

/priority critical

vlerenc commented 4 years ago

Thanks @ialidzhikov for considering that in the context.

ialidzhikov commented 4 years ago

Well, from all of the comments above it is not clear for me how we proceed? Do you think that forceful deletion on machine-controller-manager side needs to be adapted to wait also for volumes to detach? Or is it something that we need to tackle in gardenlet deleting all VolumeAttachments on hibernation (and probably later introduce garbage collection for VolumeAttachments in the shoot care control to handle also other cases in which this issue can occur and check whether we can contribute this to the upstream).

vlerenc commented 4 years ago

We could also think about increasing this timeout under force deletion.

Indeed, @prashanth26 , that would have been a quick win. It's a bandaid as whatever time we define, we may be left with orphans if the attach-detach-controller is down at an inconvenient time, missing the event. Although, on hibernation we force the problem onto us because we shut down the CSI controller and that attach-detach-controller, which makes it a special case. @ialidzhikov has therefore prepared a mitigation with https://github.com/gardener/gardener/pull/2963. Thanks!

The entire situation is a monstrous bug. The whole point of the controller idea with desired/actual state is that shit like this doesn't happen. That's edge- and not level-triggered and you found already the issues https://github.com/kubernetes/kubernetes/issues/77324 and https://github.com/kubernetes-csi/external-attacher/issues/215 which document this. Thanks!

But I do get the line of thinking, we probably need two versions of drain one for with force delete that only set's deletion timestamps on PVCs and pods. And then the normal one. I shall think along these lines and discuss with colleagues and get back to you.

In general (regardless of this problem), yes. Once we make the decision to forcefully delete, the node should be either broken or unwanted and then all gloves are off, we neither respect PDBs nor want to drain pods-with-volumes serially.

I write "regardless", because this is a very special bug here and can happen with any drain logic. If the attach-detach-controller is down or restarting while the event is issued, it's not cleaning up on restart.

Possibly. But there might be some edge cases here, we need to check if there could be volumes from other sources that don't get detached easily.

@amshuman-kr Wasn't that also your kind of proposal when you wrote yourself "would it be a good idea for MCM to wait (during drain) for volumeattachments deletion"? What kind of volumes would that be and can we forcefully detach all volumes another way?

Then again, I think we both were not right. After rereading this thread and bug description, we would rather need an independent remedy controller (cc @dkistner @stoyanr @AndreasBurger ) that cleans up behind this messy CSI controller until we get the bug fix (shouldn't be too difficult, checking regularly nodes vs. volume attachments and deleting orphaned volumes attachments; actually, sounds like a quick one).

The question is, whether it's worth it, now that @ialidzhikov put the hibernation bandaid on. It really depends on how often the CSI controller fails/restarts. While I dislike unsafe code like this (very much), the community was alerted and will eventually provide a fix. Of course, contributing this fix is the better alternative to another remedy, because this one (unlike the other issues we battle in our remedy controller), can be controlled and contained upstream.

I suggest to close or move this ticket. It's neither related to GCP nor to MCM. At best, it could become a hibernation Gardener issue that was mitigated by @ialidzhikov change for now. We either contribute the upstream fix or open another ticket at our remedy controller https://github.com/gardener/remedy-controller/issues.

prashanth26 commented 4 years ago

I suggest to close or move this ticket. It's neither related to GCP nor to MCM. At best, it could become a hibernation Gardener issue that was mitigated by @ialidzhikov change for now.

Okay, Vedran. So we do nothing on the MCM for now thanks to @ialidzhikov 's changes.

However, we could try to improve the force-deletion (in turn force-drain) on MCM eventually to try the serial deletion (not eviction) by setting the deletionTimeStamps on pods with PVs as suggested by @amshuman-kr .

amshuman-kr commented 4 years ago

Possibly. But there might be some edge cases here, we need to check if there could be volumes from other sources that don't get detached easily.

@amshuman-kr Wasn't that also your kind of proposal when you wrote yourself "would it be a good idea for MCM to wait (during drain) for volumeattachments deletion"? What kind of volumes would that be and can we forcefully detach all volumes another way?

@vlerenc Yes. What I meant was for MCM to not necessarily actively delete volumeattachments but at least to wait for them to be deleted before pronouncing that drain is complete. I am not a fan of the current approach of MCM extracting volume information from the node status which requires us to delegate a callback to the drivers to parse the volume IDs because the structure is provider specific. VolumeAttachment provides a neat way for MCM to avoid messing with providers and drivers to find out and wait for volumes attached to the node. But since this was introduced only in 1.19, we cannot get rid of the existing node status based logic. To me it makes sense to include in MCM's drain logic the volumeattachment-based logic in addition to the existing node status based logic so that eventually we can smoothly get rid of the node status logic if/when that becomes obsolete.

But this is clearly lower priority to the other point about serial deletion of pods with volume during force-deletion.

vlerenc commented 4 years ago

@amshuman-kr I find it hard to rely on volume attachments. As you said, this would be possible only with the latest releases and as we see here, we would depend and potentially hold back the node termination if the attach-detach-controller is buggy/offline. For serial pod eviction it may be solvable via timeouts, but as a remedy for the bug above, it wouldn't be if we may not leave any volume attachment. In the future though, it looks like a nice way to avoid IaaS-specific code as you mentioned. Then again, why is that code not in the driver that is IaaS-specific?

As for the serial deletion vs. eviction, in which cases do we use force-deletion? I assumed above, when we decide to force-delete, "all gloves are off" and that would mean, we can dump the pods. Is that "safely" so?

amshuman-kr commented 4 years ago

we would depend and potentially hold back the node termination if the attach-detach-controller is buggy/offline.

@vlerenc Absolutely. I meant when the functionality is more stable and reliable 🙂

In the future though, it looks like a nice way to avoid IaaS-specific code as you mentioned.

👍

Then again, why is that code not in the driver that is IaaS-specific?

It is in the driver. Both in the in-tree and out-of-tree code-lines. I meant that it would be nice to get rid of it from there too. One can keep wishing 😉

vlerenc commented 4 years ago

Ah, thanks. :-)

amshuman-kr commented 4 years ago

As for the serial deletion vs. eviction, in which cases do we use force-deletion? I assumed above, when we decide to force-delete, "all gloves are off" and that would mean, we can dump the pods. Is that "safely" so?

We do force-delete when first drain times out. @hardikdr @prashanth26 please tell us if there other cases where force-deletion is explicitly triggered by gardener/gardenlet/worker controller.

During force delete, all pods in the node are deleted in parallel without waiting for any relevant volumes to be detached. This way at least the pods will get the deletionTimestamp but the drain might complete without waiting for the pods to go away and/or the relevant volumes to be detached. This might cause the shoot control-plane to shutdown during hibernation too early. But this has now been addressed by @ialidzhikov in https://github.com/gardener/gardener/pull/2963 for the hibernation scenario.

But technically similar problems might arise during regular node rollout and the cluster uses CSI volumes. So, addressing it at MCM might still makes sense. Perhaps it does indeed make sense to fire delete on volumeattachments in MCM before pronouncing that drain is complete, especially in the force-deletion case. Serial deletion during force-deletion is relevant only tangentially to babysit bad volume detachment implementations.

prashanth26 commented 4 years ago

We do force-delete when first drain times out. @hardikdr @prashanth26 please tell us if there other cases where force-deletion is explicitly triggered by gardener/gardenlet/worker controller.

I shall go through the drain logic code and look at all possible cases and get back on this one.

Perhaps it does indeed make sense to fire delete on volumeattachments in MCM before pronouncing that drain is complete.

Just getting into the details on this. Firing a delete call on a PVC used by a pod/statefulset will not work right? We might have to rather set the deletion timestamp on the pod, and then the new pod (if backed by a statefulset) will not be rescheduled on this machine as the node is cordoned.

hardikdr commented 4 years ago

Just getting into the details on this. Firing a delete call on a PVC used by a pod/statefulset will not work right?

MCM should never delete a PVC/PV ever IMO.

hardikdr commented 4 years ago

I overall agree with the latest discussions above between Vedran and Amshu:

MCM could in the future rely more on volumeAttachments for its drain and serialized-eviction, as we see Node.Status is a bit unreliable/flapping volume-entries, though that can be mitigated by the small timeouts. But then attach-detach controller also doesn't seem fully stable, it can hold back the deletion of the machine, if offline/buggy.
There's still a possibility of orphan volumeAttachment during node-rollout or unhealthy-machine replacement, we can possibly delete volumeAttachment during drain. This also though need a bit of due diligence, making sure of the current nature and future-plans of volumeAttachments, that MCM doesn't end up deleting something volume-related unnecessarily. A forceful deletion of machine from console or node-object may still leave orphan VAs though.

On top of everything, the best solution is the fix in upstream, where such volumeAttachments are simply garbage-collected, which anyways hopefully will be done by the community. :)

amshuman-kr commented 4 years ago

Just getting into the details on this. Firing a delete call on a PVC used by a pod/statefulset will not work right? We might have to rather set the deletion timestamp on the pod, and then the new pod (if backed by a statefulset) will not be rescheduled on this machine as the node is cordoned.

We should never delete PV/PVC. I meant deletion of only VolumeAttachment which represents only the attachment of the volumes to the node and not the volumes themselves.

ialidzhikov commented 3 years ago

FYI there is https://github.com/kubernetes/kubernetes/pull/96617. I now validated that the upstream PR would also fix the dangling VolumeAttachments after hibernation (that we were leaving before the mitigation in gardener/gardener#2963). On wake up such dangling VolumeAttachments will be garbage collected after maxWaitForUnmountDuration=6m0s. I believe this still may cause minor issues as starting from external-provisioner@v2.0.0 it is not possible to attach if there is already existing VolumeAttachment - see kubernetes-csi/external-provisioner#438. So if we remove the workaround from gardener/gardener#2963 for clusters that have the upstream fix kubernetes/kubernetes#96617, I think there will be still a small disruption on wake up until the dangling VolumeAttachments are garbage collected, and the new VolumeAttachments are processed. So do you see area for improvements in machine-controller-manager/provider-extensions wrt to force deletion (or removing the force deletion of machines during hibernation) and the current handling in mcm that simply deletes the Pods but does not wait volumes to be detached on force deletion? It is kind of odd and not what you expect as usual flow.

prashanth26 commented 3 years ago

So do you see area for improvements in machine-controller-manager/provider-extensions wrt to force deletion (or removing the force deletion of machines during hibernation) and the current handling in mcm that simply deletes the Pods but does not wait volumes to be detached on force deletion? It is kind of odd and not what you expect as usual flow.

Yes, you are right. It is kind of odd to me as well that machines don't wait for volume detachment on hibernating clusters. However, at the same time, the definition of force deletion - would mean for MCM to not wait IMO (open for debate) as if we start waiting this could potentially lead to failures.

However, yes I do think of. a way where we can improve this. How about we update the hiberation logic to avoid setting of force delete labels on machines, and rather just set smaller drain timeouts, reduce them to say 10mins from the current 2hours? This would mean thaat MCM tries for 10mins for detachment and post that force deletes it. And worst case k/k will take care of it as mentioned above.

Just thinking out loud here, feel free to correct me.

amshuman-kr commented 3 years ago

How about we update the hiberation logic to avoid setting of force delete labels on machines, and rather just set smaller drain timeouts, reduce them to say 10mins from the current 2hours?

I agree that the force delete terminology is no longer precise. In MCM, it is taken to simply mean just delete machines without waiting for anything but in gardener there are many cases where it is used but the use-case is slightly different. I think we should list the cases for worker pool scale-down (or generally machine deletion) and their variations. Here is what comes to my mind.

Machine replacement by machineset controller in MCM
Scale down by cluster-autoscaler.
Rolling update of nodes
Shoot hibernation
Normal shoot deletion
Shoot purge due to some infra or technical issues (force deletion again in another context )
Shoot purge due to account/credentials issues (force deletion again in another context )

I would think that the force deletion of machines without waiting for anything should be only done in the case of 6 or if the cases 1-5 timeout and we do the force deletion only on the retry of the drain after the timeout). There is no VM to be deleted in the case of 7 anyway.

gardener / gardener-extension-provider-gcp

VolumeAttachments are orphaned on hibernation #172