aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

eksctl anywhere upgrade leaves drivers/controllers in broken state #896

Open smarsh-tim opened 2 years ago

smarsh-tim commented 2 years ago

What happened: After running a cluster upgrade command (which in my case replaced all of the worker nodes), the vsphere-csi-controller started throwing errors:

Error processing "csi-7f6a00400ffb52421987f76ade13928a8e3f5582144chd03a921e4b5b6b2bb30": failed to detach: rpc error: code = Internal desc = failed to find VirtualMachine for node:"dev-md-0-6f5f5c955-rghzs". Error: node wasn't fo

However, I can confirm that the PersistentVolumes successfully re-attached to the new worker nodes. My persistent workloads have no data loss post-upgrade.
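
A quick way to see both sides of this (a sketch; the CSI controller's namespace and container names may differ depending on how the driver was installed):

% kubectl get volumeattachments
% kubectl logs -n kube-system deployment/vsphere-csi-controller -c csi-attacher --tail=50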

Same with capi-controller-manager:

E1230 16:46:14.917830       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="dev-md-0-5b6bd949cd-qzpxs"
E1230 16:46:14.917864       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="dev-md-0-5b6bd949cd-qzpxs"
E1230 16:46:55.067653       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="could not find infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereMachine \"dev-worker-node-template-1640730669596-qsnqj\" for Machine \"dev-md-0-5b6bd949cd-xwnbv\" in namespace \"eksa-system\", requeuing: requeue in 30s" "controller"="machine" "name"="dev-md-0-5b6bd949cd-xwnbv" "namespace"="eksa-system"
E1230 16:47:10.156694       1 leaderelection.go:331] error retrieving resource lock capi-system/controller-leader-election-capi: etcdserver: leader changed
E1230 16:47:10.345013       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="could not find infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereMachine \"dev-worker-node-template-1640730669596-5rzbc\" for Machine \"dev-md-0-5b6bd949cd-bcwxw\" in namespace \"eksa-system\", requeuing: requeue in 30s" "controller"="machine" "name"="dev-md-0-5b6bd949cd-bcwxw" "namespace"="eksa-system"
I1230 16:47:11.602700 

This seems to result in a number of services experiencing errors like:

Error: error running manager: leader election lost

Seeing this with:

It seems that the old worker nodes are still being referenced.
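
The mismatch is easy to confirm by comparing the CAPI Machine objects with the actual nodes (plain kubectl; the namespace is taken from the log above):

% kubectl get machines -n eksa-system
% kubectl get nodes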

What you expected to happen: References to old worker nodes should be cleaned up, including VolumeAttachments.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster
  2. Assign a PersistentVolume to a workload
  3. Update the EKS Anywhere cluster spec to a new template (or something that would trigger a node replacement)
  4. Post-upgrade, observe the errors in the vsphere-csi-driver (see the command sketch below)
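
A rough CLI version of steps 3-4 (a sketch, assuming the cluster spec is in cluster.yaml; any spec change that forces new worker machines, e.g. pointing the VSphereMachineConfig at a new template, will do):

% eksctl anywhere upgrade cluster -f cluster.yaml
% kubectl get volumeattachments   # stale entries still reference the old node names
% kubectl logs -n kube-system deployment/vsphere-csi-controller -c csi-attacher --tail=50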

Anything else we need to know?: This seems to be similar to https://github.com/kubernetes-csi/external-attacher/issues/215

Environment:

smarsh-tim commented 2 years ago

It seems that what's causing the issue is the non-graceful/aggressive shutdown of the existing worker nodes before the volumes are able to detach: https://github.com/kubernetes/enhancements/pull/1116

But then the trade-off would be a longer cluster upgrade time.
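
For comparison, draining the outgoing node by hand before it is torn down gives the attacher time to detach cleanly (standard kubectl, nothing EKS-A specific; the node name is just the example from above):

% kubectl drain dev-md-0-6f5f5c955-rghzs --ignore-daemonsets --delete-emptydir-data
% kubectl get volumeattachments | grep dev-md-0-6f5f5c955-rghzs   # should empty out before the VM is deleted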

smarsh-tim commented 2 years ago

As a workaround for the stale VolumeAttachments, here's how to delete them without going through the CSI driver:

% kubectl get VolumeAttachments | grep dev-md-0-6f5f5c955-rghzs                                                            
csi-17ea8261329e1e8e85160cad351f24294ce52a59efd1a222a464cc73b4184518   csi.vsphere.vmware.com   pvc-4143115b-dbd2-4a19-a114-1e871e9b851e  dev-md-0-6f5f5c955-rghzs    true       30d
csi-7f6a00400ffb52429987f76ade13938a8e3f558214764d03a921e4c5b6b2bb30   csi.vsphere.vmware.com   pvc-200e1a3c-8974-4bca-882c-92ca1a0aa0e0  dev-md-0-6f5f5c955-rghzs    true       47d
csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32   csi.vsphere.vmware.com   pvc-a00a5328-93e1-4eaf-b7b6-ad2e5a362ee0  dev-md-0-6f5f5c955-rghzs    true       30d
csi-bfafbc70d94ab32121f4ba1374462c689d6e8546191410a034af5f5f35f4d076   csi.vsphere.vmware.com   pvc-2227fc30-f155-4458-89db-9dcbb81a5927  dev-md-0-6f5f5c955-rghzs    true       43h
% kubectl edit VolumeAttachment csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: dev-md-0-6f5f5c955-rghzs
  creationTimestamp: "2021-11-29T20:54:10Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2021-12-28T22:45:04Z"
  # DELETE finalizers:
  # DELETE - external-attacher/csi-vsphere-vmware-com
  name: csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32

After cleaning up the stale VolumeAttachments, the controllers stopped having leader election issues.
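
The same cleanup as a one-liner, if editing each object gets tedious (this simply clears the finalizer left by the external-attacher, so use it with care: it bypasses the normal detach path):

% kubectl patch volumeattachment csi-99d8a57b51da4169a51c75454411c51d8abd1ebc8f8f9d912f117b4f64338c32 \
    --type=merge -p '{"metadata":{"finalizers":null}}'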

g-gaston commented 2 years ago

Looking into this. Thanks @smarsh-tim !

g-gaston commented 2 years ago

We haven't gotten to this yet. I'm putting it in the backlog for further investigation.

echel0n commented 1 year ago

This issue is becoming more serious: I'm finding that pods randomly cannot be created, as I'm getting a 'volume is currently in use' error on container creation when upgrading deployment images as well.

Warning FailedAttachVolume 17m attachdetach-controller AttachVolume.Attach failed for volume "pvc-e1610bdf-5420-444e-99bd-a274642b66e7" : rpc error: code = Internal desc = failed to attach disk: "6b284a9a-582d-4ecc-bb94-bf49017e587a" with node: "192.168.22.161" err failed to attach cns volume: "6b284a9a-582d-4ecc-bb94-bf49017e587a" to node vm: "VirtualMachine:vm-5190 [VirtualCenterHost: vcenter.vsphere.xxx.xxx UUID: 421d6be9-800f-38ca-557d-4fa14b7895a9, Datacenter: Datacenter [Datacenter: Datacenter:datacenter-3, VirtualCenterHost: vcenter.vsphere.xxx.xxx]]". fault: "(*types.LocalizedMethodFault)(0xc000c4f8e0)({\n DynamicData: (types.DynamicData) {\n },\n Fault: (*types.ResourceInUse)(0xc000c6d140)({\n VimFault: (types.VimFault) {\n MethodFault: (types.MethodFault) {\n FaultCause: (*types.LocalizedMethodFault)(<nil>),\n FaultMessage: ([]types.LocalizableMessage) <nil>\n }\n },\n Type: (string) \"\",\n Name: (string) (len=6) \"volume\"\n }),\n LocalizedMessage: (string) (len=32) \"The resource 'volume' is in use.\"\n})\n". opId: "101159e7"
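
In that state it helps to find which VolumeAttachment still claims the volume (a sketch using the PV name from the event above; requires jq):

% kubectl get volumeattachments -o json | jq -r \
    '.items[]
     | select(.spec.source.persistentVolumeName == "pvc-e1610bdf-5420-444e-99bd-a274642b66e7")
     | "\(.metadata.name)  node=\(.spec.nodeName)  attached=\(.status.attached)"'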

echel0n commented 1 year ago

This is getting worse and worse each day. Every time a pod gets restarted, it's a 50/50 chance that it can create the container, because the volume shows as in use. It's getting to the point where EKS-A is not usable in a production setting. When will this be fixed?

echel0n commented 1 year ago

I'll also note that the workaround in this ticket only works if I also detach the CNS volume using the vCenter MOB API, but this is a very lengthy process that itself only works some of the time.
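
Since the MOB steps are hard to script, this at least lists every VolumeAttachment that points at a node which no longer exists, so they can be cleaned up in one pass (plain kubectl plus shell, nothing vSphere specific):

% kubectl get volumeattachments -o jsonpath='{range .items[*]}{.metadata.name} {.spec.nodeName}{"\n"}{end}' \
    | while read va node; do
        kubectl get node "$node" >/dev/null 2>&1 || echo "stale: $va (node $node is gone)"
      done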