Certain nodes cannot attach Trident volumes

Describe the bug

We occasionally have a problem that specific nodes cannot attach Trident volumes. Once it happens, it never recovers until recreating Trident Node pods.

In this situation, pods with Trident volumes get stuck in the ContainerCreating state with the following error.

Warning FailedAttachVolume 14s (x22 over 29m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-310c1228-7477-4fe0-8601-e85365d42d10" : rpc error: code = NotFound desc = node demo-sts-126-w-default-4fc78c0c-hl8kj was not found

After the investigation, we found the following things.

The issue occurs after performing a rolling update of a Kubernetes cluster.
- Our cluster update is done by creating and deleting nodes one by one in our environment.
- A taint evicts pods before node deletion.
Two trident-controller pods could be running simultaneously.
- A new trident-controller pod is created once the old one gets in the terminating state, so we may have two trident-controller pods for a short time.
- Though the trident-controller deployment's strategy is Recreate, it does not prevent having the two pods when eviction.
The logs Added a new node for the affected nodes were found in the old trident-controller pod.

Having two trident-controller pods might be the cause of the problem. Some nodes might register to the old trident-controller, so the new trident-controller does not know those nodes, causing the not found error.

Environment Provide accurate information about the environment to help us reproduce the issue.

Trident version: 23.01.0
Trident installation flags used: silenceAutosupport: true (Trident Operator)
Container runtime: containerd://1.6.16
Kubernetes version: v1.26.1
Kubernetes orchestrator: Kubernetes
Kubernetes enabled feature gates:
OS: Ubuntu 20.04.5 LTS
NetApp backend types: ONTAP AFF 9.7P13
Other:

To Reproduce

We can confirm that having multiple trident-controller pods causes the issue by the following steps.

Setup Trident with trident-operator v23.01.0
Remove the trident-operator deployment

kubectl delete -n trident deploy/trident-operator

Set the trident-controller replicas to 5

We manually increase the replicas of the trident-controller deployment since it is difficult to reproduce this issue due to a timing issue.

kubectl scale -n trident --replicas 5 deploy/trident-controller
kubectl rollout status -n trident deploy/trident-controller

Add some new nodes to the cluster

New nodes will register to one of the five trident-controller pods.

$ stern -n trident -c trident-main --include "Added a new node" --only-log-lines deployment/trident-controller
trident-controller-5d4f5fcdd7-5s56p trident-main time="2023-03-06T05:20:04Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=demo-tkusumi125-w-default-c1cb56a6-qp6gw requestID=6881b720-403e-4890-95f1-e114a1bdca4b requestSource=REST
trident-controller-5d4f5fcdd7-45s5w trident-main time="2023-03-06T05:20:00Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=demo-tkusumi125-w-default-c1cb56a6-2pmdf requestID=b5dc1928-5c80-4b3a-9671-652ffc43c0b2 requestSource=REST

Set the trident-controller replicas to 1

kubectl scale -n trident --replicas 1 deploy/trident-controller

Create a StatefulSet with a Trident volume

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: attach-test
spec:
  replicas: 5 # need to adjust
  podManagementPolicy: "Parallel"
  selector:
    matchLabels:
      app: attach-test
  serviceName: attach-test
  template:
    metadata:
      labels:
        app: attach-test
    spec:
      containers:
      - command:
        - sh
        - -c
        - trap exit INT TERM; sleep infinity & wait
        image: busybox:latest
        name: container1
        volumeMounts:
        - mountPath: /mnt/pvc
          name: pvc
  volumeClaimTemplates:
  - metadata:
      name: pvc
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: ontap-block

Some of the pods will likely get stuck in the ContainerCreating state.

$ kubectl get pods -o wide
NAME            READY   STATUS              RESTARTS   AGE     IP            NODE                                       NOMINATED NODE   READINESS GATES
attach-test-0   0/1     ContainerCreating   0          5m48s   <none>        demo-tkusumi125-w-default-c1cb56a6-2pmdf   <none>           <none>
attach-test-1   0/1     ContainerCreating   0          5m48s   <none>        demo-tkusumi125-w-default-c1cb56a6-qp6gw   <none>           <none>
attach-test-2   1/1     Running             0          5m48s   10.26.12.30   demo-tkusumi125-w-default-c1cb56a6-l6d6v   <none>           <none>
attach-test-3   1/1     Running             0          5m48s   10.26.9.27    demo-tkusumi125-w-default-c1cb56a6-sthsp   <none>           <none>
attach-test-4   1/1     Running             0          5m48s   10.26.13.31   demo-tkusumi125-w-default-c1cb56a6-4xnlh   <none>           <none>

$ kubectl describe pod attach-test-0
...
  Warning  FailedAttachVolume  113s (x10 over 6m8s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-bf76d317-14e3-45cd-864b-35e77c55ebd4" : rpc error: code = NotFound desc = node demo-tkusumi125-w-default-c1cb56a6-2pmdf was not found

Expected behavior

All nodes can attach Trident volumes.

Additional context

I checked how long the terminating state of the trident controller lasted when the issue occurred. In the below sample, it took 22 seconds to complete the shutdown. I guess the longer it takes, the more often the issue occurs.

time	log
2023-03-03T06:59:25Z	Shutting down.
2023-03-03T06:59:47Z	HTTPS REST frontend server has closed.

time="2023-03-03T06:59:25Z" level=info msg="Shutting down."
time="2023-03-03T06:59:25Z" level=info msg="Deactivating K8S helper frontend."
time="2023-03-03T06:59:25Z" level=info msg="Deactivating CSI frontend." requestID=fd8bd23a-ce2e-40a6-a04f-d913f5c3efd0 requestSource=Internal
time="2023-03-03T06:59:25Z" level=info msg="Deactivating CRD frontend."
time="2023-03-03T06:59:25Z" level=info msg="Shutting down workers."
time="2023-03-03T06:59:25Z" level=debug msg="TridentCrdController#processNextWorkItem shutting down" logSource=trident-crd-controller requestID=0b1d6aa4-03c7-4ed0-8687-379b7d54ad92 requestSource=CRD
time="2023-03-03T06:59:39Z" level=debug msg="Authenticated by HTTPS REST frontend." peerCert=trident-node
time="2023-03-03T06:59:39Z" level=debug msg="REST API call received." Duration="21.217µs" Method=PUT RequestURL=/trident/v1/node/demo-sts-126-w-default-838e9f9a-2b9b2 Route=AddOrUpdateNode requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=debug msg="Found requested frontend." name=k8s_csi_helper requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=debug msg=GetNode nodeName=demo-sts-126-w-default-838e9f9a-2b9b2 requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=info msg="Determined topology labels for node: map[topology.kubernetes.io/zone:***********************]" node=demo-sts-126-w-default-838e9f9a-2b9b2 requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=debug msg="Node event." eventType=Normal message="[NFS iSCSI] detected on host." name=demo-sts-126-w-default-838e9f9a-2b9b2 reason=TridentServiceDiscovery requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=debug msg=GetNode nodeName=demo-sts-126-w-default-838e9f9a-2b9b2 requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=demo-sts-126-w-default-838e9f9a-2b9b2 requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=debug msg="REST API call complete." Duration=20.850432ms Method=PUT RequestURL=/trident/v1/node/demo-sts-126-w-default-838e9f9a-2b9b2 Route=AddOrUpdateNode StatusCode=201 requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:47Z" level=info msg="Stopping periodic node access reconciliation service." requestID=94c34b07-0ba2-4909-b8b5-076b56159b66 requestSource=Periodic
time="2023-03-03T06:59:47Z" level=debug msg="Transaction monitor stopped."
time="2023-03-03T06:59:47Z" level=debug msg="Transaction monitor stopped." requestID=a88775dc-0eef-463f-934f-c2c6eecf93cf requestSource=Internal
time="2023-03-03T06:59:47Z" level=info msg="Deactivating metrics frontend." address=":8001"
time="2023-03-03T06:59:47Z" level=info msg="Metrics frontend server has closed." address=":8001"
time="2023-03-03T06:59:47Z" level=info msg="Deactivating HTTP REST frontend." address="127.0.0.1:8000"
time="2023-03-03T06:59:47Z" level=info msg="HTTP REST frontend server has closed." address="127.0.0.1:8000"
time="2023-03-03T06:59:47Z" level=info msg="Deactivating HTTPS REST frontend." address=":8443"
time="2023-03-03T06:59:47Z" level=info msg="HTTPS REST frontend server has closed." address=":8443"

@tksm, thanks for posting this additional information. I believe you are right that the Trident controller is taking longer to terminate than it did previously. The team is considering how to fix this issue.

@tksm This makes a lot of sense. We definitely dont support having multiple Trident controller pods running at the same time. I think fundamentally, this is a kubernetes issue where it's not waiting for the first pod to fully terminate before creating the new pod. It should be waiting since we have "Recreate" strategy, but perhaps k8s has a bug when it's an evicted pod.

A second k8s issue is that the trident-csi-controller Service is sending traffic to a terminating Pod. Possibly related to this https://github.com/kubernetes/kubernetes/issues/110171

I think on the Trident side, there are a couple straight forward mitigations we could implement to shorten the chance of this race condition.

Speed up Trident controller shutdown
Change the order of operations during shutdown to not accept any requests as early as possible (Add node in this case)

Another idea is to have the Trident Controller reconcile it's knowledge of nodes on a periodic basis, this may have other implications we would need to think through.

Another thing, according to this bug, https://github.com/kubernetes/kubernetes/issues/115819, k8s shouldnt even be waiting for the evicted trident controller to terminate.

@ameade Hi, thanks for the information. I believe the periodic node reconciliation is a fundamental solution, but these mitigations look good to lower the chance of the issue.

I think fundamentally, this is a kubernetes issue where it's not waiting for the first pod to fully terminate before creating the new pod. It should be waiting since we have "Recreate" strategy, but perhaps k8s has a bug when it's an evicted pod.

As far as I know, this is a specification of the "Recreate" strategy rather than a bug. This behavior is noted in the Recreate Deployment section as follows.

If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state).

@tksm ah I see, I guess the Pod eviction is falling under deleting the pod directly and not the deployment upgrade, which makes sense. I have a fix for what is slowing down the Trident shutdown, which should practically remove this race condition. Perhaps another mitigation is changing Trident controller to a statefulset.

Note: This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet.

NetApp / trident

Certain nodes cannot attach Trident volumes #811