I checked how long the terminating state of the trident controller lasted when the issue occurred. In the sample below, it took 22 seconds to complete the shutdown. I suspect that the longer the shutdown takes, the more likely the issue is to occur.
time | log |
---|---|
2023-03-03T06:59:25Z | Shutting down. |
2023-03-03T06:59:47Z | HTTPS REST frontend server has closed. |
time="2023-03-03T06:59:25Z" level=info msg="Shutting down."
time="2023-03-03T06:59:25Z" level=info msg="Deactivating K8S helper frontend."
time="2023-03-03T06:59:25Z" level=info msg="Deactivating CSI frontend." requestID=fd8bd23a-ce2e-40a6-a04f-d913f5c3efd0 requestSource=Internal
time="2023-03-03T06:59:25Z" level=info msg="Deactivating CRD frontend."
time="2023-03-03T06:59:25Z" level=info msg="Shutting down workers."
time="2023-03-03T06:59:25Z" level=debug msg="TridentCrdController#processNextWorkItem shutting down" logSource=trident-crd-controller requestID=0b1d6aa4-03c7-4ed0-8687-379b7d54ad92 requestSource=CRD
time="2023-03-03T06:59:39Z" level=debug msg="Authenticated by HTTPS REST frontend." peerCert=trident-node
time="2023-03-03T06:59:39Z" level=debug msg="REST API call received." Duration="21.217µs" Method=PUT RequestURL=/trident/v1/node/demo-sts-126-w-default-838e9f9a-2b9b2 Route=AddOrUpdateNode requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=debug msg="Found requested frontend." name=k8s_csi_helper requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=debug msg=GetNode nodeName=demo-sts-126-w-default-838e9f9a-2b9b2 requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=info msg="Determined topology labels for node: map[topology.kubernetes.io/zone:***********************]" node=demo-sts-126-w-default-838e9f9a-2b9b2 requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=debug msg="Node event." eventType=Normal message="[NFS iSCSI] detected on host." name=demo-sts-126-w-default-838e9f9a-2b9b2 reason=TridentServiceDiscovery requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=debug msg=GetNode nodeName=demo-sts-126-w-default-838e9f9a-2b9b2 requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=demo-sts-126-w-default-838e9f9a-2b9b2 requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:39Z" level=debug msg="REST API call complete." Duration=20.850432ms Method=PUT RequestURL=/trident/v1/node/demo-sts-126-w-default-838e9f9a-2b9b2 Route=AddOrUpdateNode StatusCode=201 requestID=e48a0467-4977-4611-87eb-33308f19ed5c requestSource=REST
time="2023-03-03T06:59:47Z" level=info msg="Stopping periodic node access reconciliation service." requestID=94c34b07-0ba2-4909-b8b5-076b56159b66 requestSource=Periodic
time="2023-03-03T06:59:47Z" level=debug msg="Transaction monitor stopped."
time="2023-03-03T06:59:47Z" level=debug msg="Transaction monitor stopped." requestID=a88775dc-0eef-463f-934f-c2c6eecf93cf requestSource=Internal
time="2023-03-03T06:59:47Z" level=info msg="Deactivating metrics frontend." address=":8001"
time="2023-03-03T06:59:47Z" level=info msg="Metrics frontend server has closed." address=":8001"
time="2023-03-03T06:59:47Z" level=info msg="Deactivating HTTP REST frontend." address="127.0.0.1:8000"
time="2023-03-03T06:59:47Z" level=info msg="HTTP REST frontend server has closed." address="127.0.0.1:8000"
time="2023-03-03T06:59:47Z" level=info msg="Deactivating HTTPS REST frontend." address=":8443"
time="2023-03-03T06:59:47Z" level=info msg="HTTPS REST frontend server has closed." address=":8443"
@tksm, thanks for posting this additional information. I believe you are right that the Trident controller is taking longer to terminate than it did previously. The team is considering how to fix this issue.
@tksm This makes a lot of sense. We definitely don't support having multiple Trident controller pods running at the same time. I think fundamentally, this is a Kubernetes issue where it's not waiting for the first pod to fully terminate before creating the new pod. It should be waiting since we have the "Recreate" strategy, but perhaps k8s has a bug when it's an evicted pod.
A second k8s issue is that the trident-csi-controller Service is sending traffic to a terminating Pod, possibly related to https://github.com/kubernetes/kubernetes/issues/110171.
I think there are a couple of straightforward mitigations we could implement on the Trident side to reduce the chance of this race condition.
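For example (a purely illustrative sketch, not Trident's actual code): the controller could flip a "shutting down" gate as soon as SIGTERM arrives, so it fails its readiness check and rejects any late node registrations while it drains with a hard deadline. The `/readyz` path, the plain-HTTP listener on `:8443`, and the 5-second drain budget below are all assumptions made for the sketch.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown atomic.Bool

	mux := http.NewServeMux()

	// Hypothetical readiness endpoint; the real probe path may differ.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	// Stand-in for the node registration route seen in the logs above.
	mux.HandleFunc("/trident/v1/node/", func(w http.ResponseWriter, r *http.Request) {
		if shuttingDown.Load() {
			// Refuse late registrations instead of absorbing them into
			// state that is about to be discarded.
			http.Error(w, "controller is terminating", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusCreated)
	})

	srv := &http.Server{Addr: ":8443", Handler: mux}
	go func() { _ = srv.ListenAndServe() }()

	// On SIGTERM, flip the gate first, then drain with a hard deadline so
	// the terminating pod cannot linger for tens of seconds.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig
	shuttingDown.Store(true)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}
```

With a gate like this, an `AddOrUpdateNode` call like the one in the logs above would be refused by the old pod instead of being recorded in state that is about to disappear.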
Another idea is to have the Trident controller reconcile its knowledge of nodes on a periodic basis; this may have other implications we would need to think through.
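Roughly, such a reconcile loop could look like the sketch below (not Trident code; the `nodeRegistry` type and the 5-minute interval are invented for illustration). It lists the cluster's nodes with client-go on a timer and re-registers any node the controller does not know about:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// nodeRegistry stands in for the controller's in-memory knowledge of nodes.
type nodeRegistry struct {
	mu    sync.Mutex
	known map[string]bool
}

func (r *nodeRegistry) has(name string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.known[name]
}

func (r *nodeRegistry) add(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.known[name] = true
}

// reconcileNodes periodically lists the cluster's nodes and re-registers any
// that the controller does not know about.
func reconcileNodes(ctx context.Context, cs kubernetes.Interface, reg *nodeRegistry, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
			if err != nil {
				log.Printf("node reconcile: list failed: %v", err)
				continue
			}
			for _, n := range nodes.Items {
				if !reg.has(n.Name) {
					log.Printf("node reconcile: registering missing node %s", n.Name)
					reg.add(n.Name) // the real node-add path would go here
				}
			}
		}
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	reg := &nodeRegistry{known: map[string]bool{}}

	// Blocks; a real controller would run this in a goroutine.
	reconcileNodes(context.Background(), cs, reg, 5*time.Minute)
}
```

The real node-add path would replace `reg.add`, but the shape is the point: even if a registration lands on a replica that is about to die, the surviving controller eventually learns about the node.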
Another thing: according to this bug, https://github.com/kubernetes/kubernetes/issues/115819, k8s shouldn't even be waiting for the evicted trident controller to terminate.
@ameade Hi, thanks for the information. I believe periodic node reconciliation is the fundamental solution, but these mitigations also look good for lowering the chance of hitting the issue.
> I think fundamentally, this is a Kubernetes issue where it's not waiting for the first pod to fully terminate before creating the new pod. It should be waiting since we have the "Recreate" strategy, but perhaps k8s has a bug when it's an evicted pod.
As far as I know, this is the documented behavior of the "Recreate" strategy rather than a bug. It is noted in the Recreate Deployment section of the Kubernetes documentation as follows:
> If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state).
@tksm Ah, I see. I guess Pod eviction falls under deleting the Pod directly rather than a Deployment upgrade, which makes sense. I have a fix for what is slowing down the Trident shutdown, which should practically remove this race condition. Perhaps another mitigation is changing the Trident controller to a StatefulSet:
> Note: This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet.
**Describe the bug**
We occasionally have a problem where specific nodes cannot attach Trident volumes. Once it happens, it never recovers until the Trident node pods are recreated.
In this situation, pods with Trident volumes get stuck in the ContainerCreating state with the following error.
After investigating, we found the following:

- `Added a new node` log entries for the affected nodes were found in the old trident-controller pod.
- Having two trident-controller pods might be the cause of the problem: some nodes may register with the old trident-controller, so the new trident-controller does not know about those nodes, causing the "not found" error.
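As a toy illustration of this suspected mechanism (not Trident's actual code), consider two controller replicas that each keep an independent in-memory view of registered nodes:

```go
package main

import "fmt"

// controller is a stand-in for a trident-controller replica with its own
// in-memory view of registered nodes.
type controller struct {
	name  string
	nodes map[string]bool
}

func newController(name string) *controller {
	return &controller{name: name, nodes: map[string]bool{}}
}

func (c *controller) addNode(node string) {
	c.nodes[node] = true
	fmt.Printf("%s: Added a new node: %s\n", c.name, node)
}

func (c *controller) attachVolume(node string) error {
	if !c.nodes[node] {
		return fmt.Errorf("%s: node %s was not found", c.name, node)
	}
	return nil
}

func main() {
	oldCtrl := newController("old-trident-controller") // terminating, but still reachable via the Service
	newCtrl := newController("new-trident-controller")

	// The node registration is routed to the old replica.
	oldCtrl.addNode("demo-sts-126-w-default-838e9f9a-2b9b2")

	// Attach requests are later handled by the new replica, which never saw the node.
	if err := newCtrl.attachVolume("demo-sts-126-w-default-838e9f9a-2b9b2"); err != nil {
		fmt.Println("attach fails:", err)
	}
}
```

This matches the symptoms: the `Added a new node` log appears only in the old pod, and the new pod later fails attach requests for that node with a "not found" error.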
**Environment**
- `silenceAutosupport: true` (Trident Operator)

**To Reproduce**
We can confirm that having multiple trident-controller pods causes the issue with the following steps. Because the race is hard to hit naturally due to its timing dependence, we manually increase the replicas of the trident-controller deployment (a programmatic version of this step is sketched after the list).

1. Set up Trident with trident-operator v23.01.0.
2. Remove the trident-operator deployment so that it does not reconcile the trident-controller deployment back to its original spec.
3. Manually increase the replicas of the trident-controller deployment to five.
4. New nodes will register with one of the five trident-controller pods.
5. Some of the pods with Trident volumes will likely get stuck in the ContainerCreating state.
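For reference, the replica bump in step 3 can be done with `kubectl scale` or programmatically. Below is a client-go sketch; the `trident` namespace is an assumption based on a default operator install, and the deployment name follows the `trident-controller` name used above.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location; adjust for your environment.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Namespace "trident" is assumed; the deployment name matches the one
	// discussed in this issue.
	deployments := cs.AppsV1().Deployments("trident")
	scale, err := deployments.GetScale(context.Background(), "trident-controller", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	scale.Spec.Replicas = 5
	if _, err := deployments.UpdateScale(context.Background(), "trident-controller", scale, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("scaled trident-controller to 5 replicas")
}
```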
**Expected behavior**
All nodes can attach Trident volumes.
**Additional context**