NetApp / trident

Storage orchestrator for containers
Apache License 2.0
762 stars 222 forks source link

Astra Trident operator fails to recognise version because of a too short timeout for the version pod deployment #939

Open robinvalk opened 4 weeks ago

robinvalk commented 4 weeks ago

Describe the bug We installed the astra trident operator on our dev cluster using the Helm chart. We noticed we installed an older chart version so we upgraded to the latest release. Once the new operator pod started it wanted to pull the installation version via some version docker container. It starts a pod with this container and waits for it to come online so it can run a command in the pod to export version details in yaml format.

The problem is it doesn't wait longer than 30 seconds for this pod to show up. After the 30 seconds it concludes that the pod failed to start and it triggers a pod termination process + it queues a retry job. In our environment this pod can take longer than 30 second to start up. There can be many reasons for it to take a bit longer than 30 seconds, the CNI can take a bit before it hands out a pod IP, the container image needs to be pulled via a slow HTTP proxy, etc...

Luckily after retrying many times the version is determined at some point and the install / upgrade procedure continues. This is fine in the dev environment but if we want to run this operator in production it would be nice not to depend on a race condition.

Can this logic be updated so the timeout can be configured?

See "Additional context" for a snapshot of the logs. These logs we see repeatedly for every retry attempt.

Environment Provide accurate information about the environment to help us reproduce the issue.

To Reproduce Steps to reproduce the behavior: As mentioned above we noticed the behaviour after upgrading to a new version. But later we also saw it happening with a fresh install. There is some logic to trigger the operator to check the configured version. Not sure exactly when it's triggered.

Expected behavior A clear and concise description of what you expected to happen: We didn't expect it to terminate the version pod if it doesn't come online. In general running a new pod to get the version out of it is odd behaviour. The pod is started with a version tag so the version is already known 😅, not sure what other information is required.

All in all it should be a little more generous when checking the version. A couple of options come to mind:

Additional context

time="2024-10-22T12:41:30Z" level=error msg="Trident version pod was unable to provide information after 30.00 seconds; err: exec error; unable to upgrade connection: container not found (\"trident-main\")"                                                                                          [825/1824]
time="2024-10-22T12:41:30Z" level=error msg="failed to exec Trident version pod 'transient-trident-version-pod' (image: 'docker.io/netapp/trident:24.06.1') for the information; err: exec error; unable to upgrade connection: container not found (\"trident-main\")"                                           
time="2024-10-22T12:41:30Z" level=debug msg="Updating Trident Orchestrator CR after failed installation." tridentOrchestratorCR=trident                                                                                                                                                                           
time="2024-10-22T12:41:30Z" level=info msg="New status is same as the old status, no update needed." tridentOrchestratorCR=trident                                                                                                                                                                                
time="2024-10-22T12:41:30Z" level=error msg="error syncing 'netapp/trident-node-linux': error re-installing Trident 'trident' ; err: reconcile failed; unable to get Trident image version information, please ensure correct Trident image has been provided; err: failed to exec Trident version pod 'transient-
trident-version-pod' (image: 'docker.io/netapp/trident:24.06.1') for the information; err: exec error; unable to upgrade connection: container not found (\"trident-main\"), requeuing"                                                                                                                           
time="2024-10-22T12:41:30Z" level=info msg=-------------------------------------------------                                                                                                                                                                                                                      
time="2024-10-22T12:41:30Z" level=info msg=-------------------------------------------------                                                                                                                                                                                                                      
time="2024-10-22T12:41:30Z" level=error msg="error syncing 'netapp/trident-node-linux': error re-installing Trident 'trident' ; err: reconcile failed; unable to get Trident image version information, please ensure correct Trident image has been provided; err: failed to exec Trident version pod 'transient-
trident-version-pod' (image: 'docker.io/netapp/trident:24.06.1') for the information; err: exec error; unable to upgrade connection: container not found (\"trident-main\"), requeuing"                                                                                                                           
time="2024-10-22T12:41:30Z" level=info msg="An operator based Trident CSI deployment was found." deploymentName=trident-controller deploymentNamespace=netapp                                                                                                                                                     
time="2024-10-22T12:41:30Z" level=debug msg="Found atleast one CSI Trident deployment created by the operator" deploymentNamespace=netapp                                                                                                                                                                         
time="2024-10-22T12:41:30Z" level=info msg="Reconciler found Trident installation." callingCRName=trident callingResourceType=resourceTridentOrchestratorCR controllingCRBasedOnStatus=trident namespace=netapp operatorBasedCSIDeployments="[trident-controller]"                                                
time="2024-10-22T12:41:30Z" level=debug msg="Identifying controlling CRs from the list of all the CRs."                                                                                                                                                                                                           
time="2024-10-22T12:41:30Z" level=info msg="Found CR that controls current Trident deployment." name=trident                                                                                                                                                                                                      
time="2024-10-22T12:41:30Z" level=debug msg="Controlling CR identified." controllingCR=trident                                                                                                                                                                                                                    
time="2024-10-22T12:41:30Z" level=debug msg="Performing reconcile." controllingCR=trident                                                                                                                                                                                                                         
time="2024-10-22T12:41:30Z" level=debug msg="Initialized installer." namespace=netapp                                                                                                                                                                                                                             
time="2024-10-22T12:41:30Z" level=info msg="A Trident deployment was found by label" deployment=trident-controller namespace=netapp                                                                                                                                                                               
time="2024-10-22T12:41:30Z" level=debug msg="Waiting for 7s after the patch to make sure we get the right trident-pod name."                                                                                                                                                                                      
time="2024-10-22T12:41:37Z" level=info msg="Waiting for Trident pod to start."                                                                                                                                                                                                                                    
time="2024-10-22T12:41:37Z" level=info msg="Trident pod started." namespace=netapp pod=trident-controller-69fb5fcb45-jxzkr                                                                                                                                                                                        
time="2024-10-22T12:41:37Z" level=debug msg="Invoking tunneled command: 'tridentctl -s 127.0.0.1:8000 version -o json'" requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                      
time="2024-10-22T12:41:37Z" level=debug msg="Kubernetes version is supported." version=v1.29.6                                                                                                                                                                                                                    
time="2024-10-22T12:41:38Z" level=debug msg="Initialized installer." namespace=netapp                                                                                                                                                                                                                             
time="2024-10-22T12:41:38Z" level=info msg="A Trident deployment was found by label" deployment=trident-controller namespace=netapp                                                                                                                                                                               
time="2024-10-22T12:41:38Z" level=debug msg="Trident deployment exists."                                                                                                                                                                                                                                          
time="2024-10-22T12:41:38Z" level=debug msg="Current Trident installation image is not same as the new Trident Image." currentTridentImage="docker.io/netapp/trident:23.04.0" newTridentImage="docker.io/netapp/trident:24.06.1"                                                                                  
time="2024-10-22T12:41:38Z" level=debug msg="Current Trident deployment image '23.04.0' is not same as the supported Trident image '24.06.1'."                                                                                                                                                                    
time="2024-10-22T12:41:38Z" level=debug msg="Image version check needed."                                                                                                                                                                                                                                         
time="2024-10-22T12:41:38Z" level=info msg="A Trident service account found by label." namespace=netapp serviceAccount=trident-controller                                                                                                                                                                         
time="2024-10-22T12:41:38Z" level=info msg="A Trident service account found by label." namespace=netapp serviceAccount=trident-node-linux                                                                                                                                                                         
time="2024-10-22T12:41:38Z" level=debug msg="Patching Trident Service account." namespace=netapp serviceAccount=trident-controller                                                                                                                                                                                
time="2024-10-22T12:41:38Z" level=debug msg="Patched Trident service account."                                                                                                                                                                                                                                    
time="2024-10-22T12:41:38Z" level=debug msg="Patching Trident Service account." namespace=netapp serviceAccount=trident-node-linux                                                                                                                                                                                
time="2024-10-22T12:41:38Z" level=debug msg="Patched Trident service account."                                                                                                                                                                                                                                    
time="2024-10-22T12:41:38Z" level=info msg="A Trident cluster role found by label." clusterRole=trident-controller                                                                                                                                                                                                
time="2024-10-22T12:41:38Z" level=debug msg="Patching Trident Cluster role." clusterRole=trident-controller                                                                                                                                                                                                       
time="2024-10-22T12:41:38Z" level=debug msg="Patched Trident Cluster role."                                                                                                                                                                                                                                       
time="2024-10-22T12:41:38Z" level=info msg="A Trident cluster role binding was found by label." clusterRoleBinding=trident-controller                                                                                                                                                                             
time="2024-10-22T12:41:38Z" level=debug msg="Patching Trident Cluster role binding." clusterRoleBinding=trident-controller                                                                                                                                                                                        
time="2024-10-22T12:41:39Z" level=debug msg="Patched Trident Cluster role binding."                                                                                                                                                                                                                               
time="2024-10-22T12:41:39Z" level=info msg="A Trident role found by label." role=trident-controller                                                                                                                                                                                                               
time="2024-10-22T12:41:39Z" level=debug msg="Patching Trident role." role=trident-controller                                                                                                                                                                                                                      
time="2024-10-22T12:41:39Z" level=debug msg="Patched Kubernetes role." label="app=controller.csi.trident.netapp.io" role=trident-controller                                                                                                                                                                       
time="2024-10-22T12:41:39Z" level=debug msg="Patched Trident role."                                                                                                                                                                                                                                               
time="2024-10-22T12:41:39Z" level=info msg="A Trident role binding was found by label." roleBinding=trident-controller                                                                                                                                                                                            
time="2024-10-22T12:41:39Z" level=debug msg="Patching Trident role binding." roleBinding=trident-controller                                                                                                                                                                                                       
time="2024-10-22T12:41:39Z" level=debug msg="Patched Kubernetes role binding." label="app=controller.csi.trident.netapp.io" roleBinding=trident-controller                                                                                                                                                        
time="2024-10-22T12:41:39Z" level=debug msg="Patched Trident role binding."                                                                                                                                                                                                                                       
time="2024-10-22T12:41:39Z" level=info msg="Deleted pod." namespace=netapp pod=transient-trident-version-pod                                                                                                                                                                                                      
time="2024-10-22T12:41:39Z" level=debug msg="Parsed YAML into unstructured object." group= kind=Pod requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown version=v1 workflow="k8s_client=trace_api"                                                                                               
time="2024-10-22T12:41:39Z" level=debug msg="Creating object." kind=Pod name=transient-trident-version-pod namespace=netapp requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                  
time="2024-10-22T12:41:39Z" level=debug msg="Created object by YAML." name=transient-trident-version-pod requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                                     
time="2024-10-22T12:41:39Z" level=info msg="Created Trident version pod." namespace=netapp pod=transient-trident-version-pod                                                                                                                                                                                      
time="2024-10-22T12:41:39Z" level=info msg="Waiting for Trident version pod to provide information."                                                                                                                                                                                                              
time="2024-10-22T12:41:39Z" level=debug msg="Invoking tunneled command: '/bin/tridentctl version --client -o yaml'" requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                          
time="2024-10-22T12:41:39Z" level=debug msg="Unable to get version information from the Trident version pod yet, waiting." increment=693.573594ms message="exec error; unable to upgrade connection: container not found (\"trident-main\")"                                                                      
time="2024-10-22T12:41:40Z" level=debug msg="Invoking tunneled command: '/bin/tridentctl version --client -o yaml'" requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                          
time="2024-10-22T12:41:40Z" level=debug msg="Unable to get version information from the Trident version pod yet, waiting." increment=1.074829275s message="exec error; unable to upgrade connection: container not found (\"trident-main\")"                                                                      
time="2024-10-22T12:41:41Z" level=debug msg="Invoking tunneled command: '/bin/tridentctl version --client -o yaml'" requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                          
time="2024-10-22T12:41:41Z" level=debug msg="Unable to get version information from the Trident version pod yet, waiting." increment=584.808783ms message="exec error; unable to upgrade connection: container not found (\"trident-main\")"                                                                      
time="2024-10-22T12:41:42Z" level=debug msg="Invoking tunneled command: '/bin/tridentctl version --client -o yaml'" requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                          
time="2024-10-22T12:41:42Z" level=debug msg="Unable to get version information from the Trident version pod yet, waiting." increment=2.321321733s message="exec error; unable to upgrade connection: container not found (\"trident-main\")"                                                                      
time="2024-10-22T12:41:44Z" level=debug msg="Invoking tunneled command: '/bin/tridentctl version --client -o yaml'" requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                          
time="2024-10-22T12:41:44Z" level=debug msg="Unable to get version information from the Trident version pod yet, waiting." increment=3.513956958s message="exec error; unable to upgrade connection: container not found (\"trident-main\")"                                                                      
time="2024-10-22T12:41:48Z" level=debug msg="Invoking tunneled command: '/bin/tridentctl version --client -o yaml'" requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                          
time="2024-10-22T12:41:48Z" level=debug msg="Unable to get version information from the Trident version pod yet, waiting." increment=3.88075879s message="exec error; unable to upgrade connection: container not found (\"trident-main\")"                                                                       
time="2024-10-22T12:41:52Z" level=debug msg="Invoking tunneled command: '/bin/tridentctl version --client -o yaml'" requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                          
time="2024-10-22T12:41:52Z" level=debug msg="Unable to get version information from the Trident version pod yet, waiting." increment=3.225349309s message="exec error; unable to upgrade connection: container not found (\"trident-main\")"                                                                      
time="2024-10-22T12:41:55Z" level=debug msg="Invoking tunneled command: '/bin/tridentctl version --client -o yaml'" requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                          
time="2024-10-22T12:41:55Z" level=debug msg="Unable to get version information from the Trident version pod yet, waiting." increment=8.852192639s message="exec error; unable to upgrade connection: container not found (\"trident-main\")"                                                                      
time="2024-10-22T12:42:04Z" level=debug msg="Invoking tunneled command: '/bin/tridentctl version --client -o yaml'" requestID=25ff4b12-29e3-4b5a-aa2b-621d86c29cd9 requestSource=Unknown workflow="k8s_client=trace_api"                                                                                          
time="2024-10-22T12:42:04Z" level=error msg="Trident version pod was unable to provide information after 30.00 seconds; err: exec error; unable to upgrade connection: container not found (\"trident-main\")"                                                                                                    
time="2024-10-22T12:42:04Z" level=error msg="failed to exec Trident version pod 'transient-trident-version-pod' (image: 'docker.io/netapp/trident:24.06.1') for the information; err: exec error; unable to upgrade connection: container not found (\"trident-main\")"                                           
time="2024-10-22T12:42:04Z" level=debug msg="Updating Trident Orchestrator CR after failed installation." tridentOrchestratorCR=trident                                                                                                                                                                           
time="2024-10-22T12:42:04Z" level=info msg="New status is same as the old status, no update needed." tridentOrchestratorCR=trident
time="2024-10-22T12:42:04Z" level=error msg="error syncing 'trident': error re-installing Trident 'trident' ; err: reconcile failed; unable to get Trident image version information, please ensure correct Trident image has been provided; err: failed to exec Trident version pod 'transient-trident-version-po
d' (image: 'docker.io/netapp/trident:24.06.1') for the information; err: exec error; unable to upgrade connection: container not found (\"trident-main\"), requeuing"
time="2024-10-22T12:42:04Z" level=info msg=-------------------------------------------------
time="2024-10-22T12:42:04Z" level=info msg=-------------------------------------------------
time="2024-10-22T12:42:04Z" level=error msg="error syncing 'trident': error re-installing Trident 'trident' ; err: reconcile failed; unable to get Trident image version information, please ensure correct Trident image has been provided; err: failed to exec Trident version pod 'transient-trident-version-po
d' (image: 'docker.io/netapp/trident:24.06.1') for the information; err: exec error; unable to upgrade connection: container not found (\"trident-main\"), requeuing"
time="2024-10-22T12:42:04Z" level=info msg="An operator based Trident CSI deployment was found." deploymentName=trident-controller deploymentNamespace=netapp
time="2024-10-22T12:42:04Z" level=debug msg="Found atleast one CSI Trident deployment created by the operator" deploymentNamespace=netapp
erwin-kok commented 4 weeks ago

Working for the same team as Robin Valk, I propose a fix by increasing the timeout to 5 minutes: https://github.com/NetApp/trident/pull/940

Note there is already exponential backoff logic in-place. However, the limit is 30s (also see logging). After 30s, the operator will delete the POD and start a new one. This process loops indefinitely until the version information could be extracted within 30s. Increasing the timeout to 5 minutes does not affect fast environments. i.e. the operator continues as soon as the version information is extracted.

Making this configurable using a variable in the helm chart is not trivial, and perhaps not wanted.