Working on the same team as Robin Valk, I propose a fix that increases the timeout to 5 minutes: https://github.com/NetApp/trident/pull/940
Note that there is already exponential backoff logic in place; however, the limit is 30s (see the logging as well). After 30s the operator deletes the pod and starts a new one, and this loops indefinitely until the version information happens to be extracted within 30s. Increasing the timeout to 5 minutes does not affect fast environments: the operator continues as soon as the version information is extracted.
Making this configurable through a variable in the Helm chart is not trivial, and perhaps not wanted.
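For illustration only, here is a minimal client-go sketch of what waiting up to 5 minutes for the version pod could look like. This is not the actual diff in PR #940; the function, pod and namespace names are placeholders, and it assumes apimachinery/client-go v0.27 or newer for wait.PollUntilContextTimeout.

```go
// Minimal sketch, not Trident's actual code: raise the overall deadline for the
// version pod to 5 minutes while keeping the polling between checks short.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

const versionPodTimeout = 5 * time.Minute // effectively 30s before the proposed change

// waitForVersionPodRunning polls the pod every few seconds until it reports the
// Running phase or the overall timeout expires. "Not found" errors are swallowed,
// so a slow CNI or a slow image pull just means a few more polls.
func waitForVersionPodRunning(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, versionPodTimeout, true,
		func(ctx context.Context) (bool, error) {
			pod, err := client.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, nil // pod not visible yet; keep waiting
			}
			return pod.Status.Phase == corev1.PodRunning, nil
		})
}

func main() {
	// Wiring up a real clientset (in-cluster or from kubeconfig) is omitted here.
	fmt.Println("waitForVersionPodRunning is a sketch; see the lead-in for assumptions")
}
```

Raising only the overall deadline while keeping the poll interval short preserves the fast path: as noted above, the operator still continues as soon as the version information is available.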
Describe the bug
We installed the Astra Trident operator on our dev cluster using the Helm chart. We noticed we had installed an older chart version, so we upgraded to the latest release. Once the new operator pod started, it wanted to pull the installation version via a separate version container: it starts a pod with this container and waits for it to come online so it can run a command in the pod to export the version details in YAML format.
The problem is that it doesn't wait longer than 30 seconds for this pod to show up. After those 30 seconds it concludes that the pod failed to start, triggers a pod termination, and queues a retry job. In our environment this pod can take longer than 30 seconds to start up, and there can be many reasons for that: the CNI can take a while before it hands out a pod IP, the container image may need to be pulled via a slow HTTP proxy, and so on.
Luckily, after many retries the version is determined at some point and the install/upgrade procedure continues. This is fine in the dev environment, but if we want to run this operator in production it would be nice not to depend on a race condition.
Can this logic be updated so the timeout can be configured?
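As a sketch of what such configurability could look like, the snippet below reads a hypothetical TRIDENT_VERSION_POD_TIMEOUT environment variable (which a Helm value could then populate) and falls back to the current 30-second behaviour. This variable does not exist in Trident today; it is only an illustration.

```go
// Hypothetical sketch: TRIDENT_VERSION_POD_TIMEOUT is not a real Trident or Helm
// chart setting, just an example of a configurable timeout with a safe default.
package main

import (
	"fmt"
	"os"
	"time"
)

// versionPodTimeout returns the configured timeout, falling back to the current
// 30-second behaviour when the variable is unset or unparsable.
func versionPodTimeout() time.Duration {
	const fallback = 30 * time.Second
	raw := os.Getenv("TRIDENT_VERSION_POD_TIMEOUT")
	if raw == "" {
		return fallback
	}
	d, err := time.ParseDuration(raw) // e.g. "5m" or "300s"
	if err != nil || d <= 0 {
		return fallback
	}
	return d
}

func main() {
	fmt.Println("effective timeout:", versionPodTimeout())
}
```

This would keep the current default for everyone while letting slow environments opt into a longer wait.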
See "Additional context" for a snapshot of the logs. These logs we see repeatedly for every retry attempt.
Environment
Provide accurate information about the environment to help us reproduce the issue.
- Trident version: 23.04.0 (Helm chart: …) up to 24.06.1 (Helm chart: 100.2406.1)
- Trident installation flags used: tridentDebug: true in the Helm chart, that's all

To Reproduce
Steps to reproduce the behavior: As mentioned above, we noticed the behaviour after upgrading to a new version, but later we also saw it happening with a fresh install. There is some logic that triggers the operator to check the configured version; we are not sure exactly when it is triggered.
Expected behavior
A clear and concise description of what you expected to happen: We didn't expect it to terminate the version pod when it doesn't come online in time. In general, running a new pod just to get the version out of it is odd behaviour: the pod is started with a version tag, so the version is already known 😅; we are not sure what other information is required.
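To illustrate that last point, a version string could in principle be derived from the image tag the operator already knows. The image reference below is only an example, not something taken from the operator's code.

```go
// Hypothetical illustration: derive the version from the tag of an image
// reference the operator already has, instead of execing into a new pod.
package main

import (
	"fmt"
	"strings"
)

// versionFromImage returns the tag portion of an image reference, e.g.
// "netapp/trident:24.06.1" -> "24.06.1". Digest-pinned or untagged images
// would need extra handling.
func versionFromImage(image string) (string, bool) {
	i := strings.LastIndex(image, ":")
	if i < 0 || strings.Contains(image[i:], "/") {
		return "", false // no tag, or the colon belongs to a registry port
	}
	return image[i+1:], true
}

func main() {
	if v, ok := versionFromImage("netapp/trident:24.06.1"); ok {
		fmt.Println(v) // 24.06.1
	}
}
```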
All in all, the operator should be a little more generous when checking the version. A couple of options come to mind:
Additional context