yajith opened this issue 1 year ago
Hi @yajith,
The default value of 1 second for the readinessProbe and livenessProbe should be more than enough time to respond to the kubelet, which runs on the same Kubernetes worker node. Trident has very little code to process these requests and should respond very quickly. If the Kubernetes worker node is over-provisioned to the point where the livenessProbe can't respond within 1 second, then you may experience other issues when a volume needs to be attached.
We can consider increasing the values set for the readinessProbe and livenessProbe, but we would want to hear from you what value works. Increasing this value from 1 second to 3 seconds, for instance, would already be a very generous amount of time to complete these operations.
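For context, this is the shape of the probe configuration being discussed. It is a minimal, generic sketch: the container name, probe endpoint, and port are illustrative assumptions rather than Trident's actual rendered manifest; only the 1-second `timeoutSeconds` default is the value referenced above.

```yaml
# Generic probe block for illustration; the container name, endpoint and port
# are assumptions, not Trident's actual manifest. timeoutSeconds is the value
# discussed above (Kubernetes defaults it to 1 second).
containers:
  - name: trident-main            # assumed container name
    livenessProbe:
      httpGet:
        path: /healthz            # assumed endpoint
        port: 8001                # assumed port
      timeoutSeconds: 1           # the 1-second default under discussion
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /healthz            # assumed endpoint
        port: 8001                # assumed port
      timeoutSeconds: 1           # raising this (e.g. to 3) is the ask
      periodSeconds: 10
```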
The Trident operator deployment doesn't provide a mechanism to customize the readinessProbe and livenessProbe values (based on the feedback we have received from NetApp support, and confirmed via the documentation as well).
The documentation link below lists the customization options that are currently available: https://docs.netapp.com/us-en/trident-2207/trident-get-started/kubernetes-customize-deploy.html
The documentation for the tridentctl-based deployment allows more customizations: https://docs.netapp.com/us-en/trident-2207/trident-get-started/kubernetes-customize-deploy-tridentctl.html
It appears that this isn't deliberately blocked to prevent users from overriding the defaults; rather, the operator code simply hasn't had that capability added yet.
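As an illustration of the kind of override being asked for, one could in principle patch the rendered objects directly. The object, namespace, and container names below are assumptions, and the operator's reconcile loop may well revert such a change, which is exactly why first-class support in the operator is being requested.

```yaml
# Hypothetical strategic-merge patch (object/namespace/container names assumed);
# could be applied with something like:
#   kubectl -n trident patch daemonset trident-csi --patch-file probe-patch.yaml
# Note: the operator may reconcile the DaemonSet back to its defaults.
spec:
  template:
    spec:
      containers:
        - name: trident-main        # assumed container name
          livenessProbe:
            timeoutSeconds: 3       # raised from the 1-second default
          readinessProbe:
            timeoutSeconds: 3
```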
As per our observations, in some busy cluster environments the trident-csi-xxx pods show a high number of restarts purely because the readiness/liveness probes fail due to timeouts. Upon investigating with both NetApp and Red Hat, it was suggested that updating the probe settings would help with the excessive pod restarts, but in the case of Trident operator deployments that is not something that can be done as things stand.
It was also observed that on busier nodes the trident-csi-xxx pod falls victim to these timeouts more often than any other workload, potentially because it has no resource requests and limits defined. Setting those is also something the operator deployment seems to be lacking (unless I'm mistaken); a sketch of what is meant follows below.
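For completeness, this is the kind of resources block being referred to. The container name and the values are placeholders for illustration only, not NetApp-recommended sizing.

```yaml
# Placeholder requests/limits for illustration only; not recommended sizing.
containers:
  - name: trident-main            # assumed container name
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 512Mi
```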
Raising this as a feature request so it can be considered for a future version of operator-based deployments.