NetApp / trident

Storage orchestrator for containers
Apache License 2.0
Capability to customize readinessProbe/livenessProbe values on a Trident operator deployment #815

Open yajith opened 1 year ago

yajith commented 1 year ago

The Trident operator deployment doesn't provide a mechanism to customize the readinessProbe and livenessProbe values. (This is based on feedback we received from NetApp support and is confirmed by the documentation.)

It appears that this isn't deliberately blocked to prevent users from overriding the defaults; rather, the operator code simply doesn't have that capability yet.

From what we have observed, in some busy cluster environments the trident-csi-xxx pods accumulate a high number of restarts purely because readiness/liveness probes fail on timeouts. After we investigated with both NetApp and Red Hat, it was suggested that updating the probe settings would help with the excessive pod restarts. With a Trident operator deployment, however, that is not something that can be done as it stands.
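For reference, the probe settings in question are the standard Kubernetes container probe fields. The fragment below is a generic sketch only, not Trident's actual manifest: the container name, the probe endpoint, the port, and the raised timeout value are all hypothetical, and the operator does not currently expose these fields.

```yaml
# Illustrative container spec fragment; NOT taken from Trident's manifests.
containers:
  - name: example-container      # hypothetical name
    livenessProbe:
      httpGet:
        path: /healthz           # hypothetical endpoint
        port: 8080               # hypothetical port
      periodSeconds: 10          # how often the kubelet probes (Kubernetes default: 10)
      timeoutSeconds: 3          # per-probe timeout (Kubernetes default: 1); 3 is an example value
      failureThreshold: 3        # consecutive failures before restart (Kubernetes default: 3)
    readinessProbe:
      httpGet:
        path: /healthz           # hypothetical endpoint
        port: 8080               # hypothetical port
      periodSeconds: 10
      timeoutSeconds: 3          # example value, as above
      failureThreshold: 3        # consecutive failures before the pod is marked not ready
```

A liveness failure restarts the container, while a readiness failure only marks the pod as not ready, so the restart counts described above point at the livenessProbe timeout in particular.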

We also observed that on busier nodes the trident-csi-xxx pod falls victim to these timeouts more often than any other workload, potentially because it has no resource requests and limits defined. This also seems to be lacking in the operator deployment (unless I'm mistaken).
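To illustrate what is meant by requests and limits: a pod whose containers set neither falls into the Kubernetes BestEffort QoS class, which gets the lowest priority under node resource pressure. The fragment below is a minimal sketch of a container-level `resources` stanza; the values are placeholders chosen for illustration, not sizing recommendations for Trident.

```yaml
# Illustrative container-level resources stanza; values are hypothetical placeholders.
resources:
  requests:
    cpu: 100m        # CPU share the scheduler reserves for the container
    memory: 128Mi    # memory the scheduler reserves for the container
  limits:
    memory: 256Mi    # hard memory ceiling; exceeding it gets the container OOM-killed
```

Setting requests below limits like this puts the pod in the Burstable QoS class; setting requests equal to limits for both CPU and memory in every container would make it Guaranteed.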

Raising this as a feature request so it can be considered for a future version of operator-based deployments.

gnarl commented 1 year ago

Hi @yajith,

The default timeout of 1 second for the readinessProbe and livenessProbe should be more than enough time to respond to the kubelet, which is running on the same Kubernetes worker node. Trident has very little code to process these requests and should respond very quickly. If the Kubernetes worker node is overprovisioned to the point where the livenessProbe can't respond within 1 second, then you may experience other issues when a volume needs to be attached.

We can consider increasing the values set for the readinessProbe and livenessProbe, but we would want to know from you what value works. Increasing this value from 1 second to 3 seconds, for instance, would allow a very large amount of time for these operations to complete.