NetApp / trident

Storage orchestrator for containers
Apache License 2.0

Run more than 1 replica of CSI deployment #729

Open fhke opened 2 years ago

fhke commented 2 years ago

Describe the solution you'd like
We would like to deploy multiple replicas of the CSI deployment. Currently, during rolling restarts of our Kubernetes clusters, we hit a condition where nodes are restarted while the CSI controller pod is being recreated, which causes the CSI daemonset pods on those nodes to fail.

Describe alternatives you've considered
We are planning to move the CSI deployment to the control plane nodes to reduce the number of times it gets evicted and recreated during cluster maintenance, but our preference would be to run this deployment in an HA configuration by scaling it out to multiple replicas.
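
Roughly, the workaround we have in mind is a patch against the operator's TridentOrchestrator CR. This is only a sketch: it assumes an operator-based install, and the CR name, namespace, and the controllerPluginNodeSelector / controllerPluginTolerations fields should be checked against your Trident version.

$ kubectl patch torc trident -n trident-operator --type merge -p \
    '{"spec":{"controllerPluginNodeSelector":{"node-role.kubernetes.io/control-plane":""},"controllerPluginTolerations":[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists","effect":"NoSchedule"}]}}'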

gnarl commented 2 years ago

Hi @fhke,

We would like to better understand how the Trident daemonset Pods are failing when the Trident controller Pod is not running in the Kubernetes cluster. When the Trident daemonset Pod has initialized, it will attempt to register with the Trident controller Pod. If the Trident daemonset isn't able to register with the Trident controller, it will retry indefinitely using a backoff that starts at 10 seconds and increases to a maximum of 120 seconds.
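
In the meantime, one way to confirm that a node Pod eventually re-registers once the controller Pod is back is to check the Trident node CRs and the node Pod logs. The namespace, daemonset, and container names below assume a default install and may differ in your cluster:

$ kubectl get tridentnodes -n trident
$ kubectl logs -n trident ds/trident-csi -c trident-main --tail=100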

If you are experiencing an issue where the Trident daemonset remains in a failed state, we would like to understand why. If possible, please open a NetApp support case to help expedite our ability to root-cause your issue.

djjudas21 commented 2 years ago

I think @fhke means the trident-csi deployment, not the daemonset. For example, on one of my clusters:

$ oc get deploy -n kube-trident-operator
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
trident-csi        1/1     1            1           96d
trident-operator   1/1     1            1           96d

$ oc get daemonset -n kube-trident-operator
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                     AGE
trident-csi   9         9         9       9            9           kubernetes.io/arch=amd64,kubernetes.io/os=linux   96d

I am also interested in running multiple replicas of the trident-csi deployment, to provide better availability.

capps-b commented 2 months ago

Wanted to chime in to also request HA support for the controller pods.

If the Trident daemonset isn't able to register with the Trident controller, it will retry indefinitely using a backoff that starts at 10 seconds and increases to a maximum of 120 seconds.

Yes, but this means that when the controller is evicted there is guaranteed downtime, which risks some pods not being able to mount PVCs during that time. Some of our customers fire off batch jobs that can execute at any time, so these periods of downtime are felt.
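
For context, this is roughly how we spot that window today with plain kubectl; the namespace and deployment name below reflect a default install and may differ from yours:

$ kubectl get events -A --field-selector reason=FailedMount --sort-by=.lastTimestamp | tail -n 20
$ kubectl get deploy trident-csi -n trident -w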

In our use case in particular, we would prefer to have a controller pod running in all of our datacenters. In the event of a storm or outage where one datacenter is completely unreachable, it would be best to have another instance already running so that we don't have to fail over with (even minimal) downtime.

clintonk commented 2 months ago

Hello, @capps-b. The Trident controller maintains an internal cache for performance reasons, so running multiple active replicas would be a heavy lift due to cache coherency concerns. We have considered using leader election to have one active and multiple passive replicas to minimize downtime. But any newly elected leader would still have to populate its cache from CRs, so any downtime mitigation would be minimal. The primary benefit would be time saved by ensuring the needed images are already present on each node where the controller could run. We would also have to enable leader election in the CSI sidecars, which presents another complication. So while technically doable, it hasn't been a high priority.
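
For anyone who wants to see where that wiring would live: the controller Deployment runs the CSI sidecars as separate containers, and you can inspect their arguments and any existing leases with something like the following. Namespace and Deployment name here assume a default install (it was kube-trident-operator in the output earlier in this thread), and container names vary by Trident version:

$ kubectl -n trident get deploy trident-csi -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.args}{"\n"}{end}'
$ kubectl -n trident get leases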

capps-b commented 2 months ago

maintains an internal cache for performance reasons

Can you elaborate on what is stored here, and what the impact might be if, say, the cache is lost between mount requests?

Let's say I scale the controller deployment to 2 and wait for Trident to register the backends. Then I try to mount a PVC from a pod, and Kubernetes directs the request to the new controller. What happens?

clintonk commented 2 months ago

Can you elaborate on what is stored here

Most information that Trident keeps in CRs (backends, volumes, snapshots, volume attachments, nodes, etc.) is cached in memory. If you run multiple replicas, each could handle part of the incoming requests or even race with the other replicas, and both scenarios lead to an inaccurate in-memory view in each replica. Please don't do that!
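
If you want to see the source of truth behind that cache, it is all in Trident's CRDs and you can browse it directly; the exact CRD set varies by Trident version, and the namespace depends on your install:

$ kubectl get crds | grep trident.netapp.io
$ kubectl -n trident get tridentbackends,tridentvolumes,tridentsnapshots,tridentnodes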