ceph / ceph-csi-operator

Kubernetes operator for managing the CephCSI plugins
Apache License 2.0

Current CSI deployments mandate usage of three nodes or else the rollout gets stuck #140

Open leelavg opened 1 week ago

leelavg commented 1 week ago
// ============================ we only have two nodes
> k get no
NAME                          STATUS   ROLES    AGE   VERSION
hcp416-bm1-b-2624094b-dn4gn   Ready    worker   22h   v1.29.7+d77deb8
hcp416-bm1-b-2624094b-pvd2m   Ready    worker   22h   v1.29.7+d77deb8

// ============================ controller plugin occupies those two nodes because of replica 2
> k get po -owide | grep -P 'NAME|ceph.com'
NAME                                                             READY  STATUS   RESTARTS  AGE   IP            NODE                       
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd6479rgt  7/7    Running  0         4h6m  10.132.1.14   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd64b5k9p  7/7    Running  0         4h6m  10.133.1.138  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.cephfs.csi.ceph.com-nodeplugin-d5wht           3/3    Running  0         4h6m  10.128.3.248  hcp416-bm1-b-2624094b-dn4gn
openshift-storage.cephfs.csi.ceph.com-nodeplugin-qkg7h           3/3    Running  0         4h6m  10.131.1.157  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-ktwrw    7/7    Running  0         4h6m  10.133.1.139  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-sp2ms    7/7    Running  0         4h6m  10.132.1.13   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-nodeplugin-xfmst              4/4    Running  0         4h6m  10.131.1.157  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-nodeplugin-xvgkp              4/4    Running  0         4h6m  10.128.3.248  hcp416-bm1-b-2624094b-dn4gn

// ============================ see https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment
// ============================ we are using the default rollingUpdate values and added a hard (ie requiredDuringScheduling... rather than preferredDuringScheduling...) anti-affinity rule
> k get deploy openshift-storage.cephfs.csi.ceph.com-ctrlplugin -oyaml | yq -r '.spec.strategy,.spec.template.spec.affinity'
rollingUpdate:
  maxSurge: 25%
  maxUnavailable: 25%
type: RollingUpdate
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: openshift-storage.cephfs.csi.ceph.com-ctrlplugin
      topologyKey: kubernetes.io/hostname

// ============================ see https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/#performing-a-rolling-update
// ============================ we are using the default rollingUpdate values
> k get ds openshift-storage.cephfs.csi.ceph.com-nodeplugin -oyaml | yq -r .spec.updateStrategy
rollingUpdate:
  maxSurge: 0
  maxUnavailable: 1
type: RollingUpdate

// ============================ now I update a value (a container arg) in the operatorconfig so that csi-op rolls out that change
> k get po -owide | grep -P 'NAME|ceph.com'
NAME                                                             READY  STATUS   RESTARTS  AGE    IP            NODE                       
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd6479rgt  7/7    Running  0         4h15m  10.132.1.14   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-747cffd64b5k9p  7/7    Running  0         4h15m  10.133.1.138  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.cephfs.csi.ceph.com-ctrlplugin-8d98c9787xnm4d  0/7    Pending  0         83s    <none>        <none>                                  
openshift-storage.cephfs.csi.ceph.com-nodeplugin-2gtvl           3/3    Running  0         53s    10.131.1.157  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.cephfs.csi.ceph.com-nodeplugin-gs56r           3/3    Running  0         21s    10.128.3.248  hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-ktwrw    7/7    Running  0         4h15m  10.133.1.139  hcp416-bm1-b-2624094b-pvd2m
openshift-storage.rbd.csi.ceph.com-ctrlplugin-6444559dc-sp2ms    7/7    Running  0         4h15m  10.132.1.13   hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-ctrlplugin-7949d485f9-bnskh   0/7    Pending  0         83s    <none>        <none>                                  
openshift-storage.rbd.csi.ceph.com-nodeplugin-8jln8              4/4    Running  0         81s    10.128.3.248  hcp416-bm1-b-2624094b-dn4gn
openshift-storage.rbd.csi.ceph.com-nodeplugin-scfxd              4/4    Running  0         82s    10.131.1.157  hcp416-bm1-b-2624094b-pvd2m

// ============================ due to the existing anti-affinity rule the new pod cannot be scheduled; we need to use the same strategy that we use for the daemonset
> k describe po openshift-storage.cephfs.csi.ceph.com-ctrlplugin-8d98c9787xnm4d
Name:                 openshift-storage.cephfs.csi.ceph.com-ctrlplugin-8d98c9787xnm4d
Namespace:            openshift-storage-client
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      ceph-csi-cephfs-ctrlplugin-sa
Node:                 <none>
Labels:               app=openshift-storage.cephfs.csi.ceph.com-ctrlplugin
                      pod-template-hash=8d98c9787
Annotations:          openshift.io/scc: ceph-csi-op-scc
Status:               Pending
[...]
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  4m27s                  default-scheduler  0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules. preemption: 0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules.
  Warning  FailedScheduling  4m22s (x2 over 4m25s)  default-scheduler  0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules. preemption: 0/2 nodes are available: 2 node(s) didn't match pod anti-affinity rules.

IOW,

  1. we want the controller plugin pods to occupy distinct nodes, and when there are only two nodes each pod takes one of them
  2. when a deployment rollout happens, this configuration cannot make progress: with 2 replicas, maxSurge: 25% rounds up to 1 surge pod, so the rollout momentarily needs 3 pods on 3 distinct nodes, which 2 nodes cannot satisfy under the hard anti-affinity rule

Possible solution,

  1. we are already using this for the daemonset, so do the same for the deployment, ie maxSurge: 0 and maxUnavailable: 1. This forces the kube controller to kill one of the deployment's running pods first, which is fine since the other pod takes over once the leader lease expires, and the pod from the new ReplicaSet can then be scheduled; a rough sketch of that strategy follows below.
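
A minimal sketch of what that strategy could look like when the operator builds the controller-plugin Deployment (illustrative only, not the operator's actual reconciler code; the function name is made up):

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// controllerPluginStrategy mirrors the DaemonSet defaults (maxSurge: 0,
// maxUnavailable: 1): the rollout terminates one running pod first instead of
// surging a third pod that has no node to land on.
func controllerPluginStrategy() appsv1.DeploymentStrategy {
	maxSurge := intstr.FromInt32(0)
	maxUnavailable := intstr.FromInt32(1)
	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
}

func main() {
	fmt.Printf("%+v\n", controllerPluginStrategy())
}

With this strategy a rollout on a 2-node cluster stays within the hard anti-affinity rule: the replacement pod reuses the node freed by the terminated pod.
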
Madhu-1 commented 1 week ago

we already have this flexibility with the cephcsi helm charts; we can configure the same here: https://github.com/ceph/ceph-csi/blob/28dc64dcae3cec8d11d84bdf525bda0ef757c688/charts/ceph-csi-cephfs/values.yaml#L145-L151. This feature is also required for compatibility with the cephcsi driver.

leelavg commented 1 week ago

ack, in that case we might have already included it in the APIs; let me check.

leelavg commented 1 week ago

I see we have the option for the nodeplugin but not for the controllerplugin: https://github.com/ceph/ceph-csi-operator/blob/28605d8c1cabeee75f1f22dea294f441c1d14f46/api/v1alpha1/driver_types.go#L162-L165
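
For context, a rough sketch of the shape of that gap (struct and field names here are paraphrased for illustration, not copied from driver_types.go; see the link above for the real definitions):

package v1alpha1

import (
	appsv1 "k8s.io/api/apps/v1"
)

// PodCommonSpec is stubbed here; the real shared pod settings live in
// driver_types.go.
type PodCommonSpec struct{}

// Node plugin side: the DaemonSet rollout is already tunable.
type NodePluginSpec struct {
	PodCommonSpec `json:",inline"`

	UpdateStrategy appsv1.DaemonSetUpdateStrategy `json:"updateStrategy,omitempty"`
}

// Controller plugin side: nothing controls the Deployment's .spec.strategy,
// so the 25% / 25% rolling-update defaults always apply.
type ControllerPluginSpec struct {
	PodCommonSpec `json:",inline"`
}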

Madhu-1 commented 1 week ago

yes, correct, we don't have it for the controllerplugin and we need to provide the same flexibility there; the cephcsi pointer is just to show that we already support it in cephcsi

nb-ohad commented 1 week ago

I am not sure this needs to be configurable separately for the node plugin and the controller plugin; a single knob should have done the trick. But the current API design already forces that separation. So the only "fix" we can offer here is to move the UpdateStrategy field to the PodCommonSpec and implement it for controller plugin reconciliation

Madhu-1 commented 1 week ago

> I am not sure this needs to be configurable separately for the node plugin and the controller plugin; a single knob should have done the trick. But the current API design already forces that separation. So the only "fix" we can offer here is to move the UpdateStrategy field to the PodCommonSpec and implement it for controller plugin reconciliation

The Deployment update strategy and the DaemonSet update strategy are not the same; they need to be separate. What is the reason to have it under PodCommonSpec, and what is it going to look like?
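
For reference, the two upstream strategy types in k8s.io/api/apps/v1 are indeed distinct Go types with different allowed values, so a single shared field would need some wrapper; a minimal illustration (not operator code):

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

func main() {
	// Deployment .spec.strategy uses appsv1.DeploymentStrategy
	// (Type: "Recreate" or "RollingUpdate").
	d := appsv1.DeploymentStrategy{Type: appsv1.RollingUpdateDeploymentStrategyType}

	// DaemonSet .spec.updateStrategy uses appsv1.DaemonSetUpdateStrategy
	// (Type: "OnDelete" or "RollingUpdate").
	ds := appsv1.DaemonSetUpdateStrategy{Type: appsv1.RollingUpdateDaemonSetStrategyType}

	// Distinct types and enum values: the API either keeps separate fields
	// per plugin or introduces its own wrapper type to cover both.
	fmt.Printf("%T vs %T\n", d, ds)
}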