IBM / ibm-spectrum-scale-csi

The IBM Spectrum Scale Container Storage Interface (CSI) project enables container orchestrators, such as Kubernetes and OpenShift, to manage the life-cycle of persistent storage.

Upstream kubernetes scheduler doesn't respect CSI driver unregistered during attachment #773

Open sam6258 opened 2 years ago

sam6258 commented 2 years ago

Describe the bug


The Kubernetes scheduler allows a pod to be scheduled to a node where the CSI driver is not registered if its volume was already provisioned. The application pod then fails to attach the volume, when the scheduler should have kept it in the Pending state.

How to Reproduce?


  1. Provision an application pod with a ReadWriteMany (RWX) persistent volume claim and wait for it to run
  2. Attempt to move that pod to a node where the CSI driver is not registered (see the kubectl sketch below)
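
A kubectl sketch of these steps, under our assumptions (object and node names are illustrative, and how the driver gets deregistered is deployment-specific; we observed it when the CNSA core pod was taken down):

```sh
# Assumed names: app-pod (RWX PVC consumer), worker-1 / worker-2 (nodes).
kubectl get pod app-pod -o wide           # note the node the pod is currently on

# Deregister the CSI driver on worker-2 (deployment-specific, e.g. remove the
# node from the driver DaemonSet's node selection), then confirm it is gone:
kubectl get csinode worker-2 -o yaml      # spec.drivers no longer lists the Scale driver

# Force the pod onto worker-2 (cordon the other nodes, or add a nodeSelector)
# and delete it so its controller reschedules it:
kubectl cordon worker-1
kubectl delete pod app-pod

# Observed bug: the pod is scheduled to worker-2 anyway and fails to attach
# (FailedAttachVolume / FailedMount events) instead of staying Pending.
kubectl describe pod app-pod
```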

Expected behavior


The application pod should remain Pending until the CSI driver is registered; that is, it should not be scheduled to a node without the CSI driver, even when its other scheduling requirements, such as node affinity, are satisfied.

We would expect the kube-scheduler to check driver registration here: https://github.com/kubernetes/kubernetes/blob/642f42d62b5e988ce7327a0dad0cf61895affb8c/pkg/controller/volume/scheduling/scheduler_binder.go#L799
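
For reference, a minimal sketch of the kind of check we have in mind. This is our illustration against the storage/v1 `CSINode` API, not upstream scheduler code; the package and function names are ours:

```go
// Sketch only: our reading of the check we would expect near
// scheduler_binder.go#L799, not upstream code.
package sketch

import storagev1 "k8s.io/api/storage/v1"

// csiDriverRegistered reports whether driverName appears in the node's
// CSINode spec. The scheduler could call something like this for each bound
// CSI PV and keep the pod Pending when the driver is absent.
func csiDriverRegistered(csiNode *storagev1.CSINode, driverName string) bool {
	if csiNode == nil {
		// No CSINode object at all: the kubelet has not registered any
		// CSI drivers on this node yet.
		return false
	}
	for _, driver := range csiNode.Spec.Drivers {
		if driver.Name == driverName {
			return true
		}
	}
	return false
}
```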

Original conversation with Deepak:

[Scott Miller] @deeghuge is there an existing Scale CSI issue to track the upstream Kubernetes issue of a pod still being scheduled on a node where the CSI driver is not registered in the CSINode object? @dunnevan and I were exploring this behavior yesterday because it would help us with application awareness in CNSA. We were hoping to just have CSI unregister when we need to take the CNSA core pod down, then drain only the application pods using Scale PVCs from the node, rather than have to do a full drain of the node. Evan tracked down the area of code in the scheduler that looks like it could be modified relatively easily to check driver registration: https://github.com/kubernetes/kubernetes/blob/642f42d62b5e988ce7327a0dad0cf61895affb8c/pkg/controller/volume/scheduling/scheduler_binder.go#L799

[Evan Dunn] CSI does unregister*; it's that the kube scheduler doesn't check that the CSINode actually has the desired PV's CSI driver. You would think they would do that, given that they also support volume topology... but alas, it seems they only check such things when provisioning a volume (for WaitForFirstConsumer).
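
To see Evan's point on a live cluster, a small client-go sketch like the following can show the driver disappearing from the CSINode object while the scheduler still places pods on the node. The kubeconfig handling and the `spectrumscale.csi.ibm.com` driver string are our assumptions:

```go
// csinode-check: hypothetical helper that reports whether a CSI driver is
// currently registered on a node, by reading the node's CSINode object.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	nodeName := os.Args[1]                    // node to inspect
	driverName := "spectrumscale.csi.ibm.com" // assumed driver name

	// Build a client from the local kubeconfig (assumes $KUBECONFIG is set).
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The CSINode object lists the CSI drivers the kubelet has registered.
	csiNode, err := client.StorageV1().CSINodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	registered := false
	for _, d := range csiNode.Spec.Drivers {
		if d.Name == driverName {
			registered = true
			break
		}
	}
	fmt.Printf("driver %s registered on node %s: %v\n", driverName, nodeName, registered)
}
```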

[Deepak Ghuge] As Evan mentioned, every time the CSI driver pod moves off a node, that node gets deregistered, and AFAIK that is not considered by the k8s scheduler. Also, we do not support WaitForFirstConsumer; it is mainly for zone/region awareness, is used in block/cloud storage, and is only consulted by the scheduler at first-time provisioning. There are two aspects to the above problem:

1. Auto-draining pods that use Scale PVCs when the CSI driver pod goes away: I think there is no k8s interface for us to signal that the CSI driver is going down so that k8s would start moving pods off the node, so I guess for now we have to drain like we do today.
2. When we drain a pod, what guarantees that the application pod will land on a node where CSI is running? Topology may help here, so I will check further.

This issue should be used to track the progress of the upstream scheduler resolution.

Jainbrt commented 2 years ago

@sam6258 could you please help put the appropriate FQI labels on this issue?