NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

gpu-feature-discovery, nvidia-container-toolkit-daemonset, nvidia-device-plugin-daemonset & nvidia-driver-daemonset are not removed after a GPU node is drained from the cluster #584

Open shnigam2 opened 10 months ago

shnigam2 commented 10 months ago

1. Quick Debug Information

2. Issue or feature description

gpu-feature-discovery, nvidia-container-toolkit-daemonset, nvidia-device-plugin-daemonset & nvidia-driver-daemonset are not removed after the GPU node is drained from the cluster. The description of these pods shows:

Events:
  Type     Reason        Age    From             Message
  ----     ------        ----   ----             -------
  Warning  NodeNotReady  3m49s  node-controller  Node is not ready
 k get events -n gpu-operator
LAST SEEN   TYPE      REASON                 OBJECT                                                 MESSAGE
4m          Warning   NodeNotReady           pod/gpu-feature-discovery-rq8p4                        Node is not ready
4m1s        Warning   NodeNotReady           pod/gpu-operator-node-feature-discovery-worker-jqfk8   Node is not ready
4m1s        Warning   NodeNotReady           pod/nvidia-container-toolkit-daemonset-6twjt           Node is not ready
93s         Normal    TaintManagerEviction   pod/nvidia-cuda-validator-7zxdq                        Cancelling deletion of Pod gpu-operator/nvidia-cuda-validator-7zxdq
4m1s        Warning   NodeNotReady           pod/nvidia-dcgm-exporter-vffrj                         Node is not ready
4m1s        Warning   NodeNotReady           pod/nvidia-device-plugin-daemonset-jwlqh               Node is not ready
93s         Normal    TaintManagerEviction   pod/nvidia-device-plugin-validator-gvtsg               Cancelling deletion of Pod gpu-operator/nvidia-device-plugin-validator-gvtsg
4m1s        Warning   NodeNotReady           pod/nvidia-driver-daemonset-8jbgc                      Node is not ready
4m1s        Warning   NodeNotReady           pod/nvidia-operator-validator-62h5p                    Node is not ready

Logs of k8s-driver-manager before terminating the GPU node:

k logs nvidia-driver-daemonset-5gt5q -c k8s-driver-manager -n  gpu-operator 
Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
Current value of 'nvidia.com/gpu.deploy.mig-manager='
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
Current value of 'nodeType='
Current value of AUTO_UPGRADE_POLICY_ENABLED='
Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
node/ip-10-222-101-214.ec2.internal labeled
Waiting for the operator-validator to shutdown
pod/nvidia-operator-validator-hhrnx condition met
Waiting for the container-toolkit to shutdown
pod/nvidia-container-toolkit-daemonset-kb6s4 condition met
Waiting for the device-plugin to shutdown
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Auto upgrade policy of the GPU driver on the node ip-10-222-101-214.ec2.internal is disabled
Cordoning node ip-10-222-101-214.ec2.internal...
node/ip-10-222-101-214.ec2.internal cordoned
Draining node ip-10-222-101-214.ec2.internal of any GPU pods...
W0922 16:03:37.375717    7767 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2023-09-22T16:03:37Z" level=info msg="Identifying GPU pods to delete"
time="2023-09-22T16:03:37Z" level=info msg="No GPU pods to delete. Exiting."
unbinding device 0000:00:1e.0
Auto upgrade policy of the GPU driver on the node ip-10-222-101-214.ec2.internal is disabled
Uncordoning node ip-10-222-101-214.ec2.internal...
node/ip-10-222-101-214.ec2.internal uncordoned
Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
node/ip-10-222-101-214.ec2.internal labeled

Values we are passing to Helm:

      source:
        path: deployments/gpu-operator
        repoURL: https://github.com/NVIDIA/gpu-operator.git
        targetRevision: v23.3.1
        helm:
          releaseName: gpu-operator
          values: |-
            validator:
              repository: our-repo/nvidia
              imagePullSecrets:
                - image-secret
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            daemonsets:
              priorityClassName: system-node-critical
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            operator:
              repository: our-repo/nvidia
              image: gpu-operator
              version: v22.9.0-ubi8
              imagePullSecrets: [image-secret]
              defaultRuntime: containerd
              tolerations:
              - key: "node-role.kubernetes.io/master"
                operator: "Equal"
                value: ""
                effect: "NoSchedule"
            driver:
              enabled: true
              repository: our-repo/nvidia
              image: nvidia-kmods-driver-flatcar
              version: '{{values.driverImage}}'
              imagePullSecrets:
              - image-secret
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            toolkit:
              enabled: true
              repository: our-repo/nvidia
              image: container-toolkit
              version: v1.13.0-ubuntu20.04
              imagePullSecrets:
              - image-secret
              tolerations:
              - key: gpu.kubernetes.io/gpu-exists
                operator: Exists
                effect: NoSchedule
            devicePlugin:
              repository: our-repo/nvidia
              imagePullSecrets:
                - image-secret
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            dcgm:
              repository: our-repo/nvidia
              image: 3.1.7-1-ubuntu20.04
              imagePullSecrets:
              - image-secret
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            dcgmExporter:
              repository: our-repo/nvidia
              image: dcgm-exporter
              imagePullSecrets:
              - frog-auth
              version: 3.1.7-3.1.4-ubuntu20.04
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            gfd:
              repository: our-repo/nvidia
              image: gpu-feature-discovery
              version: v0.8.0-ubi8
              imagePullSecrets:
              - image-secret
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            migManager:
              enabled: true
              repository: our-repo/nvidia
              image: k8s-mig-manager
              version: v0.5.2-ubuntu20.04
              imagePullSecrets:
              - image-secret
              tolerations:
                - key: gpu.kubernetes.io/gpu-exists
                  operator: Exists
                  effect: NoSchedule
            node-feature-discovery:
              image:
                repository: our-repo/nvidia/node-feature-discovery
              imagePullSecrets:
              - name: image-secret
              worker:
                tolerations:
                - key: "gpu.kubernetes.io/gpu-exists"
                  operator: "Equal"
                  value: ""
                  effect: "NoSchedule"
                nodeSelector:
                  beta.kubernetes.io/os: linux

Please let us know how to control this pod eviction when a GPU node is scaled down, as these pods still show as Running even after the GPU node has been removed from the cluster.

shnigam2 commented 10 months ago

@shivamerla @cdesiniotis Please advise on this.

tariq1890 commented 10 months ago

@shnigam2 Can you share your GPU node YAML manifest?

shnigam2 commented 10 months ago

@tariq1890 Please find the manifest of the GPU node when all NVIDIA pods are in the Running state:

k get po -n gpu-operator -o wide |grep -i ip-10-222-100-91.ec2.internal
gpu-feature-discovery-zzkqg                                  1/1     Running     0          6m44s   100.119.232.78   ip-10-222-100-91.ec2.internal    <none>           <none>
gpu-operator-node-feature-discovery-worker-2vqg7             1/1     Running     0          7m52s   100.119.232.69   ip-10-222-100-91.ec2.internal    <none>           <none>
nvidia-container-toolkit-daemonset-ksp5q                     1/1     Running     0          6m44s   100.119.232.73   ip-10-222-100-91.ec2.internal    <none>           <none>
nvidia-cuda-validator-ccgrb                                  0/1     Completed   0          5m13s   100.119.232.76   ip-10-222-100-91.ec2.internal    <none>           <none>
nvidia-dcgm-exporter-tjpz9                                   1/1     Running     0          6m44s   100.119.232.75   ip-10-222-100-91.ec2.internal    <none>           <none>
nvidia-device-plugin-daemonset-xc7rb                         1/1     Running     0          6m44s   100.119.232.77   ip-10-222-100-91.ec2.internal    <none>           <none>
nvidia-device-plugin-validator-c6qzp                         0/1     Completed   0          4m26s   100.119.232.79   ip-10-222-100-91.ec2.internal    <none>           <none>
nvidia-driver-daemonset-cxjdf                                1/1     Running     0          7m20s   100.119.232.72   ip-10-222-100-91.ec2.internal    <none>           <none>
nvidia-operator-validator-tq797                              1/1     Running     0          6m44s   100.119.232.74   ip-10-222-100-91.ec2.internal    <none>           <none>
k get nodes ip-10-222-100-91.ec2.internal -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"csi.oneagent.dynatrace.com":"ip-10-222-100-91.ec2.internal","csi.tigera.io":"ip-10-222-100-91.ec2.internal","ebs.csi.aws.com":"i-054d7daae0d81b5ec"}'
    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
    nfd.node.kubernetes.io/extended-resources: ""
    nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-cpuid.AVX512VL,cpu-cpuid.AVX512VNNI,cpu-cpuid.CMPXCHG8,cpu-cpuid.FMA3,cpu-cpuid.FXSR,cpu-cpuid.FXSROPT,cpu-cpuid.HYPERVISOR,cpu-cpuid.LAHF,cpu-cpuid.MOVBE,cpu-cpuid.MPX,cpu-cpuid.OSXSAVE,cpu-cpuid.SYSCALL,cpu-cpuid.SYSEE,cpu-cpuid.X87,cpu-cpuid.XGETBV1,cpu-cpuid.XSAVE,cpu-cpuid.XSAVEC,cpu-cpuid.XSAVEOPT,cpu-cpuid.XSAVES,cpu-hardware_multithreading,cpu-model.family,cpu-model.id,cpu-model.vendor_id,kernel-config.NO_HZ,kernel-config.NO_HZ_IDLE,kernel-version.full,kernel-version.major,kernel-version.minor,kernel-version.revision,nvidia.com/cuda.driver.major,nvidia.com/cuda.driver.minor,nvidia.com/cuda.driver.rev,nvidia.com/cuda.runtime.major,nvidia.com/cuda.runtime.minor,nvidia.com/gfd.timestamp,nvidia.com/gpu.compute.major,nvidia.com/gpu.compute.minor,nvidia.com/gpu.count,nvidia.com/gpu.family,nvidia.com/gpu.machine,nvidia.com/gpu.memory,nvidia.com/gpu.product,nvidia.com/gpu.replicas,nvidia.com/mig.capable,nvidia.com/mig.strategy,pci-10de.present,pci-1d0f.present,storage-nonrotationaldisk,system-os_release.ID,system-os_release.VERSION_ID,system-os_release.VERSION_ID.major,system-os_release.VERSION_ID.minor
    nfd.node.kubernetes.io/worker.version: v0.12.1
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 10.222.100.91/24
    projectcalico.org/IPv4IPIPTunnelAddr: 100.119.232.64
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2023-09-23T02:36:25Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: g4dn.xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1a
    feature.node.kubernetes.io/cpu-cpuid.ADX: "true"
    feature.node.kubernetes.io/cpu-cpuid.AESNI: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX2: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512BW: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512CD: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512F: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512VL: "true"
    feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI: "true"
    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8: "true"
    feature.node.kubernetes.io/cpu-cpuid.FMA3: "true"
    feature.node.kubernetes.io/cpu-cpuid.FXSR: "true"
    feature.node.kubernetes.io/cpu-cpuid.FXSROPT: "true"
    feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR: "true"
    feature.node.kubernetes.io/cpu-cpuid.LAHF: "true"
    feature.node.kubernetes.io/cpu-cpuid.MOVBE: "true"
    feature.node.kubernetes.io/cpu-cpuid.MPX: "true"
    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE: "true"
    feature.node.kubernetes.io/cpu-cpuid.SYSCALL: "true"
    feature.node.kubernetes.io/cpu-cpuid.SYSEE: "true"
    feature.node.kubernetes.io/cpu-cpuid.X87: "true"
    feature.node.kubernetes.io/cpu-cpuid.XGETBV1: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVE: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVEC: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT: "true"
    feature.node.kubernetes.io/cpu-cpuid.XSAVES: "true"
    feature.node.kubernetes.io/cpu-hardware_multithreading: "true"
    feature.node.kubernetes.io/cpu-model.family: "6"
    feature.node.kubernetes.io/cpu-model.id: "85"
    feature.node.kubernetes.io/cpu-model.vendor_id: Intel
    feature.node.kubernetes.io/kernel-config.NO_HZ: "true"
    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE: "true"
    feature.node.kubernetes.io/kernel-version.full: 5.15.125-flatcar
    feature.node.kubernetes.io/kernel-version.major: "5"
    feature.node.kubernetes.io/kernel-version.minor: "15"
    feature.node.kubernetes.io/kernel-version.revision: "125"
    feature.node.kubernetes.io/pci-10de.present: "true"
    feature.node.kubernetes.io/pci-1d0f.present: "true"
    feature.node.kubernetes.io/storage-nonrotationaldisk: "true"
    feature.node.kubernetes.io/system-os_release.ID: flatcar
    feature.node.kubernetes.io/system-os_release.VERSION_ID: 3510.2.7
    feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "3510"
    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "2"
    instance-group: cpu-g4dn-xlarge
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-10-222-100-91.ec2.internal
    kubernetes.io/os: linux
    kubernetes.io/role: node
    our-registry.cloud/gpu: "true"
    node-role.kubernetes.io/node: ""
    node.kubernetes.io/instance-type: g4dn.xlarge
    node.kubernetes.io/role: node
    nvidia.com/cuda.driver.major: "525"
    nvidia.com/cuda.driver.minor: "105"
    nvidia.com/cuda.driver.rev: "17"
    nvidia.com/cuda.runtime.major: "12"
    nvidia.com/cuda.runtime.minor: "0"
    nvidia.com/gfd.timestamp: "1695436816"
    nvidia.com/gpu.compute.major: "7"
    nvidia.com/gpu.compute.minor: "5"
    nvidia.com/gpu.count: "1"
    nvidia.com/gpu.deploy.container-toolkit: "true"
    nvidia.com/gpu.deploy.dcgm: "true"
    nvidia.com/gpu.deploy.dcgm-exporter: "true"
    nvidia.com/gpu.deploy.device-plugin: "true"
    nvidia.com/gpu.deploy.driver: "true"
    nvidia.com/gpu.deploy.gpu-feature-discovery: "true"
    nvidia.com/gpu.deploy.node-status-exporter: "true"
    nvidia.com/gpu.deploy.nvsm: ""
    nvidia.com/gpu.deploy.operator-validator: "true"
    nvidia.com/gpu.family: turing
    nvidia.com/gpu.machine: g4dn.xlarge
    nvidia.com/gpu.memory: "15360"
    nvidia.com/gpu.present: "true"
    nvidia.com/gpu.product: Tesla-T4
    nvidia.com/gpu.replicas: "1"
    nvidia.com/mig.capable: "false"
    nvidia.com/mig.strategy: single
    topology.ebs.csi.aws.com/zone: us-east-1a
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1a
  name: ip-10-222-100-91.ec2.internal
  resourceVersion: "36894521"
  uid: d5c9ddb2-3379-4c9f-942e-0b65d1162edb
spec:
  podCIDR: 100.96.37.0/24
  podCIDRs:
  - 100.96.37.0/24
  providerID: aws:///us-east-1a/i-054d7daae0d81b5ec
  taints:
  - effect: NoSchedule
    key: gpu.kubernetes.io/gpu-exists
status:
  addresses:
  - address: 10.222.100.91
    type: InternalIP
  - address: ip-10-222-100-91.ec2.internal
    type: Hostname
  - address: ip-10-222-100-91.ec2.internal
    type: InternalDNS
  allocatable:
    attachable-volumes-aws-ebs: "39"
    cpu: "4"
    ephemeral-storage: "88450615150"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 15980652Ki
    nvidia.com/gpu: "1"
    pods: "110"
  capacity:
    attachable-volumes-aws-ebs: "39"
    cpu: "4"
    ephemeral-storage: 95975060Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16083052Ki
    nvidia.com/gpu: "1"
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2023-09-23T02:37:03Z"
    lastTransitionTime: "2023-09-23T02:37:03Z"
    message: Calico is running on this node
    reason: CalicoIsUp
    status: "False"
    type: NetworkUnavailable
  - lastHeartbeatTime: "2023-09-23T02:40:51Z"
    lastTransitionTime: "2023-09-23T02:36:25Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2023-09-23T02:40:51Z"
    lastTransitionTime: "2023-09-23T02:36:25Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2023-09-23T02:40:51Z"
    lastTransitionTime: "2023-09-23T02:36:25Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2023-09-23T02:40:51Z"
    lastTransitionTime: "2023-09-23T02:36:57Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/nvidia-kmods-driver-flatcar@sha256:3e83fc8abe394bb2a86577a2e936e425ec4c3952301cb12712f576ba2b642cb4
    sizeBytes: 1138988828
  - names:
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/dcgm-exporter@sha256:ae014d7f27c32ba83128ba31e2f8ab3a0910a46607e63d2ae7a90ae3551e3330
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
    sizeBytes: 1059498968
  - names:
    - our-registry-cngccp-docker.jfrog.io/splunk/fluentd-hec@sha256:9f6b4642a22f8942bb4d6c5357ee768fe515fa21d49577b88ba12098c382656b
    - our-registry-cngccp-docker.jfrog.io/splunk/fluentd-hec:1.2.8
    sizeBytes: 315828956
  - names:
    - xpj245675755234.live.dynatrace.com/linux/oneagent@sha256:a44033e943518221fd657d033845c12850ba872d9e61616c192f406919b87bb3
    - xpj245675755234.live.dynatrace.com/linux/oneagent:1.265.152
    sizeBytes: 227902134
  - names:
    - nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:cab21c93987a5c884075efe0fb4a8abaa1997e1696cbc773ba69889f42f8329b
    - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.1
    sizeBytes: 213778085
  - names:
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/k8s-device-plugin@sha256:46ce950d29cd67351c37850cec6aafa718d346f181c956d73bec079f9d96fbc1
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/k8s-device-plugin:v0.14.0-ubi8
    sizeBytes: 165982184
  - names:
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-feature-discovery@sha256:b1c162fb5fce21a684b4e28dae2c37d60b2d3c47b7270dd0bce835b7ce9e5a24
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8
    sizeBytes: 162038014
  - names:
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator@sha256:f6bf463459a61aa67c5f9e4f4f97797609b85bf77aaef88b0e78536889a7e517
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator:devel-ubi8
    sizeBytes: 141870962
  - names:
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit@sha256:91e028c8177b4896b7d79f08c64f3a84cb66a0f5a3f32b844d909ebbbd7e0369
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit:v1.13.0-ubuntu20.04
    sizeBytes: 127160969
  - names:
    - docker.io/calico/cni@sha256:9a2c99f0314053aa11e971bd5d72e17951767bf5c6ff1fd9c38c4582d7cb8a0a
    - docker.io/calico/cni:v3.25.1
    sizeBytes: 89884044
  - names:
    - docker.io/calico/node@sha256:0cd00e83d06b3af8cd712ad2c310be07b240235ad7ca1397e04eb14d20dcc20f
    - docker.io/calico/node:v3.25.1
    sizeBytes: 88335791
  - names:
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery@sha256:a498b39f2fd7435d8862a9a916ef6eb4d2a4d8d5b4c6788fb48bdb11b008e87a
    - our-registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery:v0.12.1
    sizeBytes: 73669012
  - names:
    - our-registry-cngccp-docker.jfrog.io/dynatrace/dynatrace-operator@sha256:ce621425125ba8fdcfa0f300c75e0167e9301a4654fcd1c14baa75f4d41151a3
    - our-registry-cngccp-docker.jfrog.io/dynatrace/dynatrace-operator:v0.9.1
    sizeBytes: 43133681
  - names:
    - public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver@sha256:2d1ecf57fcfde2403a66e7709ecbb55db6d2bfff64c5c71225c9fb101ffe9c30
    - public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.18.0
    sizeBytes: 30176686
  - names:
    - registry.k8s.io/kube-proxy@sha256:8d998d77a1fae5d933a7efea97faace684559d70a37a72dba7193ed84e1bc45d
    - registry.k8s.io/kube-proxy:v1.26.7
    sizeBytes: 21764578
  - names:
    - our-registry-cngccp-docker.jfrog.io/kube2iam@sha256:aba84ebec51b25a22ffbcf3fe1599dabb0c88d7de87f07f00b85b79ddd72d672
    - our-registry-cngccp-docker.jfrog.io/kube2iam:imdsv2-fix
    sizeBytes: 14666113
  - names:
    - docker.io/calico/node-driver-registrar@sha256:5954319e4dbf61aac2e704068e9f3cd083d67f630c08bc0d280863dbf01668bc
    - docker.io/calico/node-driver-registrar:v3.25.1
    sizeBytes: 11695360
  - names:
    - docker.io/calico/csi@sha256:1f17de674c15819408c02ea5699bc3afe75f3120fbaf9c23ad5bfa2bca01814c
    - docker.io/calico/csi:v3.25.1
    sizeBytes: 11053330
  - names:
    - docker.io/calico/pod2daemon-flexvol@sha256:66629150669c4ff7f70832858af28139407da59f61451a8658f15f06b9f20436
    - docker.io/calico/pod2daemon-flexvol:v3.25.1
    sizeBytes: 7167792
  - names:
    - public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar@sha256:6ad0cae2ae91453f283a44e9b430e475b8a9fa3d606aec9a8b09596fffbcd2c9
    - public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.7.0-eks-1-26-7
    sizeBytes: 6560300
  - names:
    - public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe@sha256:d9e11b42ae5f4f2f7ea9034e68040997cdbb04ae9e188aa897f76ae92698d78a
    - public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.9.0-eks-1-26-7
    sizeBytes: 6086054
  - names:
    - our-registry-cngccp-docker-k8s.jfrog.io/logrotate@sha256:26454d4621f3ed8c1d048fbc3a25b31a00f45a4404c1d3716845cb154b571e3e
    - our-registry-cngccp-docker-k8s.jfrog.io/logrotate:1.0_5469f66
    sizeBytes: 5572108
  - names:
    - registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db
    - registry.k8s.io/pause:3.6
    sizeBytes: 301773
  nodeInfo:
    architecture: amd64
    bootID: 1e548d4e-2bf3-4de0-ae7f-017980214214
    containerRuntimeVersion: containerd://1.6.16
    kernelVersion: 5.15.125-flatcar
    kubeProxyVersion: v1.26.7
    kubeletVersion: v1.26.7
    machineID: ec27866d6a0f6aeff75511a5668b6a78
    operatingSystem: linux
    osImage: Flatcar Container Linux by Kinvolk 3510.2.7 (Oklo)
    systemUUID: ec27866d-6a0f-6aef-f755-11a5668b6a78

tariq1890 commented 10 months ago

How are you draining these nodes?

Please ensure --ignore-daemonsets is set to false when running the kubectl drain command
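
For illustration, a drain invocation along those lines (a sketch only; the node name is taken from the driver-manager logs above, and other flags should be adjusted to your environment):

kubectl drain ip-10-222-101-214.ec2.internal --ignore-daemonsets=false
# With --ignore-daemonsets=false (the kubectl default), drain refuses to proceed
# while DaemonSet-managed pods such as nvidia-driver-daemonset are still scheduled
# on the node, rather than leaving them running there.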

shnigam2 commented 10 months ago

@tariq1890 We terminate the backing EC2 instance directly. Up to Kubernetes 1.24 this removed all of these NVIDIA pods, but on 1.26 these four pods still show as Running even though the underlying instance has already been removed. Is there any parameter we need to pass for 1.26?

shnigam2 commented 9 months ago

@tariq1890 @cdesiniotis @shivamerla Please let me know how to fix this issue. The DaemonSet pods are not cleaned up when the cluster autoscaler terminates a node; node removal should ideally remove all NVIDIA DaemonSet pods, which is not happening in our case.
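
As a diagnostic sketch (not part of the operator; assumes kubectl and jq are available on the workstation), the pods still bound to nodes that no longer exist can be listed like this:

kubectl get pods -n gpu-operator -o json \
  | jq -r '.items[] | [.metadata.name, .spec.nodeName] | @tsv' \
  | while read -r pod node; do
      # Report the pod if its node object has already been deleted.
      kubectl get node "$node" >/dev/null 2>&1 || echo "orphaned: $pod (node $node gone)"
    done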

shnigam2 commented 9 months ago

@shivamerla Could you please help us understand the cause of this behavior? We are using Flatcar for the worker nodes.

shnigam2 commented 8 months ago

@shivamerla @tariq1890 @cdesiniotis Could you please help us fix this behaviour? Because of it, the namespace shows pods that no longer actually exist, since the node has already been scaled down.

shivamerla commented 8 months ago

@shnigam2 Can you provide logs from the k8s controller-manager pod so we can check for errors when cleaning up these pods? Are you using images from a private registry (i.e. using pullSecrets)?

shnigam2 commented 8 months ago

@shivamerla Yes, we are using a private registry. Please find the controller-manager logs showing the errors:

I1110 02:29:52.043817       1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/gpu-feature-discovery-krj5j"
E1110 02:29:52.048104       1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/gpu-feature-discovery-krj5j; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.048189       1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-driver-daemonset-vzcrj"
E1110 02:29:52.057008       1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-driver-daemonset-vzcrj; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.057063       1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-device-plugin-daemonset-xztw8"
E1110 02:29:52.061290       1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-device-plugin-daemonset-xztw8; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.061316       1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-device-plugin-daemonset-lwhk6"
E1110 02:29:52.065459       1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-device-plugin-daemonset-lwhk6; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.065625       1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-container-toolkit-daemonset-fzg45"
E1110 02:29:52.071929       1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-container-toolkit-daemonset-fzg45; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.071967       1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-container-toolkit-daemonset-wdlkq"
E1110 02:29:52.076635       1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-container-toolkit-daemonset-wdlkq; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.076784       1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/gpu-feature-discovery-bh6jn"
E1110 02:29:52.080977       1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/gpu-feature-discovery-bh6jn; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.081028       1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-driver-daemonset-8vv6j"
E1110 02:29:52.085230       1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-driver-daemonset-8vv6j; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
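
The duplicate-entry error above can be confirmed directly on one of the stuck pods (a diagnostic sketch; the pod name is taken from the PodGC log lines above):

kubectl get pod nvidia-driver-daemonset-vzcrj -n gpu-operator \
  -o jsonpath='{.spec.imagePullSecrets}{"\n"}'
# If the same secret name (jfrog-auth here) appears twice in the output, it matches
# the "duplicate entries for key" error that prevents PodGC from force deleting the pod.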

shnigam2 commented 8 months ago

@shivamerla Can you please check and help with this?

shivamerla commented 8 months ago

@shnigam2 We have a known issue that will be fixed in the next patch, v23.9.1 (later this month). The problem is that we add duplicate pullSecrets to the spec. You can avoid this by not specifying the pullSecret in ClusterPolicy for the validator image. We use the validator image as an initContainer and thus ended up adding the same secret twice, for the initContainer as well as the main container, in every DaemonSet.
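
A minimal sketch of that workaround against the Helm values shown earlier in this thread (illustrative only; it assumes the same secret is still supplied through the other per-component sections, so image pulls keep working):

validator:
  repository: our-repo/nvidia
  # imagePullSecrets intentionally omitted here: the validator image also runs as an
  # initContainer in the other DaemonSets, and listing the secret under `validator`
  # as well is what produces the duplicate .spec.imagePullSecrets entry that PodGC
  # rejects in the controller-manager logs above.
  tolerations:
  - key: gpu.kubernetes.io/gpu-exists
    operator: Exists
    effect: NoSchedule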