shnigam2 opened this issue 10 months ago
@shivamerla @cdesiniotis Could you please advise on this?
@shnigam2 Can you share your gpu node yaml manifest?
@tariq1890 Please find the manifest of the GPU node while all NVIDIA pods are in the Running state:
k get po -n gpu-operator -o wide |grep -i ip-10-222-100-91.ec2.internal
gpu-feature-discovery-zzkqg 1/1 Running 0 6m44s 100.119.232.78 ip-10-222-100-91.ec2.internal <none> <none>
gpu-operator-node-feature-discovery-worker-2vqg7 1/1 Running 0 7m52s 100.119.232.69 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-container-toolkit-daemonset-ksp5q 1/1 Running 0 6m44s 100.119.232.73 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-cuda-validator-ccgrb 0/1 Completed 0 5m13s 100.119.232.76 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-dcgm-exporter-tjpz9 1/1 Running 0 6m44s 100.119.232.75 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-device-plugin-daemonset-xc7rb 1/1 Running 0 6m44s 100.119.232.77 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-device-plugin-validator-c6qzp 0/1 Completed 0 4m26s 100.119.232.79 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-driver-daemonset-cxjdf 1/1 Running 0 7m20s 100.119.232.72 ip-10-222-100-91.ec2.internal <none> <none>
nvidia-operator-validator-tq797 1/1 Running 0 6m44s 100.119.232.74 ip-10-222-100-91.ec2.internal <none> <none>
k get nodes ip-10-222-100-91.ec2.internal -o yaml
apiVersion: v1
kind: Node
metadata:
annotations:
csi.volume.kubernetes.io/nodeid: '{"csi.oneagent.dynatrace.com":"ip-10-222-100-91.ec2.internal","csi.tigera.io":"ip-10-222-100-91.ec2.internal","ebs.csi.aws.com":"i-054d7daae0d81b5ec"}'
kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
nfd.node.kubernetes.io/extended-resources: ""
nfd.node.kubernetes.io/feature-labels: cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-cpuid.AVX512VL,cpu-cpuid.AVX512VNNI,cpu-cpuid.CMPXCHG8,cpu-cpuid.FMA3,cpu-cpuid.FXSR,cpu-cpuid.FXSROPT,cpu-cpuid.HYPERVISOR,cpu-cpuid.LAHF,cpu-cpuid.MOVBE,cpu-cpuid.MPX,cpu-cpuid.OSXSAVE,cpu-cpuid.SYSCALL,cpu-cpuid.SYSEE,cpu-cpuid.X87,cpu-cpuid.XGETBV1,cpu-cpuid.XSAVE,cpu-cpuid.XSAVEC,cpu-cpuid.XSAVEOPT,cpu-cpuid.XSAVES,cpu-hardware_multithreading,cpu-model.family,cpu-model.id,cpu-model.vendor_id,kernel-config.NO_HZ,kernel-config.NO_HZ_IDLE,kernel-version.full,kernel-version.major,kernel-version.minor,kernel-version.revision,nvidia.com/cuda.driver.major,nvidia.com/cuda.driver.minor,nvidia.com/cuda.driver.rev,nvidia.com/cuda.runtime.major,nvidia.com/cuda.runtime.minor,nvidia.com/gfd.timestamp,nvidia.com/gpu.compute.major,nvidia.com/gpu.compute.minor,nvidia.com/gpu.count,nvidia.com/gpu.family,nvidia.com/gpu.machine,nvidia.com/gpu.memory,nvidia.com/gpu.product,nvidia.com/gpu.replicas,nvidia.com/mig.capable,nvidia.com/mig.strategy,pci-10de.present,pci-1d0f.present,storage-nonrotationaldisk,system-os_release.ID,system-os_release.VERSION_ID,system-os_release.VERSION_ID.major,system-os_release.VERSION_ID.minor
nfd.node.kubernetes.io/worker.version: v0.12.1
node.alpha.kubernetes.io/ttl: "0"
projectcalico.org/IPv4Address: 10.222.100.91/24
projectcalico.org/IPv4IPIPTunnelAddr: 100.119.232.64
volumes.kubernetes.io/controller-managed-attach-detach: "true"
creationTimestamp: "2023-09-23T02:36:25Z"
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: g4dn.xlarge
beta.kubernetes.io/os: linux
failure-domain.beta.kubernetes.io/region: us-east-1
failure-domain.beta.kubernetes.io/zone: us-east-1a
feature.node.kubernetes.io/cpu-cpuid.ADX: "true"
feature.node.kubernetes.io/cpu-cpuid.AESNI: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX2: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512BW: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512CD: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512F: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512VL: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI: "true"
feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8: "true"
feature.node.kubernetes.io/cpu-cpuid.FMA3: "true"
feature.node.kubernetes.io/cpu-cpuid.FXSR: "true"
feature.node.kubernetes.io/cpu-cpuid.FXSROPT: "true"
feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR: "true"
feature.node.kubernetes.io/cpu-cpuid.LAHF: "true"
feature.node.kubernetes.io/cpu-cpuid.MOVBE: "true"
feature.node.kubernetes.io/cpu-cpuid.MPX: "true"
feature.node.kubernetes.io/cpu-cpuid.OSXSAVE: "true"
feature.node.kubernetes.io/cpu-cpuid.SYSCALL: "true"
feature.node.kubernetes.io/cpu-cpuid.SYSEE: "true"
feature.node.kubernetes.io/cpu-cpuid.X87: "true"
feature.node.kubernetes.io/cpu-cpuid.XGETBV1: "true"
feature.node.kubernetes.io/cpu-cpuid.XSAVE: "true"
feature.node.kubernetes.io/cpu-cpuid.XSAVEC: "true"
feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT: "true"
feature.node.kubernetes.io/cpu-cpuid.XSAVES: "true"
feature.node.kubernetes.io/cpu-hardware_multithreading: "true"
feature.node.kubernetes.io/cpu-model.family: "6"
feature.node.kubernetes.io/cpu-model.id: "85"
feature.node.kubernetes.io/cpu-model.vendor_id: Intel
feature.node.kubernetes.io/kernel-config.NO_HZ: "true"
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE: "true"
feature.node.kubernetes.io/kernel-version.full: 5.15.125-flatcar
feature.node.kubernetes.io/kernel-version.major: "5"
feature.node.kubernetes.io/kernel-version.minor: "15"
feature.node.kubernetes.io/kernel-version.revision: "125"
feature.node.kubernetes.io/pci-10de.present: "true"
feature.node.kubernetes.io/pci-1d0f.present: "true"
feature.node.kubernetes.io/storage-nonrotationaldisk: "true"
feature.node.kubernetes.io/system-os_release.ID: flatcar
feature.node.kubernetes.io/system-os_release.VERSION_ID: 3510.2.7
feature.node.kubernetes.io/system-os_release.VERSION_ID.major: "3510"
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor: "2"
instance-group: cpu-g4dn-xlarge
kubernetes.io/arch: amd64
kubernetes.io/hostname: ip-10-222-100-91.ec2.internal
kubernetes.io/os: linux
kubernetes.io/role: node
our-registry.cloud/gpu: "true"
node-role.kubernetes.io/node: ""
node.kubernetes.io/instance-type: g4dn.xlarge
node.kubernetes.io/role: node
nvidia.com/cuda.driver.major: "525"
nvidia.com/cuda.driver.minor: "105"
nvidia.com/cuda.driver.rev: "17"
nvidia.com/cuda.runtime.major: "12"
nvidia.com/cuda.runtime.minor: "0"
nvidia.com/gfd.timestamp: "1695436816"
nvidia.com/gpu.compute.major: "7"
nvidia.com/gpu.compute.minor: "5"
nvidia.com/gpu.count: "1"
nvidia.com/gpu.deploy.container-toolkit: "true"
nvidia.com/gpu.deploy.dcgm: "true"
nvidia.com/gpu.deploy.dcgm-exporter: "true"
nvidia.com/gpu.deploy.device-plugin: "true"
nvidia.com/gpu.deploy.driver: "true"
nvidia.com/gpu.deploy.gpu-feature-discovery: "true"
nvidia.com/gpu.deploy.node-status-exporter: "true"
nvidia.com/gpu.deploy.nvsm: ""
nvidia.com/gpu.deploy.operator-validator: "true"
nvidia.com/gpu.family: turing
nvidia.com/gpu.machine: g4dn.xlarge
nvidia.com/gpu.memory: "15360"
nvidia.com/gpu.present: "true"
nvidia.com/gpu.product: Tesla-T4
nvidia.com/gpu.replicas: "1"
nvidia.com/mig.capable: "false"
nvidia.com/mig.strategy: single
topology.ebs.csi.aws.com/zone: us-east-1a
topology.kubernetes.io/region: us-east-1
topology.kubernetes.io/zone: us-east-1a
name: ip-10-222-100-91.ec2.internal
resourceVersion: "36894521"
uid: d5c9ddb2-3379-4c9f-942e-0b65d1162edb
spec:
podCIDR: 100.96.37.0/24
podCIDRs:
- 100.96.37.0/24
providerID: aws:///us-east-1a/i-054d7daae0d81b5ec
taints:
- effect: NoSchedule
key: gpu.kubernetes.io/gpu-exists
status:
addresses:
- address: 10.222.100.91
type: InternalIP
- address: ip-10-222-100-91.ec2.internal
type: Hostname
- address: ip-10-222-100-91.ec2.internal
type: InternalDNS
allocatable:
attachable-volumes-aws-ebs: "39"
cpu: "4"
ephemeral-storage: "88450615150"
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 15980652Ki
nvidia.com/gpu: "1"
pods: "110"
capacity:
attachable-volumes-aws-ebs: "39"
cpu: "4"
ephemeral-storage: 95975060Ki
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 16083052Ki
nvidia.com/gpu: "1"
pods: "110"
conditions:
- lastHeartbeatTime: "2023-09-23T02:37:03Z"
lastTransitionTime: "2023-09-23T02:37:03Z"
message: Calico is running on this node
reason: CalicoIsUp
status: "False"
type: NetworkUnavailable
- lastHeartbeatTime: "2023-09-23T02:40:51Z"
lastTransitionTime: "2023-09-23T02:36:25Z"
message: kubelet has sufficient memory available
reason: KubeletHasSufficientMemory
status: "False"
type: MemoryPressure
- lastHeartbeatTime: "2023-09-23T02:40:51Z"
lastTransitionTime: "2023-09-23T02:36:25Z"
message: kubelet has no disk pressure
reason: KubeletHasNoDiskPressure
status: "False"
type: DiskPressure
- lastHeartbeatTime: "2023-09-23T02:40:51Z"
lastTransitionTime: "2023-09-23T02:36:25Z"
message: kubelet has sufficient PID available
reason: KubeletHasSufficientPID
status: "False"
type: PIDPressure
- lastHeartbeatTime: "2023-09-23T02:40:51Z"
lastTransitionTime: "2023-09-23T02:36:57Z"
message: kubelet is posting ready status
reason: KubeletReady
status: "True"
type: Ready
daemonEndpoints:
kubeletEndpoint:
Port: 10250
images:
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/nvidia-kmods-driver-flatcar@sha256:3e83fc8abe394bb2a86577a2e936e425ec4c3952301cb12712f576ba2b642cb4
sizeBytes: 1138988828
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/dcgm-exporter@sha256:ae014d7f27c32ba83128ba31e2f8ab3a0910a46607e63d2ae7a90ae3551e3330
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/dcgm-exporter:3.1.7-3.1.4-ubuntu20.04
sizeBytes: 1059498968
- names:
- our-registry-cngccp-docker.jfrog.io/splunk/fluentd-hec@sha256:9f6b4642a22f8942bb4d6c5357ee768fe515fa21d49577b88ba12098c382656b
- our-registry-cngccp-docker.jfrog.io/splunk/fluentd-hec:1.2.8
sizeBytes: 315828956
- names:
- xpj245675755234.live.dynatrace.com/linux/oneagent@sha256:a44033e943518221fd657d033845c12850ba872d9e61616c192f406919b87bb3
- xpj245675755234.live.dynatrace.com/linux/oneagent:1.265.152
sizeBytes: 227902134
- names:
- nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:cab21c93987a5c884075efe0fb4a8abaa1997e1696cbc773ba69889f42f8329b
- nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.1
sizeBytes: 213778085
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/k8s-device-plugin@sha256:46ce950d29cd67351c37850cec6aafa718d346f181c956d73bec079f9d96fbc1
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/k8s-device-plugin:v0.14.0-ubi8
sizeBytes: 165982184
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-feature-discovery@sha256:b1c162fb5fce21a684b4e28dae2c37d60b2d3c47b7270dd0bce835b7ce9e5a24
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-feature-discovery:v0.8.0-ubi8
sizeBytes: 162038014
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator@sha256:f6bf463459a61aa67c5f9e4f4f97797609b85bf77aaef88b0e78536889a7e517
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/gpu-operator-validator:devel-ubi8
sizeBytes: 141870962
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit@sha256:91e028c8177b4896b7d79f08c64f3a84cb66a0f5a3f32b844d909ebbbd7e0369
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/container-toolkit:v1.13.0-ubuntu20.04
sizeBytes: 127160969
- names:
- docker.io/calico/cni@sha256:9a2c99f0314053aa11e971bd5d72e17951767bf5c6ff1fd9c38c4582d7cb8a0a
- docker.io/calico/cni:v3.25.1
sizeBytes: 89884044
- names:
- docker.io/calico/node@sha256:0cd00e83d06b3af8cd712ad2c310be07b240235ad7ca1397e04eb14d20dcc20f
- docker.io/calico/node:v3.25.1
sizeBytes: 88335791
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery@sha256:a498b39f2fd7435d8862a9a916ef6eb4d2a4d8d5b4c6788fb48bdb11b008e87a
- our-registry-cngccp-docker-k8s.jfrog.io/nvidia/node-feature-discovery:v0.12.1
sizeBytes: 73669012
- names:
- our-registry-cngccp-docker.jfrog.io/dynatrace/dynatrace-operator@sha256:ce621425125ba8fdcfa0f300c75e0167e9301a4654fcd1c14baa75f4d41151a3
- our-registry-cngccp-docker.jfrog.io/dynatrace/dynatrace-operator:v0.9.1
sizeBytes: 43133681
- names:
- public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver@sha256:2d1ecf57fcfde2403a66e7709ecbb55db6d2bfff64c5c71225c9fb101ffe9c30
- public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.18.0
sizeBytes: 30176686
- names:
- registry.k8s.io/kube-proxy@sha256:8d998d77a1fae5d933a7efea97faace684559d70a37a72dba7193ed84e1bc45d
- registry.k8s.io/kube-proxy:v1.26.7
sizeBytes: 21764578
- names:
- our-registry-cngccp-docker.jfrog.io/kube2iam@sha256:aba84ebec51b25a22ffbcf3fe1599dabb0c88d7de87f07f00b85b79ddd72d672
- our-registry-cngccp-docker.jfrog.io/kube2iam:imdsv2-fix
sizeBytes: 14666113
- names:
- docker.io/calico/node-driver-registrar@sha256:5954319e4dbf61aac2e704068e9f3cd083d67f630c08bc0d280863dbf01668bc
- docker.io/calico/node-driver-registrar:v3.25.1
sizeBytes: 11695360
- names:
- docker.io/calico/csi@sha256:1f17de674c15819408c02ea5699bc3afe75f3120fbaf9c23ad5bfa2bca01814c
- docker.io/calico/csi:v3.25.1
sizeBytes: 11053330
- names:
- docker.io/calico/pod2daemon-flexvol@sha256:66629150669c4ff7f70832858af28139407da59f61451a8658f15f06b9f20436
- docker.io/calico/pod2daemon-flexvol:v3.25.1
sizeBytes: 7167792
- names:
- public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar@sha256:6ad0cae2ae91453f283a44e9b430e475b8a9fa3d606aec9a8b09596fffbcd2c9
- public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.7.0-eks-1-26-7
sizeBytes: 6560300
- names:
- public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe@sha256:d9e11b42ae5f4f2f7ea9034e68040997cdbb04ae9e188aa897f76ae92698d78a
- public.ecr.aws/eks-distro/kubernetes-csi/livenessprobe:v2.9.0-eks-1-26-7
sizeBytes: 6086054
- names:
- our-registry-cngccp-docker-k8s.jfrog.io/logrotate@sha256:26454d4621f3ed8c1d048fbc3a25b31a00f45a4404c1d3716845cb154b571e3e
- our-registry-cngccp-docker-k8s.jfrog.io/logrotate:1.0_5469f66
sizeBytes: 5572108
- names:
- registry.k8s.io/pause@sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db
- registry.k8s.io/pause:3.6
sizeBytes: 301773
nodeInfo:
architecture: amd64
bootID: 1e548d4e-2bf3-4de0-ae7f-017980214214
containerRuntimeVersion: containerd://1.6.16
kernelVersion: 5.15.125-flatcar
kubeProxyVersion: v1.26.7
kubeletVersion: v1.26.7
machineID: ec27866d6a0f6aeff75511a5668b6a78
operatingSystem: linux
osImage: Flatcar Container Linux by Kinvolk 3510.2.7 (Oklo)
systemUUID: ec27866d-6a0f-6aef-f755-11a5668b6a78
How are you draining these nodes?
Please ensure --ignore-daemonsets is set to false when running the kubectl drain command.
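For concreteness, an invocation along those lines might look like this (a sketch only: the node name is taken from the listing above, and the extra flags are common choices rather than requirements):

```shell
# Drain the GPU node without skipping DaemonSet-managed pods.
# With --ignore-daemonsets=false, kubectl drain refuses to proceed while
# DaemonSet pods are still on the node instead of silently leaving them behind.
kubectl drain ip-10-222-100-91.ec2.internal \
  --ignore-daemonsets=false \
  --delete-emptydir-data \
  --timeout=300s
```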
@tariq1890 We directly terminate the backing EC2 instance. Up to k8s 1.24 this removed all of these NVIDIA pods, but on k8s 1.26 these 4 pods still show as Running even though the underlying instance has already been removed. Is there any parameter we need to pass for k8s 1.26?
@tariq1890 @cdesiniotis @shivamerla Please let me know how to fix this issue. DaemonSet pods are not being cleaned up when the cluster autoscaler terminates a node. Node removal should ideally delete all of the NVIDIA daemonset pods, which is not happening in our case.
@shivamerla Could you please help us understand the cause of this behavior? We are using Flatcar for the worker nodes.
@shivamerla @tariq1890 @cdesiniotis Could you please help us fix this behaviour? Because of it, the namespace shows pods that no longer exist, since the node has already been scaled down.
@shnigam2 Can you provide logs from the k8s controller-manager pod so we can check for errors while cleaning up these pods? Are you using images from a private registry (i.e., using pullSecrets)?
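If it helps, a quick way to pull those logs (assuming a kubeadm-style control plane where the controller-manager runs as a static pod in kube-system; the label selector is the conventional one, so adjust for your distribution):

```shell
# Tail recent kube-controller-manager logs and filter for pod GC activity.
kubectl -n kube-system logs -l component=kube-controller-manager --tail=500 \
  | grep -i gc_controller
```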
@shivamerla Yes, we are using a private registry. Please find the controller-manager log errors below:
I1110 02:29:52.043817 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/gpu-feature-discovery-krj5j"
E1110 02:29:52.048104 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/gpu-feature-discovery-krj5j; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.048189 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-driver-daemonset-vzcrj"
E1110 02:29:52.057008 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-driver-daemonset-vzcrj; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.057063 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-device-plugin-daemonset-xztw8"
E1110 02:29:52.061290 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-device-plugin-daemonset-xztw8; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.061316 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-device-plugin-daemonset-lwhk6"
E1110 02:29:52.065459 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-device-plugin-daemonset-lwhk6; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.065625 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-container-toolkit-daemonset-fzg45"
E1110 02:29:52.071929 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-container-toolkit-daemonset-fzg45; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.071967 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-container-toolkit-daemonset-wdlkq"
E1110 02:29:52.076635 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-container-toolkit-daemonset-wdlkq; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.076784 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/gpu-feature-discovery-bh6jn"
E1110 02:29:52.080977 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/gpu-feature-discovery-bh6jn; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
I1110 02:29:52.081028 1 gc_controller.go:329] "PodGC is force deleting Pod" pod="gpu-operator/nvidia-driver-daemonset-8vv6j"
E1110 02:29:52.085230 1 gc_controller.go:255] failed to create manager for existing fields: failed to convert new object (gpu-operator/nvidia-driver-daemonset-8vv6j; /v1, Kind=Pod) to smd typed: .spec.imagePullSecrets: duplicate entries for key [name="jfrog-auth"]
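The conversion failure above points at duplicate imagePullSecrets entries in the stuck pods' specs. One can dump .spec.imagePullSecrets of such a pod with kubectl and look for repeats; as a minimal offline sketch of that check, piping the secret names through sort | uniq -d reports exactly the name PodGC complains about:

```shell
# Secret names as they appear in a stuck pod's spec.imagePullSecrets
# (taken from the errors above), one per line; uniq -d prints any name
# that occurs more than once.
printf 'jfrog-auth\njfrog-auth\n' | sort | uniq -d
# prints: jfrog-auth
```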
@shivamerla Can you please check and help on this?
@shnigam2 We have a known issue which will be fixed in the next patch, v23.9.1 (later this month). The problem is that we are adding duplicate pullSecrets in the spec. You can avoid this by not specifying the pullSecret in ClusterPolicy for the validator image. We use the validator image as an initContainer, and thus ended up adding the same secret twice, once for the initContainer and once for the main container, in every DaemonSet.
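Following that advice, unsetting the validator's pull secret could look roughly like this (a sketch: the release name, namespace, and the validator.imagePullSecrets value path are assumptions to verify against your chart version):

```shell
# Re-deploy with the validator's pullSecret left unset so the operator does
# not inject the same secret twice per DaemonSet pod; --reuse-values keeps
# the rest of the existing configuration.
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --reuse-values \
  --set validator.imagePullSecrets=null
```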
Issue or feature description
gpu-feature-discovery, nvidia-container-toolkit-daemonset, nvidia-device-plugin-daemonset, and nvidia-driver-daemonset pods are not removed after the GPU node is drained from the cluster. The description of these pods shows:
Logs of k8s-driver-manager before terminating the GPU node:
Values we are passing to Helm:
Please let us know how to control this pod eviction when a GPU node is scaled down, as these pods show as Running even after the GPU node has been removed from the cluster.
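Until the patched operator is in place, one interim cleanup (manual, and only removes the stale API objects; the pod name is one of the stuck pods from the logs above) is to force-delete the leftovers, which is what PodGC was attempting before the conversion error stopped it:

```shell
# Force-remove a stale pod whose node no longer exists.
kubectl -n gpu-operator delete pod gpu-feature-discovery-krj5j \
  --force --grace-period=0
```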