Open KeyboardDabbler opened 1 month ago
Can you try to get the crash log with kubectl logs <podname> --previous ?
v1.25.2.3
kubectl -n kube-system get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
amd-device-plugin-b5gsh 1/1 Running 0 18h 10.69.2.122 black-knight-02 <none> <none>
amd-device-plugin-d5rrd 1/1 Running 0 18h 10.69.0.180 black-knight-03 <none> <none>
amd-device-plugin-sf25x 1/1 Running 0 18h 10.69.1.42 black-knight-01 <none> <none>
amd-gpu-node-labeller-g8ntt 1/1 Running 0 12h 10.69.1.30 black-knight-01 <none> <none>
amd-gpu-node-labeller-xqvf8 1/1 Running 0 12h 10.69.0.220 black-knight-03 <none> <none>
amd-gpu-node-labeller-zz7wk 1/1 Running 0 12h 10.69.2.227 black-knight-02 <none> <none>
[Docker] ❯ kubectl logs amd-device-plugin-d5rrd -n kube-system
I1009 04:38:33.064708 1 main.go:305] AMD GPU device plugin for Kubernetes
I1009 04:38:33.064751 1 main.go:305] ./k8s-device-plugin version v1.18.1-20-gb8f1ee8
I1009 04:38:33.064756 1 main.go:305] hwloc: _VERSION: 2.9.1, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I1009 04:38:33.064770 1 manager.go:42] Starting device plugin manager
I1009 04:38:33.064777 1 manager.go:46] Registering for system signal notifications
I1009 04:38:33.064892 1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I1009 04:38:33.064946 1 manager.go:60] Starting Discovery on new plugins
I1009 04:38:33.064955 1 manager.go:66] Handling incoming signals
I1009 04:38:33.064964 1 manager.go:71] Received new list of plugins: [gpu]
I1009 04:38:33.065009 1 manager.go:110] Adding a new plugin "gpu"
I1009 04:38:33.065018 1 plugin.go:64] gpu: Starting plugin server
I1009 04:38:33.065023 1 plugin.go:94] gpu: Starting the DPI gRPC server
I1009 04:38:33.065321 1 plugin.go:112] gpu: Serving requests...
I1009 04:38:43.067432 1 plugin.go:128] gpu: Registering the DPI with Kubelet
I1009 04:38:43.068096 1 plugin.go:140] gpu: Registration for endpoint amd.com_gpu
I1009 04:38:43.069980 1 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:e5:00.0
I1009 04:38:43.209343 1 main.go:149] Watching GPU with bus ID: 0000:e5:00.0 NUMA Node: []
E1009 04:38:43.209357 1 main.go:151] No NUMA node found with bus ID: 0000:e5:00.0
v1.25.2.8
kubectl -n kube-system get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
amd-device-plugin-5htvl 1/1 Running 0 5s 10.69.2.71 black-knight-02 <none> <none>
amd-device-plugin-gw4d4 0/1 CrashLoopBackOff 4 (58s ago) 2m23s 10.69.0.161 black-knight-03 <none> <none>
amd-device-plugin-sf25x 1/1 Running 0 18h 10.69.1.42 black-knight-01 <none> <none>
amd-gpu-node-labeller-g8ntt 1/1 Running 0 12h 10.69.1.30 black-knight-01 <none> <none>
amd-gpu-node-labeller-xqvf8 1/1 Running 0 12h 10.69.0.220 black-knight-03 <none> <none>
amd-gpu-node-labeller-zz7wk 1/1 Running 0 12h 10.69.2.227 black-knight-02 <none> <none>
⬢ [Docker] ❯ kubectl logs amd-device-plugin-gw4d4 -n kube-system --previous
exec ./k8s-device-plugin: exec format error
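For context (not stated in the thread itself): "exec format error" is the kernel's ENOEXEC, raised when the entrypoint binary is not a valid executable for that node — most commonly because the image layer that was pulled is built for a different CPU architecture. The same error can be reproduced locally with a deliberately truncated ELF header (the /tmp path and header bytes here are purely illustrative):

```shell
# Write a file that starts with the ELF magic but is truncated, so
# execve() fails with ENOEXEC; the shell then reports a format error,
# just like the kubelet does when the container entrypoint is run.
printf '\177ELF\002\001\001\000' > /tmp/fake-plugin
chmod +x /tmp/fake-plugin
msg=$(/tmp/fake-plugin 2>&1 || true)
echo "$msg"   # typically ends in "Exec format error"
```

On a real cluster, the equivalent check would be comparing each node's `status.nodeInfo.architecture` with the platforms listed by `docker manifest inspect` for the pulled tag.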
Ah, OK... sorry, I misunderstood what you had before. This is weird... let me look into it.
Can you help narrow down the start of the issue a bit? i.e. do you see the same issue with 1.25.2.4 and .5? (I don't have a Talos setup to reproduce and I am able to use the plugin tip of tree.)
Sorry, I thought I had tested v1.25.2.4, but I suspect I didn't allow enough time for Flux to update the commit.
I have now tried the following tags:
v1.25.2.2 ✔
v1.25.2.3 ✔
v1.25.2.4 ✔
v1.25.2.5 ✔
v1.25.2.6 ✔
v1.25.2.7 ✔
v1.25.2.8 ✖ (failing on node 3)
Thanks for looking into this. Hopefully, this helps narrow down the issue to the latest changes!
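The tag-by-tag test above could also be scripted instead of waiting on Flux for each commit. A hypothetical sketch — the DaemonSet name `amd-device-plugin` and container name `k8s-device-plugin` are assumptions based on the pod names in this thread, and this deliberately bypasses GitOps for the duration of the bisection (requires live cluster access, so it is not runnable here):

```shell
# Roll each candidate tag onto the DaemonSet and report whether the
# rollout completes; a tag that never goes Ready is the first bad one.
for tag in v1.25.2.2 v1.25.2.3 v1.25.2.4 v1.25.2.5 v1.25.2.6 v1.25.2.7 v1.25.2.8; do
  kubectl -n kube-system set image daemonset/amd-device-plugin \
    k8s-device-plugin=docker.io/rocm/k8s-device-plugin:$tag
  if kubectl -n kube-system rollout status daemonset/amd-device-plugin --timeout=120s; then
    echo "$tag ok"
  else
    echo "$tag failed"
  fi
done
```

Remember to let Flux reconcile (or suspend the Kustomization/HelmRelease) afterwards so the manual image override doesn't fight the GitOps state.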
Um... I am not able to reproduce this in my setup:
Actually, I just noticed this... you said it is failing on node 3. Does that mean the plugin is working on the other nodes? If that's the case, this doesn't seem like a plugin issue.
Problem Description
I have 3 nodes, all with the same hardware spec, running Kubernetes on Talos. I deployed amd-device-plugin as a DaemonSet using the Helm chart. On tag v1.25.2.3 everything works: each node has access to the iGPU, and it can be assigned to a pod.
kubectl -n kube-system get pods -o wide
When I attempt to upgrade to any tag greater than v1.25.2.3, amd-device-plugin fails to deploy on node 3. From what I can tell, the image is being run for the wrong system architecture?
kubectl -n kube-system get pods -o wide
kubectl describe pod amd-device-plugin-6h7tt -n kube-system
kubectl -n kube-system logs amd-device-plugin-6h7tt -f
talosctl dmesg -n black-knight-02 | grep -i amdgpu
Operating System
Talos v1.8.0
CPU
AMD 6850U CPU with Radeon Graphics
GPU
AMD Radeon VII
ROCm Version
ROCm 6.2.0
ROCm Component
No response
Steps to Reproduce
Upgrade docker.io/rocm/k8s-device-plugin (v1.25.2.3 → v1.25.2.8).
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Additional Information
kubectl get nodes -o wide
kubectl version
kubectl get no -o json | jq ".items[].metadata.labels"
kubectl get nodes -o=jsonpath='{.items[*].status.nodeInfo.architecture}'
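Not part of the original report, but the commands above point at the key cross-check: each node's kubelet-reported CPU architecture versus the platforms the image manifest actually publishes for the failing tag. A sketch (assumes cluster access and a local Docker CLI with manifest support, so it is not runnable outside that environment; the tag is the failing one from this report):

```shell
# Per-node architecture as seen by the kubelet:
kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture

# Platforms published for the failing tag. Every node architecture from
# the command above should appear in this list; if not, the image pulled
# onto that node cannot execute there ("exec format error"):
docker manifest inspect docker.io/rocm/k8s-device-plugin:1.25.2.8
```

If all three nodes report the same architecture and the manifest covers it, the mismatch is more likely a corrupted pull or a stale cached layer on node 3 than a plugin build problem.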