ROCm / k8s-device-plugin

Kubernetes (k8s) device plugin to enable registration of AMD GPU to a container cluster
Apache License 2.0

[Issue]: Unable to Update ( 1.25.2.7 → 1.25.2.8 ) #76

Open KeyboardDabbler opened 3 days ago

KeyboardDabbler commented 3 days ago

Problem Description

I have 3 nodes, all with the same hardware spec, running Kubernetes on Talos. I deployed amd-device-plugin as a DaemonSet using a Helm chart. On tag v1.25.2.3 everything works: each node has access to its iGPU, which can be assigned to a pod.

kubectl -n kube-system get pods -o wide

NAME                                             READY   STATUS    RESTARTS       AGE   IP            NODE              NOMINATED NODE   READINESS GATES
amd-device-plugin-b5gsh                          1/1     Running   0              15h   10.69.2.122   black-knight-02   <none>           <none>
amd-device-plugin-d5rrd                          1/1     Running   0              15h   10.69.0.180   black-knight-03   <none>           <none>
amd-device-plugin-sf25x                          1/1     Running   0              15h   10.69.1.42    black-knight-01   <none>           <none>
amd-gpu-node-labeller-g8ntt                      1/1     Running   0              8h    10.69.1.30    black-knight-01   <none>           <none>
amd-gpu-node-labeller-xqvf8                      1/1     Running   0              8h    10.69.0.220   black-knight-03   <none>           <none>
amd-gpu-node-labeller-zz7wk                      1/1     Running   0              8h    10.69.2.227   black-knight-02   <none>           <none>

When I attempt to upgrade to any tag greater than 1.25.2.3, amd-device-plugin fails to deploy on node 3 (black-knight-03). From what I can tell, the image is detecting the wrong system architecture?
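One way to test the wrong-architecture theory is to look at which platforms the tag actually publishes. The following is a minimal sketch, assuming the JSON was produced by `docker manifest inspect docker.io/rocm/k8s-device-plugin:1.25.2.8` and that the tag is a multi-platform manifest list. The function name `published_platforms` and the sample data are hypothetical and only illustrate the shape of the check.

```python
import json


def published_platforms(manifest_json: str) -> list[str]:
    """Return "os/arch" strings listed in an image index / manifest list.

    A single-platform manifest has no "manifests" key, so the result is
    an empty list; in that case the architecture lives in the image config.
    """
    doc = json.loads(manifest_json)
    platforms = []
    for entry in doc.get("manifests", []):
        platform = entry.get("platform", {})
        os_name = platform.get("os", "?")
        arch = platform.get("architecture", "?")
        platforms.append(f"{os_name}/{arch}")
    return platforms


# Hypothetical sample, NOT the real manifest of rocm/k8s-device-plugin.
SAMPLE = json.dumps({
    "schemaVersion": 2,
    "manifests": [
        {"digest": "sha256:aaa", "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:bbb", "platform": {"os": "linux", "architecture": "arm64"}},
    ],
})

if __name__ == "__main__":
    print(published_platforms(SAMPLE))
```

If `linux/amd64` is listed, an amd64 node would normally be expected to select that entry, which would point away from the manifest itself as the cause.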

kubectl -n kube-system get pods -o wide

NAME                                             READY   STATUS             RESTARTS        AGE     IP            NODE              NOMINATED NODE   READINESS GATES
amd-device-plugin-6h7tt                          0/1     CrashLoopBackOff   2 (15s ago)     29s     10.69.0.201   black-knight-03   <none>           <none>
amd-device-plugin-l956d                          1/1     Running            0               29s     10.69.2.219   black-knight-02   <none>           <none>
amd-device-plugin-nqv5f                          1/1     Running            0               29s     10.69.1.219   black-knight-01   <none>           <none>
amd-gpu-node-labeller-h2l6w                      1/1     Running            0               29s     10.69.0.25    black-knight-03   <none>           <none>
amd-gpu-node-labeller-kzw9q                      1/1     Running            0               29s     10.69.2.157   black-knight-02   <none>           <none>
amd-gpu-node-labeller-sdhg5                      1/1     Running            0               29s     10.69.1.9     black-knight-01   <none>           <none>

kubectl describe pod amd-device-plugin-6h7tt -n kube-system

Name:                 amd-device-plugin-6h7tt
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 black-knight-03/10.0.10.27
Start Time:           Tue, 08 Oct 2024 09:19:09 +0000
Labels:               app.kubernetes.io/component=amd-device-plugin
                      app.kubernetes.io/instance=amd-device-plugin
                      app.kubernetes.io/name=amd-device-plugin
                      controller-revision-hash=599d6ffccd
                      pod-template-generation=34
Annotations:          <none>
Status:               Running
IP:                   10.69.0.201
IPs:
  IP:           10.69.0.201
Controlled By:  DaemonSet/amd-device-plugin
Containers:
  app:
    Container ID:  containerd://aa040e5b78f93ad1bb16b2d032348941f0f10de1a71c347b66cc313a74be9e1a
    Image:         docker.io/rocm/k8s-device-plugin:1.25.2.8
    Image ID:      docker.io/rocm/k8s-device-plugin@sha256:f3835498cf2274e0a07c32b38c166c05a876f8eb776d756cc06805e599a3ba5f
    Port:          <none>
    Host Port:     <none>
    Command:
      ./k8s-device-plugin
    Args:
      -logtostderr=true
      -stderrthreshold=INFO
      -v=5
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Tue, 08 Oct 2024 09:19:51 +0000
      Finished:     Tue, 08 Oct 2024 09:19:51 +0000
    Ready:          False
    Restart Count:  3
    Limits:
      memory:  100Mi
    Requests:
      cpu:     10m
      memory:  10Mi
    Environment:
      TZ:  Pacific/Auckland
    Mounts:
      /sys from sys (rw)
      /var/lib/kubelet/device-plugins from device-plugins (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xrdfd (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  device-plugins:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:
  kube-api-access-xrdfd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              feature.node.kubernetes.io/pci-0300_1002.present=true
                             kubernetes.io/arch=amd64
Tolerations:                 CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  60s                default-scheduler  Successfully assigned kube-system/amd-device-plugin-6h7tt to black-knight-03
  Normal   Pulled     19s (x4 over 60s)  kubelet            Container image "docker.io/rocm/k8s-device-plugin:1.25.2.8" already present on machine
  Normal   Created    19s (x4 over 60s)  kubelet            Created container app
  Normal   Started    19s (x4 over 60s)  kubelet            Started container app
  Warning  BackOff    7s (x5 over 58s)   kubelet            Back-off restarting failed container app in pod amd-device-plugin-6h7tt_kube-system(1d6ae128-c781-41ab-b106-e659b1464cfa)

kubectl -n kube-system logs amd-device-plugin-6h7tt -f

exec ./k8s-device-plugin: exec format error

kubectl describe daemonset amd-device-plugin -n kube-system

Name:           amd-device-plugin
Selector:       app.kubernetes.io/component=amd-device-plugin,app.kubernetes.io/instance=amd-device-plugin,app.kubernetes.io/name=amd-device-plugin
Node-Selector:  feature.node.kubernetes.io/pci-0300_1002.present=true,kubernetes.io/arch=amd64
Labels:         app.kubernetes.io/component=amd-device-plugin
                app.kubernetes.io/instance=amd-device-plugin
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=amd-device-plugin
                helm.sh/chart=app-template-3.5.0
                helm.toolkit.fluxcd.io/name=amd-device-plugin
                helm.toolkit.fluxcd.io/namespace=kube-system
Annotations:    deprecated.daemonset.template.generation: 34
                meta.helm.sh/release-name: amd-device-plugin
                meta.helm.sh/release-namespace: kube-system
Desired Number of Nodes Scheduled: 3
Current Number of Nodes Scheduled: 3
Number of Nodes Scheduled with Up-to-date Pods: 3
Number of Nodes Scheduled with Available Pods: 2
Number of Nodes Misscheduled: 0
Pods Status:  3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app.kubernetes.io/component=amd-device-plugin
                    app.kubernetes.io/instance=amd-device-plugin
                    app.kubernetes.io/name=amd-device-plugin
  Service Account:  default
  Containers:
   app:
    Image:      docker.io/rocm/k8s-device-plugin:1.25.2.8
    Port:       <none>
    Host Port:  <none>
    Command:
      ./k8s-device-plugin
    Args:
      -logtostderr=true
      -stderrthreshold=INFO
      -v=5
    Limits:
      memory:  100Mi
    Requests:
      cpu:     10m
      memory:  10Mi
    Environment:
      TZ:  Pacific/Auckland
    Mounts:
      /sys from sys (rw)
      /var/lib/kubelet/device-plugins from device-plugins (rw)
  Volumes:
   device-plugins:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
   sys:
    Type:               HostPath (bare host directory volume)
    Path:               /sys
    HostPathType:
  Priority Class Name:  system-node-critical
  Node-Selectors:       feature.node.kubernetes.io/pci-0300_1002.present=true
                        kubernetes.io/arch=amd64
  Tolerations:          CriticalAddonsOnly op=Exists
Events:
  Type    Reason            Age                 From                  Message
  ----    ------            ----                ----                  -------
  Normal  SuccessfulDelete  59m                 daemonset-controller  Deleted pod: amd-device-plugin-qv4kv
  Normal  SuccessfulDelete  59m                 daemonset-controller  Deleted pod: amd-device-plugin-2m5bz
  Normal  SuccessfulCreate  59m                 daemonset-controller  Created pod: amd-device-plugin-c8cn4
  Normal  SuccessfulCreate  59m                 daemonset-controller  Created pod: amd-device-plugin-nkg7f
  Normal  SuccessfulCreate  54m                 daemonset-controller  Created pod: amd-device-plugin-2kldd
  Normal  SuccessfulCreate  54m                 daemonset-controller  Created pod: amd-device-plugin-lgfdf
  Normal  SuccessfulDelete  54m                 daemonset-controller  Deleted pod: amd-device-plugin-nkg7f
  Normal  SuccessfulDelete  54m                 daemonset-controller  Deleted pod: amd-device-plugin-c8cn4
  Normal  SuccessfulDelete  53m                 daemonset-controller  Deleted pod: amd-device-plugin-rv4l8
  Normal  SuccessfulDelete  53m                 daemonset-controller  Deleted pod: amd-device-plugin-2kldd
  Normal  SuccessfulCreate  53m                 daemonset-controller  Created pod: amd-device-plugin-xbtbm
  Normal  SuccessfulCreate  53m                 daemonset-controller  Created pod: amd-device-plugin-4mgq8
  Normal  SuccessfulDelete  48m                 daemonset-controller  Deleted pod: amd-device-plugin-xbtbm
  Normal  SuccessfulDelete  48m                 daemonset-controller  Deleted pod: amd-device-plugin-4mgq8
  Normal  SuccessfulCreate  48m                 daemonset-controller  Created pod: amd-device-plugin-w2xnm
  Normal  SuccessfulCreate  48m                 daemonset-controller  Created pod: amd-device-plugin-tc486
  Normal  SuccessfulDelete  48m                 daemonset-controller  Deleted pod: amd-device-plugin-lgfdf
  Normal  SuccessfulCreate  48m                 daemonset-controller  Created pod: amd-device-plugin-n8bkz
  Normal  SuccessfulCreate  43m (x21 over 28d)  daemonset-controller  (combined from similar events): Created pod: amd-device-plugin-79rbf
  Normal  SuccessfulDelete  10m (x38 over 28d)  daemonset-controller  (combined from similar events): Deleted pod: amd-device-plugin-z6n7b

talosctl dmesg -n black-knight-02 | grep -i amdgpu

black-knight-02: user: warning: [2024-10-08T04:00:29.037121313Z]: [talos] [initramfs] enabling system extension amdgpu-firmware 20240513
black-knight-02: kern:    info: [2024-10-08T04:00:34.208534313Z]: [drm] amdgpu kernel modesetting enabled.
black-knight-02: kern:    info: [2024-10-08T04:00:34.216765313Z]: amdgpu: Virtual CRAT table created for CPU
black-knight-02: kern:    info: [2024-10-08T04:00:34.217421313Z]: amdgpu: Topology: Add CPU node
black-knight-02: kern:    info: [2024-10-08T04:00:34.218049313Z]: amdgpu 0000:e5:00.0: enabling device (0006 -> 0007)
black-knight-02: kern:    info: [2024-10-08T04:00:34.229941313Z]: amdgpu 0000:e5:00.0: amdgpu: Fetched VBIOS from VFCT
black-knight-02: kern:    info: [2024-10-08T04:00:34.230669313Z]: amdgpu: ATOM BIOS: 113-REMBRANDT-X37
black-knight-02: kern:    info: [2024-10-08T04:00:34.233905313Z]: amdgpu 0000:e5:00.0: vgaarb: deactivate vga console
black-knight-02: kern:    info: [2024-10-08T04:00:34.234649313Z]: amdgpu 0000:e5:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
black-knight-02: kern:    info: [2024-10-08T04:00:34.236985313Z]: amdgpu 0000:e5:00.0: amdgpu: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
black-knight-02: kern:    info: [2024-10-08T04:00:34.238133313Z]: amdgpu 0000:e5:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
black-knight-02: kern:    info: [2024-10-08T04:00:34.239159313Z]: amdgpu 0000:e5:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
black-knight-02: kern:    info: [2024-10-08T04:00:34.241456313Z]: [drm] amdgpu: 512M of VRAM memory ready
black-knight-02: kern:    info: [2024-10-08T04:00:34.242076313Z]: [drm] amdgpu: 31762M of GTT memory ready.
black-knight-02: kern:    info: [2024-10-08T04:00:34.247748313Z]: amdgpu 0000:e5:00.0: amdgpu: Will use PSP to load VCN firmware
black-knight-02: kern:    info: [2024-10-08T04:00:34.425503313Z]: amdgpu 0000:e5:00.0: amdgpu: RAS: optional ras ta ucode is not available
black-knight-02: kern:    info: [2024-10-08T04:00:34.437710313Z]: amdgpu 0000:e5:00.0: amdgpu: RAP: optional rap ta ucode is not available
black-knight-02: kern:    info: [2024-10-08T04:00:34.438651313Z]: amdgpu 0000:e5:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
black-knight-02: kern:    info: [2024-10-08T04:00:34.442337313Z]: amdgpu 0000:e5:00.0: amdgpu: SMU is initialized successfully!
black-knight-02: kern:    info: [2024-10-08T04:00:34.459122313Z]: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
black-knight-02: kern:    info: [2024-10-08T04:00:34.459818313Z]: kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
black-knight-02: kern:    info: [2024-10-08T04:00:34.461457313Z]: amdgpu: Virtual CRAT table created for GPU
black-knight-02: kern:    info: [2024-10-08T04:00:34.462616313Z]: amdgpu: Topology: Add dGPU node [0x1681:0x1002]
black-knight-02: kern:    info: [2024-10-08T04:00:34.463288313Z]: kfd kfd: amdgpu: added device 1002:1681
black-knight-02: kern:    info: [2024-10-08T04:00:34.463891313Z]: amdgpu 0000:e5:00.0: amdgpu: SE 1, SH per SE 2, CU per SH 6, active_cu_number 12
black-knight-02: kern:    info: [2024-10-08T04:00:34.465051313Z]: amdgpu 0000:e5:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.465961313Z]: amdgpu 0000:e5:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.466875313Z]: amdgpu 0000:e5:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.467788313Z]: amdgpu 0000:e5:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.468704313Z]: amdgpu 0000:e5:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.469631313Z]: amdgpu 0000:e5:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.470554313Z]: amdgpu 0000:e5:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.471475313Z]: amdgpu 0000:e5:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.472396313Z]: amdgpu 0000:e5:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.473332313Z]: amdgpu 0000:e5:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.474289313Z]: amdgpu 0000:e5:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.475190313Z]: amdgpu 0000:e5:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 8
black-knight-02: kern:    info: [2024-10-08T04:00:34.476126313Z]: amdgpu 0000:e5:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 8
black-knight-02: kern:    info: [2024-10-08T04:00:34.477095313Z]: amdgpu 0000:e5:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 8
black-knight-02: kern:    info: [2024-10-08T04:00:34.478061313Z]: amdgpu 0000:e5:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
black-knight-02: kern:    info: [2024-10-08T04:00:34.480279313Z]: [drm] Initialized amdgpu 3.54.0 20150101 for 0000:e5:00.0 on minor 0
black-knight-02: kern:    info: [2024-10-08T04:00:34.488976313Z]: amdgpu 0000:e5:00.0: [drm] Cannot find any crtc or sizes

talosctl dmesg -n black-knight-03 | grep -i amdgpu

Operating System

Talos v1.8.0

CPU

AMD 6850U CPU with Radeon Graphics

GPU

AMD Radeon VII

ROCm Version

ROCm 6.2.0

ROCm Component

No response

Steps to Reproduce

Upgrade docker.io/rocm/k8s-device-plugin ( 1.25.2.3 → 1.25.2.8).

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 PRO 6850U with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 PRO 6850U with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   4768                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    65047108(0x3e08a44) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65047108(0x3e08a44) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    65047108(0x3e08a44) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1035                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5761(0x1681)                       
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2200                               
  BDFID:                   58624                              
  Internal Node ID:        1                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 118                                
  SDMA engine uCode::      47                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1035         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done *** 

Additional Information

kubectl get nodes -o wide

NAME              STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
black-knight-01   Ready    control-plane   53d   v1.30.5   10.0.10.25    <none>        Talos (v1.8.0)   6.6.52-talos     containerd://2.0.0-rc.4
black-knight-02   Ready    control-plane   53d   v1.30.5   10.0.10.26    <none>        Talos (v1.8.0)   6.6.52-talos     containerd://2.0.0-rc.4
black-knight-03   Ready    control-plane   53d   v1.30.5   10.0.10.27    <none>        Talos (v1.8.0)   6.6.52-talos     containerd://2.0.0-rc.4

kubectl version

Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.30.5
Kubecolor Version: 0.4.0

kubectl get no -o json | jq ".items[].metadata.labels"

{
  "beta.amd.com/gpu.cu-count.12": "1",
  "beta.amd.com/gpu.device-id.1681": "1",
  "beta.amd.com/gpu.simd-count.24": "1",
  "beta.amd.com/gpu.vram.1G": "1",
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "extensions.talos.dev/amd-ucode": "20240909",
  "extensions.talos.dev/amdgpu-firmware": "20240909",
  "extensions.talos.dev/modules.dep": "6.6.52-talos",
  "extensions.talos.dev/realtek-firmware": "20240909",
  "extensions.talos.dev/thunderbolt": "v1.8.0",
  "feature.node.kubernetes.io/pci-0300_1002.present": "true",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "black-knight-01",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/control-plane": ""
}
{
  "beta.amd.com/gpu.cu-count.12": "1",
  "beta.amd.com/gpu.device-id.1681": "1",
  "beta.amd.com/gpu.simd-count.24": "1",
  "beta.amd.com/gpu.vram.1G": "1",
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "extensions.talos.dev/amd-ucode": "20240909",
  "extensions.talos.dev/amdgpu-firmware": "20240909",
  "extensions.talos.dev/modules.dep": "6.6.52-talos",
  "extensions.talos.dev/realtek-firmware": "20240909",
  "extensions.talos.dev/thunderbolt": "v1.8.0",
  "feature.node.kubernetes.io/pci-0300_1002.present": "true",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "black-knight-02",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/control-plane": ""
}
{
  "beta.amd.com/gpu.cu-count.12": "1",
  "beta.amd.com/gpu.device-id.1681": "1",
  "beta.amd.com/gpu.simd-count.24": "1",
  "beta.amd.com/gpu.vram.1G": "1",
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "extensions.talos.dev/amd-ucode": "20240909",
  "extensions.talos.dev/amdgpu-firmware": "20240909",
  "extensions.talos.dev/modules.dep": "6.6.52-talos",
  "extensions.talos.dev/realtek-firmware": "20240909",
  "extensions.talos.dev/thunderbolt": "v1.8.0",
  "feature.node.kubernetes.io/pci-0300_1002.present": "true",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "black-knight-03",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/control-plane": ""
}

kubectl get nodes -o=jsonpath='{.items[*].status.nodeInfo.architecture}'

amd64 amd64 amd64

y2kenny commented 3 days ago

Can you try to get the crash log with kubectl logs <podname> --previous?

KeyboardDabbler commented 3 days ago

> Can you try to get the crash log with kubectl logs <podname> --previous?

v1.25.2.3
kubectl -n kube-system get pods -o wide
NAME                                             READY   STATUS    RESTARTS       AGE   IP            NODE              NOMINATED NODE   READINESS GATES
amd-device-plugin-b5gsh                          1/1     Running   0              18h   10.69.2.122   black-knight-02   <none>           <none>
amd-device-plugin-d5rrd                          1/1     Running   0              18h   10.69.0.180   black-knight-03   <none>           <none>
amd-device-plugin-sf25x                          1/1     Running   0              18h   10.69.1.42    black-knight-01   <none>           <none>
amd-gpu-node-labeller-g8ntt                      1/1     Running   0              12h   10.69.1.30    black-knight-01   <none>           <none>
amd-gpu-node-labeller-xqvf8                      1/1     Running   0              12h   10.69.0.220   black-knight-03   <none>           <none>
amd-gpu-node-labeller-zz7wk                      1/1     Running   0              12h   10.69.2.227   black-knight-02   <none>           <none>

kubectl logs amd-device-plugin-d5rrd -n kube-system
I1009 04:38:33.064708       1 main.go:305] AMD GPU device plugin for Kubernetes
I1009 04:38:33.064751       1 main.go:305] ./k8s-device-plugin version v1.18.1-20-gb8f1ee8
I1009 04:38:33.064756       1 main.go:305] hwloc: _VERSION: 2.9.1, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I1009 04:38:33.064770       1 manager.go:42] Starting device plugin manager
I1009 04:38:33.064777       1 manager.go:46] Registering for system signal notifications
I1009 04:38:33.064892       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
I1009 04:38:33.064946       1 manager.go:60] Starting Discovery on new plugins
I1009 04:38:33.064955       1 manager.go:66] Handling incoming signals
I1009 04:38:33.064964       1 manager.go:71] Received new list of plugins: [gpu]
I1009 04:38:33.065009       1 manager.go:110] Adding a new plugin "gpu"
I1009 04:38:33.065018       1 plugin.go:64] gpu: Starting plugin server
I1009 04:38:33.065023       1 plugin.go:94] gpu: Starting the DPI gRPC server
I1009 04:38:33.065321       1 plugin.go:112] gpu: Serving requests...
I1009 04:38:43.067432       1 plugin.go:128] gpu: Registering the DPI with Kubelet
I1009 04:38:43.068096       1 plugin.go:140] gpu: Registration for endpoint amd.com_gpu
I1009 04:38:43.069980       1 amdgpu.go:100] /sys/module/amdgpu/drivers/pci:amdgpu/0000:e5:00.0
I1009 04:38:43.209343       1 main.go:149] Watching GPU with bus ID: 0000:e5:00.0 NUMA Node: []
E1009 04:38:43.209357       1 main.go:151] No NUMA node found with bus ID: 0000:e5:00.0

v1.25.2.8
kubectl -n kube-system get pods -o wide
NAME                                             READY   STATUS             RESTARTS        AGE     IP            NODE              NOMINATED NODE   READINESS GATES
amd-device-plugin-5htvl                          1/1     Running            0               5s      10.69.2.71    black-knight-02   <none>           <none>
amd-device-plugin-gw4d4                          0/1     CrashLoopBackOff   4 (58s ago)     2m23s   10.69.0.161   black-knight-03   <none>           <none>
amd-device-plugin-sf25x                          1/1     Running            0               18h     10.69.1.42    black-knight-01   <none>           <none>
amd-gpu-node-labeller-g8ntt                      1/1     Running            0               12h     10.69.1.30    black-knight-01   <none>           <none>
amd-gpu-node-labeller-xqvf8                      1/1     Running            0               12h     10.69.0.220   black-knight-03   <none>           <none>
amd-gpu-node-labeller-zz7wk                      1/1     Running            0               12h     10.69.2.227   black-knight-02   <none>           <none>

kubectl logs amd-device-plugin-gw4d4 -n kube-system --previous
exec ./k8s-device-plugin: exec format error

y2kenny commented 3 days ago

Ah, OK... sorry, I misunderstood what you had before. This is weird... let me look into it.

y2kenny commented 3 days ago

Can you help narrow down the start of the issue a bit? i.e. do you see the same issue with 1.25.2.4 and .5? (I don't have a Talos setup to reproduce and I am able to use the plugin tip of tree.)

KeyboardDabbler commented 3 days ago

> Can you help narrow down the start of the issue a bit? i.e. do you see the same issue with 1.25.2.4 and .5? (I don't have a Talos setup to reproduce and I am able to use the plugin tip of tree.)

Sorry, I thought I had tested v1.25.2.4, but I suspect I didn't allow enough time for Flux to pick up the commit.

I have now tried the following tags:

v1.25.2.2 ✔
v1.25.2.3 ✔
v1.25.2.4 ✔
v1.25.2.5 ✔
v1.25.2.6 ✔
v1.25.2.7 ✔
v1.25.2.8 ✖ (failing on node 3)

Thanks for looking into this. Hopefully, this helps narrow down the issue to the latest changes!
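For context, `exec format error` is what the kernel returns when it cannot recognise the format of the file it was asked to execute; a binary built for another CPU architecture and a truncated or damaged file are two common causes. With the failure isolated to one tag, it may help to inspect the header of the `k8s-device-plugin` binary taken from the 1.25.2.8 image. The following is a minimal sketch, assuming the binary has already been copied out of the image to a local file. The helper names `elf_machine` and `elf_machine_of_file` are hypothetical.

```python
import struct

# e_machine values from the ELF specification (small subset).
ELF_MACHINES = {0x03: "x86", 0x28: "arm", 0x3E: "x86_64", 0xB7: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Describe the target architecture from the first 20 bytes of a file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return "not an ELF file (truncated, corrupted, or a script)"
    # EI_DATA (byte 5): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown (0x{machine:x})")


def elf_machine_of_file(path: str) -> str:
    """Read only the start of the file at `path` and classify it."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20))
```

Calling `elf_machine_of_file("./k8s-device-plugin")` on the extracted binary (the path is an assumption) would be expected to return `x86_64` for a healthy amd64 build; any other answer would be worth reporting back in this thread.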

y2kenny commented 3 days ago

Um... I am not able to reproduce in my setup: [image]

Actually, I just noticed this... you said it is failing on node 3. Does that mean the plugin is working on the other nodes? If that's the case, this doesn't seem like a plugin issue.
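One way to follow up on that is to check whether every node is running the same image content. The pod events earlier in the thread report the image as "already present on machine", so each kubelet starts whatever copy is in its local cache. The following is a minimal sketch, assuming the input is the output of `kubectl -n kube-system get pods -o json`. The function name `image_ids_by_node` and the sample data (with shortened, made-up digests) are hypothetical.

```python
import json
from collections import defaultdict


def image_ids_by_node(pod_list_json: str, name_prefix: str) -> dict[str, set[str]]:
    """Map node name -> set of container image IDs for pods matching a prefix."""
    result: dict[str, set[str]] = defaultdict(set)
    for pod in json.loads(pod_list_json).get("items", []):
        if not pod["metadata"]["name"].startswith(name_prefix):
            continue
        node = pod["spec"].get("nodeName", "<unscheduled>")
        for status in pod.get("status", {}).get("containerStatuses", []):
            result[node].add(status.get("imageID", "<unknown>"))
    return dict(result)


# Hypothetical sample with made-up digests, for illustration only.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "amd-device-plugin-l956d"},
     "spec": {"nodeName": "black-knight-02"},
     "status": {"containerStatuses": [
         {"imageID": "docker.io/rocm/k8s-device-plugin@sha256:1111"}]}},
    {"metadata": {"name": "amd-device-plugin-6h7tt"},
     "spec": {"nodeName": "black-knight-03"},
     "status": {"containerStatuses": [
         {"imageID": "docker.io/rocm/k8s-device-plugin@sha256:1111"}]}},
]})

if __name__ == "__main__":
    grouped = image_ids_by_node(SAMPLE, "amd-device-plugin")
    for node, ids in sorted(grouped.items()):
        print(node, sorted(ids))
```

If all nodes report the same digest and only one of them cannot execute the binary, the difference is more likely local to that node (for example its cached copy of the image) than in the published image, though that remains a guess until the node is examined.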