ROCm / k8s-device-plugin

Kubernetes (k8s) device plugin to enable registration of AMD GPU to a container cluster
Apache License 2.0

Runtime Error with AMD GPU Helm Chart Installation in Kubernetes #48

Open maarten-blokker opened 9 months ago

maarten-blokker commented 9 months ago

I am getting a runtime error when installing the AMD GPU Helm Chart in my Kubernetes cluster. The pod spawned by the DaemonSet fails to run, and the log shows a nil pointer dereference reported as a segmentation violation (SIGSEGV), perhaps related to a permission issue?

I0113 15:54:58.192567       1 main.go:305] AMD GPU device plugin for Kubernetes
I0113 15:54:58.192606       1 main.go:305] ./k8s-device-plugin version v1.18.1-27-g5eb0a0f
I0113 15:54:58.192608       1 main.go:305] hwloc: _VERSION: 2.9.2, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I0113 15:54:58.192613       1 manager.go:42] Starting device plugin manager
I0113 15:54:58.192617       1 manager.go:46] Registering for system signal notifications
I0113 15:54:58.192738       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
panic: runtime error: invalid memory address or nil pointer dereference
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x5333f9]

goroutine 1 [running]:
github.com/fsnotify/fsnotify.(*Watcher).Close(0x0)
    /go/src/github.com/RadeonOpenCompute/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:75 +0x19
panic({0x8e5fc0, 0xd3af30})
    /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/fsnotify/fsnotify.(*Watcher).isClosed(...)
    /go/src/github.com/RadeonOpenCompute/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:66
github.com/fsnotify/fsnotify.(*Watcher).Add(0x0, {0x986f88?, 0xc000116000?})
    /go/src/github.com/RadeonOpenCompute/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:94 +0x6b
github.com/kubevirt/device-plugin-manager/pkg/dpm.(*Manager).Run(0xc00006be70)
    /go/src/github.com/RadeonOpenCompute/k8s-device-plugin/vendor/github.com/kubevirt/device-plugin-manager/pkg/dpm/manager.go:55 +0x226
main.main()
    /go/src/github.com/RadeonOpenCompute/k8s-device-plugin/cmd/k8s-device-plugin/main.go:331 +0x4bb
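
Reading the trace: both (*Watcher).Add and the later (*Watcher).Close are called with a receiver of 0x0, so the watcher itself is nil by the time dpm.(*Manager).Run uses it. That would be consistent with the watcher constructor having failed and its error not being checked, though nothing in the log confirms this. Below is a minimal Go sketch of that failure mode and of the usual guard. The watcher type, newWatcher and watchDir are hypothetical stand-ins written for illustration, not the real fsnotify or device-plugin-manager code, and the directory path is only assumed to be the one being watched.

```go
package main

import (
	"errors"
	"fmt"
)

// watcher is a hypothetical stand-in for fsnotify.Watcher, with just enough
// state to show the failure mode. It is not the real type.
type watcher struct {
	paths []string
}

// Add dereferences its receiver, so calling it on a nil *watcher panics with
// "invalid memory address or nil pointer dereference", as in the trace.
func (w *watcher) Add(path string) error {
	w.paths = append(w.paths, path)
	return nil
}

// newWatcher mimics a constructor that can fail and return a nil watcher.
// The error text is made up for the example.
func newWatcher(fail bool) (*watcher, error) {
	if fail {
		return nil, errors.New("too many open files")
	}
	return &watcher{}, nil
}

// watchDir checks the constructor error before touching the watcher, which
// turns a would-be nil pointer panic into an ordinary, readable error.
func watchDir(path string, fail bool) error {
	w, err := newWatcher(fail)
	if err != nil {
		return fmt.Errorf("creating watcher for %s: %w", path, err)
	}
	return w.Add(path)
}

func main() {
	// Assumed path: the usual kubelet device plugin directory.
	const dir = "/var/lib/kubelet/device-plugins"
	fmt.Println(watchDir(dir, false)) // constructor succeeds
	fmt.Println(watchDir(dir, true))  // constructor fails, error is reported
}
```

With a guard like this the pod would still fail to start, but the log would name the underlying error instead of a bare SIGSEGV.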

Steps to Reproduce:

Expected Behavior: The AMD GPU device plugin should install without errors and run successfully in the Kubernetes cluster.

Actual Behavior: The pod crashes immediately after starting, with the log indicating a segmentation fault.

Extra information

mishak87 commented 4 months ago

I get a similar error with the latest version:

I0525 16:11:36.754627       1 main.go:305] AMD GPU device plugin for Kubernetes
I0525 16:11:36.754694       1 main.go:305] ./k8s-device-plugin version v1.25.2.7-7-g813f150
I0525 16:11:36.754701       1 main.go:305] hwloc: _VERSION: 2.10.0, _API_VERSION: 0x00020800, _COMPONENT_ABI: 7, Runtime: 0x00020800
I0525 16:11:36.754709       1 manager.go:42] Starting device plugin manager
I0525 16:11:36.754719       1 manager.go:46] Registering for system signal notifications
I0525 16:11:36.754985       1 manager.go:52] Registering for notifications of filesystem changes in device plugin directory
panic: runtime error: invalid memory address or nil pointer dereference
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x53bbf3]

goroutine 1 [running]:
github.com/fsnotify/fsnotify.(*Watcher).isClosed(...)
    /go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:66
github.com/fsnotify/fsnotify.(*Watcher).Close(0x0)
    /go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:75 +0x13
panic({0x8cf480?, 0xd53320?})
    /usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/fsnotify/fsnotify.(*Watcher).isClosed(...)
    /go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:66
github.com/fsnotify/fsnotify.(*Watcher).Add(0x0, {0x977505?, 0xc000204010?})
    /go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/fsnotify/fsnotify/inotify.go:94 +0x57
github.com/kubevirt/device-plugin-manager/pkg/dpm.(*Manager).Run(0xc000033550)
    /go/src/github.com/ROCm/k8s-device-plugin/vendor/github.com/kubevirt/device-plugin-manager/pkg/dpm/manager.go:55 +0x21f
main.main()
    /go/src/github.com/ROCm/k8s-device-plugin/cmd/k8s-device-plugin/main.go:331 +0x4e9

EDIT: Installing the latest ROCm and doing a hard restart fixed the issue. https://repo.radeon.com/amdgpu-install/latest/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb

mishak87 commented 3 weeks ago

After a hard restart it happened again :-(
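
One possible explanation, not confirmed anywhere in this thread: on Linux, creating an inotify instance fails once the per-user limit on inotify instances is reached, and a watcher that cannot be created would match the nil receiver in the traces above. A restart temporarily helping would fit that, but it is only a guess. The small Go sketch below prints the node's current inotify limits so they can be compared across working and failing nodes. parseLimit and readLimit are hypothetical helper names; the /proc/sys/fs/inotify entries are the standard Linux ones.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseLimit converts the text of a procfs entry such as "128\n" to an int.
func parseLimit(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// readLimit reads one inotify limit from procfs. It only works on Linux and
// returns an error if the entry cannot be read.
func readLimit(name string) (int, error) {
	data, err := os.ReadFile("/proc/sys/fs/inotify/" + name)
	if err != nil {
		return 0, err
	}
	return parseLimit(string(data))
}

func main() {
	for _, name := range []string{"max_user_instances", "max_user_watches"} {
		limit, err := readLimit(name)
		if err != nil {
			fmt.Printf("%s: cannot read: %v\n", name, err)
			continue
		}
		fmt.Printf("%s = %d\n", name, limit)
	}
}
```

Run it on the affected node itself (or in a pod that sees the host's /proc/sys). If max_user_instances is low and many other processes on the node use inotify, that would be worth ruling out before digging into permissions.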