Closed devurandom closed 1 month ago
I'm using my vacation to catch up with a long backlog of things I didn't get to. Sorry about the (very) late response.
Is this still an issue? I see you were using a 5.7 kernel, which wasn't supported by our DKMS driver at the time. Were you trying to backport it? Or were you using the KFD version included in the 5.7 kernel? I see the error "Failure to set tba address. error -1." At that point the process creation in KFD should have failed and any further ioctl call should have been impossible. I think we had some bugs handling error returns from kfd_create_process at some point, but those should have been fixed by now. That wouldn't fix the underlying TBA allocation error, but it would cause all ROCm apps to fail during initialization and prevent the kernel oops from happening.
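To make the expected behavior concrete, here is a toy model of the error path described above. It is only a sketch, not the real KFD code: the class `ToyKfd`, its methods, and the choice of errno values are hypothetical and stand in for the driver's open and ioctl handling. The point is that a failure during process creation is propagated to the caller, so no later ioctl can reach a half-initialized process.

```python
import errno


class ToyKfd:
    """Toy stand-in for the driver (hypothetical, not the real KFD code)."""

    def __init__(self, tba_setup_ok):
        # Simulates whether the TBA setup step succeeds during process creation.
        self.tba_setup_ok = tba_setup_ok
        self.processes = set()

    def open(self, pid):
        # Models opening the device, which triggers process creation.
        if not self.tba_setup_ok:
            # Propagate the failure instead of registering a broken process.
            # The errno value is an arbitrary choice for this sketch.
            raise OSError(errno.EPERM, "Failure to set tba address")
        self.processes.add(pid)

    def ioctl(self, pid, request):
        # Any ioctl from a caller without a registered process is rejected.
        if pid not in self.processes:
            raise OSError(errno.EBADF, "no process registered for this caller")
        return 0


if __name__ == "__main__":
    kfd = ToyKfd(tba_setup_ok=False)
    try:
        kfd.open(pid=1234)
    except OSError as exc:
        print("open failed:", exc.strerror)
    try:
        kfd.ioctl(pid=1234, request="hypothetical-request")
    except OSError as exc:
        print("ioctl rejected:", exc.strerror)
```

With that handling in place, an application would fail cleanly during initialization rather than continuing into the state that produced the oops.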
Your system is also "interesting" because you have two different GPUs in it: a GFXv9 integrated GPU and a GFXv8 discrete GPU. We have some improvements for this situation in the current Thunk. The Thunk got better at supporting such mixed configurations since ROCm 3.9, by treating APUs like dGPUs in such configurations. But I'm not sure if all the kinks are worked out yet. Would be good to hear an update.
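If it helps when reporting back, the following sketch lists the GPU nodes that KFD exposes, which should show both GPUs in a mixed setup like this one. It assumes the topology is published under `/sys/class/kfd/kfd/topology/nodes`, that each node has a `properties` file of `key value` lines, and that GPU nodes can be told apart from CPU-only nodes by a non-zero `simd_count`. The helper names are made up for this example, and the exact set of properties may differ between kernel versions.

```python
from pathlib import Path

# Assumed sysfs location of the KFD topology.
KFD_NODES = Path("/sys/class/kfd/kfd/topology/nodes")


def parse_properties(text):
    """Parse 'key value' lines into a dict, keeping only integer values."""
    props = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            props[parts[0]] = int(parts[1])
    return props


def list_gpu_nodes(root=KFD_NODES):
    """Return (node_name, properties) for nodes that report SIMD units."""
    gpus = []
    if not root.is_dir():
        return gpus  # KFD not loaded, or topology not exposed here
    for node in sorted(root.iterdir()):
        try:
            props = parse_properties((node / "properties").read_text())
        except OSError:
            continue  # unreadable node, skip it
        if props.get("simd_count", 0) > 0:
            gpus.append((node.name, props))
    return gpus


if __name__ == "__main__":
    for name, props in list_gpu_nodes():
        print("node", name,
              "device_id", props.get("device_id"),
              "simd_count", props.get("simd_count"))
```

An empty result would suggest that KFD did not register any GPU, which is useful to know before looking at the Thunk.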
Thanks, and happy holidays.
Closing as there is no update from the user. Thanks.
System information
rocminfo is at version 3.5.0.

Problem
When I run rocminfo on my system, I see:

rocminfo is not SIGKILL-able at that point. This is reproducible every time I run rocminfo.

Logs
dmesg prints during execution of rocminfo:

Other information
I also see exceptions and segfaults in Clover and ROCm's OpenCL implementation when executing clinfo:

Until recently rocminfo would segfault and eventually bring the whole kernel down with it:

Previously also HIP would freeze the system, possibly because it invokes rocminfo in the background: https://github.com/ROCm-Developer-Tools/HIP/issues/2132