It's a driver problem: the sleep/wake cycle fails, which causes ROCm programs to freeze. As a workaround, try calling rocm-smi once per second.
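For anyone who wants to try that workaround, a minimal sketch (`watch` is just a convenience; a plain shell loop works the same):

```bash
# Poll rocm-smi once per second, as suggested above, to keep the GPU from
# sitting in the failing sleep/wake path (a workaround, not a fix).
watch -n 1 rocm-smi

# Equivalent plain loop:
# while true; do rocm-smi > /dev/null; sleep 1; done
```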
@sdli1995, I think this may indeed be a driver problem; however, it has nothing to do with sleep/wake. The problem appears while doing ROCm operations, such as when creating a PyTorch DataLoader.
I have the same card and am happy to replicate/debug. I am having all sorts of issues with 5.7.1 where kernel calls trigger a driver reset, and the card still hangs and blocks.
I call a hard reset with rocm-smi --gpureset -d 0, and sometimes it resets; other times it still blocks on the next launch and a full power cycle is needed.
Running the 5.7.1 repo, PyTorch nightly, Ubuntu 22.04.
Let me know what other info you need from me to help sort this out.
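Since --gpureset only sometimes recovers the card, it can help to verify afterwards whether rocm-smi itself still responds. A small sketch (the `timeout` guard is my addition, from coreutils):

```bash
# Hard-reset GPU 0 as described above, then check whether rocm-smi responds.
sudo rocm-smi --gpureset -d 0
sleep 2
timeout 10 rocm-smi || echo "rocm-smi still hangs; a full power cycle may be needed"
```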
I had a similar situation with ROCm 5.7.1 on my 7900 XTX. After a few days of trying, I almost solved the problem, but then ran into a new one.
First I tried uninstalling the driver and reinstalling it using the following command:
sudo amdgpu-install --usecase=rocm --rocmrelease=5.7.1 --no-dkms
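For completeness, a sketch of the full uninstall/reinstall cycle this implies (amdgpu-uninstall ships alongside amdgpu-install on Ubuntu; the rocminfo check at the end is my own addition as a sanity test):

```bash
sudo amdgpu-uninstall                   # remove the previously installed stack
sudo amdgpu-install --usecase=rocm --rocmrelease=5.7.1 --no-dkms
rocminfo | grep -i gfx                  # sanity check: the GPU agent (e.g. gfx1100) should be listed
```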
Then I ran into the following problem while running my PyTorch code:
[ 4280.134636] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process python pid 4860 thread python pid 4860)
[ 4280.134637] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x00007f8e3a509000 from client 10
[ 4280.134638] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00801A31
[ 4280.134639] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: SDMA0 (0xd)
[ 4280.134640] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
[ 4280.134640] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
[ 4280.134641] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 4280.134642] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 4280.134642] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
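If you want to catch these faults as they happen, one way (my suggestion, not from the thread) is to follow the kernel log while the job runs:

```bash
# Follow the kernel ring buffer with human-readable timestamps and filter
# for amdgpu page-fault lines like the ones quoted above.
sudo dmesg -wT | grep -i --line-buffered "page fault"
```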
Fortunately, this no longer seems to interrupt the code I'm running.
@cwjyu, I am not seeing any crashes in the logs anymore after reinstalling ROCm as you suggested; however, I am still observing the code getting stuck and rocm-smi reporting 100% GPU usage.
Got the same problem with the iGPU in a 7840HS while running waifu2x with OpenCL.
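For the OpenCL case, a quick sanity check (my addition, assuming clinfo is installed) that the iGPU is exposed at all:

```bash
# List OpenCL platforms/devices; the 780M iGPU of the 7840HS should appear here.
clinfo | grep -E "Platform Name|Device Name"
```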
@arch-user-france1 After I uninstalled the ROCm driver, I also removed the 6.0.26 kernel it installed. Here is the version of the operating system I use:
Linux wjy-MS-7D97 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
And I am running my code in a separate character (virtual) terminal, because this driver causes my graphical interface to die. I have to use Ctrl+Alt+F1 and restart the graphical interface from there, and in serious cases even SSH in to kill the graphical session.
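On a systemd distro, "restart the graphical interface" usually comes down to restarting the display manager. A sketch (the unit name is an assumption; display-manager is the standard alias for gdm, sddm, etc.):

```bash
# Restart the display manager from a virtual terminal or over SSH when the
# desktop session dies; this kills the graphical session as described above.
sudo systemctl restart display-manager
```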
It seems to work if the drivers are installed with the --no-dkms option. On my Ubuntu 22.04 system, the DKMS module does not compile anymore. I think this problem has nothing to do with ROCm itself but rather with the installation program, which is why I'll close this issue. I didn't have to remove any kernels, and I did not restart my system; I merely ran sudo modprobe amdgpu after installing the drivers the apparently correct way.
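A consolidated sketch of that workaround (the dkms status line is my addition, to confirm whether the DKMS build is the broken part):

```bash
dkms status                                    # shows the failing amdgpu DKMS build, if any
sudo amdgpu-install --usecase=rocm --no-dkms   # userspace stack only, skip the DKMS module
sudo modprobe amdgpu                           # load the in-tree driver without rebooting
```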
@arch-user-france1 How do I do this on Arch Linux? I'm not using the DKMS driver but still get this issue.
Update: I switched the kernel from 6.6.1.arch1-1 to 6.5.2.8.realtime1-1-rt, and the issue was gone.
Update: I still hit this issue. I have tried adding iommu=soft and sg_display=0 to the kernel parameters; the latter seems to work.
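For reference, kernel parameters like these are typically added via the bootloader. A sketch for GRUB on Arch (on Ubuntu, run sudo update-grub instead; note that sg_display is an amdgpu module parameter, so on the kernel command line it is usually spelled amdgpu.sg_display=0):

```bash
# 1. Append the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#      GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=soft amdgpu.sg_display=0"
# 2. Regenerate the GRUB config and reboot:
sudo grub-mkconfig -o /boot/grub/grub.cfg
```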
Leaving some links here for the next person who has this problem:
See also https://github.com/ROCm/ROCm/issues/2642 which links to https://github.com/ROCm/ROCm/issues/2596 which links to the kernel patch https://lists.freedesktop.org/archives/amd-gfx/2023-October/100298.html which resolved a similar resetting problem I was having.
The issue is closed, but I would like to confirm that this bug no longer occurs for me after upgrading to ROCm 6.0 (my chip is a Radeon RX 7900 XTX). Stable Diffusion with torch-2.3.0+rocm5.7 still works on ROCm 6.0, and without crashes now.
The non-critical GCVM_L2_PROTECTION_FAULT_STATUS errors (mentioned by @cwjyu in the comments above) are still there, but they no longer force me to restart after a failed GPU reset.
Unfortunately, there seem to be driver issues relating to GPU resets on the AMD Radeon RX 7900 XT, and I have not managed to run any model without the graphics card crashing after some time. Language models do work most of the time, but they are evaluated more slowly than on the CPU. The resets may result in a completely frozen user interface, or in the Python process running forever (see log).
Result:
The model was run on FastAI and PyTorch nightly (for ROCm 5.7). The device selected by FastAI is cuda:0. The following information was collected from the logs after the occurrence:
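(A quick way, my addition rather than part of the original report, to confirm what the ROCm build of PyTorch exposes as cuda:0:)

```bash
# ROCm builds of PyTorch present the GPU through the torch.cuda API.
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```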
rocm-smi reports the GPU at 100% usage after the crash, which commonly happens. Killing the Python process does not decrease the reported utilization. The GPU does not actually seem to be 100% busy, as its clock frequency stays low.
Suspending the system may temporarily unfreeze these values. Subsequent runs of the code without suspending first result in the same error, but without GPU resets and with one thread at 100% CPU usage.
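The suspend workaround mentioned above, spelled out (assumes systemd; suspend-to-RAM sometimes clears the stale readings):

```bash
sudo systemctl suspend   # suspend-to-RAM; wake the machine afterwards
# After resume, check whether the stale 100% utilization reading cleared:
rocm-smi
```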