Open Jmvars opened 7 months ago
Update:
My graphics driver seemingly crashed mid-generation and now I have the issue again, so it's possibly graphics-related.
Hiya, I've been dealing with this exact issue. If you run `sudo dmesg` in a console, can you check whether you have these errors (posted below)? I'm running the 7900 XTX and semi-resolved this as follows: I installed ROCm (and drivers etc.) while running the Linux 6.2.0-37 kernel, then upgraded to the latest Ubuntu OEM kernel (6.5) by running `sudo apt install linux-oem-22.04d`.
Keep in mind that this kernel version is not supported by ROCm, but if you have the same issue it'll improve stability by a lot. It's not a perfect solution (it's a bit of a franken-solution), but hopefully this helps.
```
[ 161.688134] amdgpu 0000:2f:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xdffee000 flags=0x0020]
[ 163.625128] amdgpu 0000:2f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:158 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 163.625136] amdgpu 0000:2f:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 163.625139] amdgpu 0000:2f:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xdc1eb000 flags=0x0000]
[ 163.625140] amdgpu 0000:2f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B3C
[ 163.625142] amdgpu 0000:2f:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 163.625144] amdgpu 0000:2f:00.0: amdgpu: MORE_FAULTS: 0x0
[ 163.625146] amdgpu 0000:2f:00.0: amdgpu: WALKER_ERROR: 0x6
[ 163.625147] amdgpu 0000:2f:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 163.625150] amdgpu 0000:2f:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 163.625152] amdgpu 0000:2f:00.0: amdgpu: RW: 0x0
```
Do I run this at any time, or only after the fault happens? Maybe a stupid question, I just want to make sure.
Shouldn't matter. Most of the time (if you share the error I had) the fault will show in your dmesg pretty much from the moment you start 1111.
EDIT: For your information, it does not change anything in your system; it just gives you a readout of a log.
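For anyone who would rather not scroll through the whole dmesg output by eye, here is a rough sketch that filters a kernel log for the kinds of amdgpu fault lines quoted in this thread. The function names `find_amdgpu_faults` and `read_dmesg` are made up for illustration, and the patterns are taken only from the log excerpt above, so other fault types will not be caught.

```python
import re
import subprocess

# Patterns copied from the log excerpt in this thread (an assumption:
# other amdgpu faults may look different and will not match).
FAULT_PATTERNS = (
    re.compile(r"amdgpu .*IO_PAGE_FAULT"),
    re.compile(r"amdgpu: \[gfxhub\] page fault"),
    re.compile(r"GCVM_L2_PROTECTION_FAULT_STATUS"),
)


def find_amdgpu_faults(log_text):
    """Return the lines of log_text that match any fault pattern."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in FAULT_PATTERNS)
    ]


def read_dmesg():
    """Run dmesg and return its output as text (read-only; may need root)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

To use it, call `find_amdgpu_faults(read_dmesg())` from a Python session started with enough privileges to read the kernel log. An empty list only means none of these specific patterns appeared, not that the GPU is healthy.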
@hqnicolas Thanks, it'll have to wait as my drive died and I'm waiting on a replacement.
No way! ASRock?
Is there an existing issue for this?
What happened?
WebUI runs flawlessly for a while but then becomes completely unresponsive: embeddings and LoRAs load indefinitely, and when generating, it never starts, nothing happens in the terminal, and no progress bar shows up. I have not figured out what triggers it, but it seems to happen after WebUI is restarted and/or the PC is rebooted. The only fix I have found is completely reinstalling the entire operating system.
Steps to reproduce the problem
What should have happened?
WebUI should be responsive after being restarted/machine is rebooted.
Sysinfo
sysinfo-2023-11-27-12-46.txt
What browsers do you use to access the UI ?
Mozilla Firefox
Console logs
Additional information
Ubuntu 22.04, AMD RX 7900 XT, PyTorch 2.0.1+ROCm 5.7
How I install WebUI is a combination of this and this guide:
1. Install driver as per the AMD guide
I tried a whole bunch of installation guides, combining several of them, and this is the only approach I found that actually works with my GPU. As I said previously, the only fix I found is to completely reinstall Ubuntu. I tried removing the WebUI folder and uninstalling the drivers. This should get rid of both PyTorch and Torchvision, and I confirmed so with pip outside of the virtual environment. I then tried rebooting, reinstalling the drivers, and reinstalling WebUI the exact same way, but it doesn't work; only reinstalling the entire OS works.
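As a cross-check on the pip confirmation, the following is a small sketch that asks the running Python interpreter whether `torch` or `torchvision` are still installed. The helper names `installed_version` and `leftover_packages` are hypothetical, and it only inspects the interpreter it is run with, so it would need to be run once with the system Python and once inside the virtual environment.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def leftover_packages(names=("torch", "torchvision")):
    """Map each still-installed package name to its version string."""
    found = {}
    for name in names:
        version = installed_version(name)
        if version is not None:
            found[name] = version
    return found
```

An empty dictionary from `leftover_packages()` suggests both packages are gone from that interpreter. It says nothing about driver or ROCm state, which is where the replies in this thread point.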