comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

AMD SVD Performance / Stability #2304

Open Gracana opened 10 months ago

Gracana commented 10 months ago

I'm having trouble running the stable video diffusion examples on my machine.

OS: Arch Linux
CPU: AMD Ryzen 9 7950X
RAM: 64GB
GPU: AMD Radeon RX 7900 XTX
VRAM: 24GB
Software: ComfyUI 329c57199302f6b9ccfebb86c96e937c386da92f, ROCm 5.6... Wait. See follow-up at the end.

When I tried running the 14 frame example, it was very slow and my GPU eventually locked up. dmesg shows this:

Dec 15 17:47:10 hawk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=20268, emitted seq=20269
Dec 15 17:47:10 hawk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1718 thread gnome-shel:cs0 pid 1755
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow start
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu: recover vram bo from shadow done
Dec 15 17:47:10 hawk kernel: [drm] ring gfx_32779.1.1 was added
Dec 15 17:47:10 hawk kernel: [drm] ring compute_32779.2.2 was added
Dec 15 17:47:10 hawk kernel: [drm] ring sdma_32779.3.3 was added
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40 vmid:8 pasid:32779, for process  pid 0 thread  pid 0)
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000800100269000 from client 10
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00840C50
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu:          Faulty UTCL2 client ID: CPG (0x6)
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu:          MORE_FAULTS: 0x0
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu:          WALKER_ERROR: 0x0
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu:          PERMISSION_FAULTS: 0x5
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu:          MAPPING_ERROR: 0x0
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu:          RW: 0x1
Dec 15 17:47:10 hawk kernel: [drm] ring gfx_32779.1.1 ib test pass
Dec 15 17:47:10 hawk kernel: [drm] ring compute_32779.2.2 ib test pass
Dec 15 17:47:10 hawk kernel: [drm] ring sdma_32779.3.3 ib test pass
Dec 15 17:47:10 hawk kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset(2) succeeded!
Dec 15 17:48:00 hawk gnome-shell[1718]: amdgpu: The CS has been rejected (-125), but the context isn't robust.
Dec 15 17:48:00 hawk gnome-shell[1718]: amdgpu: The process will be terminated.
Dec 15 17:48:00 hawk kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
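
For anyone comparing their own logs against the one above, here is a minimal Python sketch that pulls the ring timeouts out of dmesg or journal text and pairs each with the process the kernel names on the following line. The function name `summarize_timeouts` and both regular expressions are illustrative assumptions based only on the line formats quoted in this thread; other kernel versions may word these messages differently.

```python
import re

# Matches the "ring ... timeout" lines emitted by amdgpu_job_timedout.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
# Matches the follow-up line naming the process whose job timed out.
PROCESS_RE = re.compile(
    r"Process information: process (?P<name>\S+) pid (?P<pid>\d+)"
)


def summarize_timeouts(log_text):
    """Return one dict per ring timeout, with the named process if present."""
    events = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            events.append({
                "ring": m["ring"],
                "signaled": int(m["signaled"]),
                "emitted": int(m["emitted"]),
                "process": None,
            })
            continue
        m = PROCESS_RE.search(line)
        # Attach the process to the most recent timeout that lacks one.
        if m and events and events[-1]["process"] is None:
            events[-1]["process"] = (m["name"], int(m["pid"]))
    return events


# Two lines copied from the log above.
sample = """\
Dec 15 17:47:10 hawk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=20268, emitted seq=20269
Dec 15 17:47:10 hawk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1718 thread gnome-shel:cs0 pid 1755
"""
print(summarize_timeouts(sample))
```

On the sample, the process reported for the timed-out job is gnome-shell rather than the ComfyUI Python process.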

That was after I added the iommu=soft kernel parameter. Before that, I would see IO_PAGE_FAULT in the logs, among other things. I'm not sure whether those details are interesting enough to post.
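
To confirm which kernel parameters are actually active (for example, whether iommu=soft took effect after a reboot), a small Python sketch can read /proc/cmdline. The helper names `parse_cmdline` and `read_cmdline` are made up for illustration, and the command line in the example call is hypothetical, not taken from either machine in this thread. The parser splits on whitespace only, so quoted values containing spaces are not handled.

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    # /proc/cmdline holds the parameters the running kernel booted with (Linux only).
    return parse_cmdline(Path(path).read_text())


# Hypothetical command line, used only to show the parsing.
example = "root=/dev/nvme0n1p2 rw iommu=soft quiet"
print(parse_cmdline(example).get("iommu", "(not set)"))
```

On a real system, `read_cmdline().get("iommu")` gives the value the kernel was booted with, or None if the parameter is absent.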

The previous GPU crash (before setting iommu=soft) started with this:

Dec 14 21:40:41 hawk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=1788838, emitted seq=1788840
Dec 14 21:40:42 hawk kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process kgx pid 2013 thread kgx:cs0 pid 2414

I'm also getting ~60s/it, which seems terribly slow. In regular sd-1.5 I get 10-15it/s for smaller (512x512) images, and it works great generating image after image.

If I run SVD with reduced settings, I can get through the process and produce a video, but it's still very slow.

I tried all the different cross-attention methods, tried forcing fp16 and fp32, tried highvram and disable-smart-memory. Nothing changed the speed appreciably.

Any idea what might be going on here?

[Update] Ok, when I went to write down what versions of software I was running, I noticed I had ROCm 5.6. I installed 5.7, and now I get 3-4s/it in KSampler, and the whole prompt finished in 181s. I think this is solved, but I'll submit the issue anyway, if only for the record.
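
Since the ROCm version turned out to matter here, one quick check is to ask PyTorch which HIP/ROCm version it was built against. The sketch below is a rough aid under stated assumptions: `torch.version.hip` is a string on ROCm builds of PyTorch and None otherwise, the helper names are invented for illustration, and the version string in the docstring is only an example of the format. It reports the version of the PyTorch build, which may differ from the system-wide ROCm packages, and the thread does not say which of the two was upgraded.

```python
import re


def parse_version(text):
    """Extract (major, minor) from a string such as '5.7.31921-d1770ee1b'."""
    m = re.match(r"(\d+)\.(\d+)", text or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def pytorch_hip_version():
    """Return the (major, minor) HIP version of the PyTorch build, or None."""
    try:
        import torch
    except ImportError:
        return None
    # torch.version.hip is None on CUDA and CPU-only builds.
    return parse_version(getattr(torch.version, "hip", None))


found = pytorch_hip_version()
if found is None:
    print("No ROCm build of PyTorch detected")
elif found < (5, 7):
    print("PyTorch was built against ROCm", found, "which is older than 5.7")
else:
    print("PyTorch was built against ROCm", found)
```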

Gracana commented 10 months ago

After a hot start with everything loaded, it's finishing the prompt in 80s. This is fantastic!

lubosz commented 9 months ago

Thanks for documenting this. I am seeing the same GPU lockup on ROCm 5.7.1 on an RX 6900 when running the 14-frame SVD example workflow from the documentation: https://comfyanonymous.github.io/ComfyUI_examples/video/

Are you able to run this with default settings? Do you have any kernel parameters set? What are the reduced settings of SVD that avoided the lockup?

My dmesg:

[  662.787190] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=45156, emitted seq=45157
[  662.787695] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 4770 thread gnome-shel:cs0 pid 4814
[  662.788165] amdgpu 0000:0d:00.0: amdgpu: GPU reset begin!
[  662.788192] amdgpu: Failed to suspend process 0x800c
[  662.803165] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[  663.173290] amdgpu 0000:0d:00.0: amdgpu: MODE1 reset

kernel: 6.6.8-arch1-1

lubosz commented 9 months ago

After reducing some settings I am able to complete the 14-frame workflow sometimes.

But this is not enough. I also need to set power_dpm_force_performance_level to high.

# card1 is the GPU's index on my machine; it may differ on other systems
echo high | sudo tee /sys/class/drm/card1/device/power_dpm_force_performance_level
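
Because the card index varies between machines, it can help to list the current setting for every card before writing to one. This is a minimal read-only sketch in Python; the function name and the `drm_root` parameter are assumptions added for illustration and for testing against a fake directory tree. It only reads the sysfs file and changes nothing.

```python
import re
from pathlib import Path

# Plain card entries only (card0, card1, ...), not connectors like card1-DP-1.
CARD_RE = re.compile(r"card\d+$")


def read_performance_levels(drm_root="/sys/class/drm"):
    """Map each DRM card name to its power_dpm_force_performance_level."""
    levels = {}
    for card in sorted(Path(drm_root).glob("card*")):
        if not CARD_RE.match(card.name):
            continue
        knob = card / "device" / "power_dpm_force_performance_level"
        if knob.is_file():  # not every driver exposes this file
            levels[card.name] = knob.read_text().strip()
    return levels


print(read_performance_levels())
```

A card that reports a value other than `high` has not had the setting above applied, or has been reset since.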

Even then, interacting with the desktop while the workflow runs can still lead to the GPU lockup. Or maybe it's just luck: sometimes it also locks up with these exact settings.

Also, I am only getting ~70s/it.

So this seems similar to the situation you experienced before upgrading to ROCm 5.7, except that I am already on that version. Did you change anything else that might have affected this behaviour?

The VRAM usage seems to max out at ~12.5GiB (out of 15.984 GiB), and the utilization is always at 100% when running the workflow.

Gracana commented 9 months ago

I don't think I did anything else to make it work. You definitely seem to have the same symptoms as I did originally, but ROCm 5.7 solved it for me.

lubosz commented 9 months ago

The issue only occurs when a desktop session is running. It doesn't seem to matter whether it's Wayland or X11. With no desktop running, I was able to complete the workflow, though with undesirable performance: the full-resolution, unmodified 14-frame workflow finished in 2974.10 seconds (that's about 49.6 minutes), at about 140s/it, with 10.779 / 15.984 GiB memory usage.

Setting the AMDGPU power profile to COMPUTE didn't seem to have any impact on the issue. https://wiki.archlinux.org/title/AMDGPU#Power_profiles

Currently setting up ROCm 6.0 to see if that helps.

lubosz commented 9 months ago

It took some time to set up, as packaging wasn't there yet, but I was able to test this on ROCm 6.0.0. The issue is resolved: performance is now at 7-10s/it and the lockup no longer occurs.
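
To put the figures reported in this thread side by side, here is a short worked calculation in Python. It uses only numbers quoted in the comments above; the `speedup` helper is an illustrative name. The two setups differ in GPU and settings, so the ratios are not a like-for-like benchmark.

```python
def speedup(before_s_per_it, after_s_per_it):
    """How many times faster one sampler iteration became."""
    return before_s_per_it / after_s_per_it


# RX 7900 XTX, ROCm 5.6 -> 5.7: ~60 s/it down to 3-4 s/it
print(speedup(60, 4), speedup(60, 3))     # 15.0 20.0

# RX 6900, ROCm 5.7.1 (no desktop) -> 6.0.0: ~140 s/it down to 7-10 s/it
print(speedup(140, 10), speedup(140, 7))  # 14.0 20.0

# The 2974.10 s run on ROCm 5.7.1, expressed in minutes
print(round(2974.10 / 60, 1))             # 49.6
```

Both upgrades correspond to roughly a 14x to 20x improvement per iteration.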