ROCm / ROCm

[Issue]: Soft lockup when having high vram usage #3580

Open hartmark opened 1 month ago

hartmark commented 1 month ago

Problem Description

I have an AMD Radeon RX 7800 XT 16 GB, but I couldn't select it in the list.

I'm having problems with soft lockups when generating Stable Diffusion images. I need to restart lightdm or use sysrq-r,e,i to kill all processes.

Operating System

Arch Linux

CPU

AMD Ryzen 9 5900X 12-Core Processor

GPU

AMD Radeon RX 7900 XT

ROCm Version

ROCm 6.2.0

ROCm Component

No response

Steps to Reproduce

I'm running ComfyUI in a docker container. I have created a repo for the docker compose script. You can use this to reproduce the problem.

  1. Clone https://github.com/hartmark/sd-rocm
  2. Download the AbsoluteReality model from https://civitai.com/models/81458/absolutereality and save it in data/checkpoints
  3. Start the Docker container with docker-compose up
  4. Wait until ComfyUI has started
  5. Go to ComfyUI at http://localhost
  6. Load this workflow and run it: workflow.json

It crashes on the first VAE decode.

If I generate a smaller image, 1024x1024 or lower, it will sometimes complete the whole workflow and sometimes fail on the second KSampler.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents
==========


Agent 1


Name: AMD Ryzen 9 5900X 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 5900X 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3700
BDFID: 0
Internal Node ID: 0
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32767952(0x1f3ffd0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32767952(0x1f3ffd0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32767952(0x1f3ffd0) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx1101
Uuid: GPU-8e404be42e45fb92
Marketing Name: AMD Radeon RX 7800 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 4096(0x1000) KB
L3: 65536(0x10000) KB
Chip ID: 29822(0x747e)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2169
BDFID: 2560
Internal Node ID: 1
Compute Unit: 60
SIMDs per CU: 2
Shader Engines: 3
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 222
SDMA engine uCode:: 22
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1101
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Done

Additional Information

I previously filed a bug with the AMDGPU project, but I think the issue is ROCm-related: https://gitlab.freedesktop.org/drm/amd/-/issues/3548

2eQTu commented 1 month ago

FYI to AMD folks: the POC issue bot assigned the wrong hardware label. Per the output above, the submitter has an "AMD Radeon RX 7800 XT", but no such label exists, so the bot assigned "AMD Radeon RX 7900 XT" instead.

harkgill-amd commented 3 weeks ago

Hi @hartmark, thank you for providing the steps to reproduce. We will try to reproduce the issue internally and investigate it from there.

hartmark commented 3 weeks ago

> Hi @hartmark, thank you for providing the steps to reproduce. We will try to reproduce the issue internally and investigate it from there.

Cool, just poke me if you need more details regarding my setup.

hartmark commented 3 weeks ago

I have noticed that this is also a reliable way to reproduce the issue: https://github.com/ROCm/ROCm/issues/2196#issuecomment-2295441030

hartmark commented 3 weeks ago

I saw that the ROCm 6.2 PyTorch libraries have been released, so I have updated my docker-compose repo.

I may also have found a workaround for the lockups: ComfyUI released a --reserve-vram flag, and setting it to 6.0 seems to have fixed my issue with random lockups: --reserve-vram 6

https://github.com/comfyanonymous/ComfyUI/commit/045377ea893d0703e515d87f891936784cb2f5de

Even though I have set the reserve limit to 6 GB, almost all GPU memory is still in use: [image]

This is my full ComfyUI startup line for reference:

PYTORCH_HIP_ALLOC_CONF=expandable_segments:True python main.py --listen 0.0.0.0 --port 80 --use-split-cross-attention --front-end-version Comfy-Org/ComfyUI_frontend@latest --reserve-vram 6
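
For scale, the numbers in this comment can be worked out from the rocminfo output in the issue body. The following is a minimal Python sketch, not ComfyUI's actual accounting: it assumes that rocminfo's "KB" means 1024 bytes and that --reserve-vram counts in units of 1024^3 bytes. The function names are made up for illustration.

```python
import re

KIB_PER_GIB = 1024 * 1024


def parse_pool_size_kib(line: str) -> int:
    """Extract the decimal KB value from a rocminfo pool line such as
    'Size: 16760832(0xffc000) KB'."""
    match = re.search(r"Size:\s*(\d+)\(0x[0-9a-fA-F]+\)\s*KB", line)
    if match is None:
        raise ValueError(f"not a rocminfo pool size line: {line!r}")
    return int(match.group(1))


def remaining_budget_gib(pool_kib: int, reserve_gib: float) -> float:
    """Pool size minus the reserved amount, in GiB (assumed semantics)."""
    return pool_kib / KIB_PER_GIB - reserve_gib


# GPU pool size as reported for the RX 7800 XT (Agent 2, Pool 1).
pool_kib = parse_pool_size_kib("Size: 16760832(0xffc000) KB")
total_gib = pool_kib / KIB_PER_GIB
budget_gib = remaining_budget_gib(pool_kib, 6.0)

print(f"total {total_gib:.2f} GiB, budget after reserve {budget_gib:.2f} GiB")
```

Under those assumptions the card exposes just under 16 GiB, and a 6 GiB reserve leaves roughly 10 GiB as a planning figure. The arithmetic alone does not explain why the monitor still showed nearly all memory in use.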

hartmark commented 3 weeks ago

I may have been too quick to hope for success. After the KSampler step completed, it sat on "VAE decode" for ages, and after a while I got a lockup, but this time it seems the Python libs were unhappy.

See the attached journalctl log: journalctl.txt

hartmark commented 3 weeks ago

I have tested some more, and it seems that ComfyUI now just runs out of memory and doesn't try to fall back to system RAM.

I have tried --lowvram, and still no system RAM is used.

hartmark commented 2 weeks ago

It seems I still have problems where my whole computer locks up. Not even altgr-rei (the sysrq-r,e,i sequence) lets me recover, and there's nothing in the kernel log.

alexxu-amd commented 2 weeks ago

Hi @hartmark, thank you for your patience. For the past few days, we've been attempting to reproduce this issue with various configurations, and we are now able to reproduce it. The lockup can happen during either the KSampler step or the VAE step when generating a 2048x2048 image. So far, the --no-half-vae flag from the webui seems to help with the VAE lockup, but the system can still lock up during KSampler. We will investigate further and keep you updated.

hartmark commented 2 weeks ago

> Hi @hartmark , Thank you for your patience. For the past few days, we've been attempting to reproduce this issue with various configurations. Just an update that we are able to reproduce this issue. The lockup can happen during the KSampler step or the VAE step when generating a 2048 * 2048 image. So far, it seems --no-half-vae flag from the webui does help with the VAE lockup, but the system could still encounter the lockup during KSampler. We will investigate further and keep you updated.

Hmm, it seems there is no --no-half-vae

stable-diffusion-comfyui-1  | usage: main.py [-h] [--listen [IP]] [--port PORT] [--tls-keyfile TLS_KEYFILE]
stable-diffusion-comfyui-1  |                [--tls-certfile TLS_CERTFILE] [--enable-cors-header [ORIGIN]]
stable-diffusion-comfyui-1  |                [--max-upload-size MAX_UPLOAD_SIZE]
stable-diffusion-comfyui-1  |                [--extra-model-paths-config PATH [PATH ...]]
stable-diffusion-comfyui-1  |                [--output-directory OUTPUT_DIRECTORY]
stable-diffusion-comfyui-1  |                [--temp-directory TEMP_DIRECTORY]
stable-diffusion-comfyui-1  |                [--input-directory INPUT_DIRECTORY] [--auto-launch]
stable-diffusion-comfyui-1  |                [--disable-auto-launch] [--cuda-device DEVICE_ID]
stable-diffusion-comfyui-1  |                [--cuda-malloc | --disable-cuda-malloc]
stable-diffusion-comfyui-1  |                [--force-fp32 | --force-fp16]
stable-diffusion-comfyui-1  |                [--bf16-unet | --fp16-unet | --fp8_e4m3fn-unet | --fp8_e5m2-unet]
stable-diffusion-comfyui-1  |                [--fp16-vae | --fp32-vae | --bf16-vae] [--cpu-vae]
stable-diffusion-comfyui-1  |                [--fp8_e4m3fn-text-enc | --fp8_e5m2-text-enc | --fp16-text-enc | --fp32-text-enc]
stable-diffusion-comfyui-1  |                [--force-channels-last] [--directml [DIRECTML_DEVICE]]
stable-diffusion-comfyui-1  |                [--disable-ipex-optimize]
stable-diffusion-comfyui-1  |                [--preview-method [none,auto,latent2rgb,taesd]]
stable-diffusion-comfyui-1  |                [--cache-classic | --cache-lru CACHE_LRU]
stable-diffusion-comfyui-1  |                [--use-split-cross-attention | --use-quad-cross-attention | --use-pytorch-cross-attention]
stable-diffusion-comfyui-1  |                [--disable-xformers]
stable-diffusion-comfyui-1  |                [--force-upcast-attention | --dont-upcast-attention]
stable-diffusion-comfyui-1  |                [--gpu-only | --highvram | --normalvram | --lowvram | --novram | --cpu]
stable-diffusion-comfyui-1  |                [--reserve-vram RESERVE_VRAM]
stable-diffusion-comfyui-1  |                [--default-hashing-function {md5,sha1,sha256,sha512}]
stable-diffusion-comfyui-1  |                [--disable-smart-memory] [--deterministic] [--fast]
stable-diffusion-comfyui-1  |                [--dont-print-server] [--quick-test-for-ci]
stable-diffusion-comfyui-1  |                [--windows-standalone-build] [--disable-metadata]
stable-diffusion-comfyui-1  |                [--disable-all-custom-nodes] [--multi-user] [--verbose]
stable-diffusion-comfyui-1  |                [--front-end-version FRONT_END_VERSION]
stable-diffusion-comfyui-1  |                [--front-end-root FRONT_END_ROOT]
stable-diffusion-comfyui-1  | main.py: error: unrecognized arguments: --no-half-vae

alexxu-amd commented 2 weeks ago

The --no-half-vae flag is for the webui, so in this case it goes in your startup-webui.sh file:

python3 launch.py --skip-python-version-check --enable-insecure-extension-access --listen --port 81 --api --precision full --no-half --no-half-vae

hartmark commented 2 weeks ago

> The --no-half-vae flag is for the webui. So in this case, it is added to your startup-webui.sh file: python3 launch.py --skip-python-version-check --enable-insecure-extension-access --listen --port 81 --api --precision full --no-half --no-half-vae

Aha, I'm mostly using ComfyUI. Is there a workaround for it as well?

schung-amd commented 2 weeks ago

At a glance, I believe the equivalent flags here are --force-fp32 and --fp32-vae.
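
As a rough sense of what running at fp32 instead of fp16 costs in memory, here is a small Python sketch of per-tensor storage at the 2048x2048 resolution discussed above. It is illustrative arithmetic only: the 128-channel full-resolution feature map is a hypothetical intermediate, not a measurement taken from ComfyUI or its VAE, and `tensor_mib` is a helper name invented for this example.

```python
from math import prod

# Storage per element for the precisions named by the ComfyUI flags.
BYTES_PER_ELEMENT = {"fp16": 2, "bf16": 2, "fp32": 4}


def tensor_mib(shape, dtype):
    """Size in MiB of a dense tensor with the given shape and dtype."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype] / (1024 * 1024)


# Decoded RGB image at the resolution from the reproduction.
image = (1, 3, 2048, 2048)
# Hypothetical intermediate activation at full resolution (assumption).
feature_map = (1, 128, 2048, 2048)

for name, shape in (("image", image), ("feature map", feature_map)):
    for dtype in ("fp16", "fp32"):
        print(f"{name:12s} {dtype}: {tensor_mib(shape, dtype):8.1f} MiB")
```

Each tensor doubles in size when moving from fp16 to fp32, so full-resolution intermediates of this kind would grow from about 1 GiB to about 2 GiB apiece on a 16 GB card.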

hartmark commented 1 week ago

> At a glance, I believe the equivalent flags here are --force-fp32 and --fp32-vae.

These flags seem to have helped stability. It took around 10 minutes for the KSampler to finish, but the Decode VAE step was taking forever, so I aborted after 30 minutes. The good thing is that it no longer crashes consistently after just a few seconds.

However, if I run this workflow (https://github.com/ROCm/ROCm/issues/3580#issuecomment-2299857204), it crashes consistently again. I got it to work by setting the resolution to just 512x512, adding --lowvram --reserve-vram 3, and switching to the Flux GGUF Q5_1 model.

Is there any more logging I can enable, or anything else I can do to make it easier to pinpoint the issue?
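
One low-overhead option, not suggested by anyone in this thread, is to record the memory counters the amdgpu kernel driver exposes in sysfs while the workflow runs, so the last values before a lockup are on disk. The Python sketch below reads `mem_info_vram_*` and `mem_info_gtt_*` (values in bytes). The `card0` path is an assumption that depends on the system, and the helper names are invented for illustration.

```python
from pathlib import Path

# Counter files exposed by the amdgpu driver; values are in bytes.
COUNTERS = (
    "mem_info_vram_total",
    "mem_info_vram_used",
    "mem_info_gtt_total",
    "mem_info_gtt_used",
)


def read_amdgpu_memory(device_dir):
    """Return the counters found under device_dir as {name: bytes}."""
    device = Path(device_dir)
    result = {}
    for name in COUNTERS:
        path = device / name
        if path.is_file():
            result[name] = int(path.read_text().strip())
    return result


def format_usage(counters):
    """Render the counters as one log line in GiB."""
    if not counters:
        return "no amdgpu memory counters found"
    gib = 1024 ** 3
    return ", ".join(
        f"{name}={value / gib:.2f}GiB" for name, value in counters.items()
    )


# Assumed device path; the card index may differ on other systems.
print(format_usage(read_amdgpu_memory("/sys/class/drm/card0/device")))
```

Calling this once per second from a small loop and appending the line to a file that is flushed after each write would show how close VRAM and GTT usage were to their limits just before a hang.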