comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

commandline argument --cuda-device is ineffective for HIP/ROCm (AMD) backend #5585

Closed Bratzmeister closed 22 hours ago

Bratzmeister commented 1 day ago

Expected Behavior

When using e.g. --cuda-device 1, ComfyUI should use the device with ID 1.

Actual Behavior

Regardless of what I pass to --cuda-device, ComfyUI always uses the device with ID 0.

Steps to Reproduce

  1. start ComfyUI with --cuda-device <any ID other than 0> on an AMD/HIP system
  2. load any workflow
  3. queue/start inference or whatever else you do with comfy on your GPU
  4. observe in system/OS monitoring tools that GPU 0 is used no matter what was set in step 1
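The behavior behind step 1 can be reduced to a torch-free sketch. The helper name below is hypothetical; the real handling (linked further down) only sets CUDA_VISIBLE_DEVICES, which a ROCm build of PyTorch ignores:

```python
import os

def set_cuda_device(device_id: int) -> None:
    """Mimics what ComfyUI's --cuda-device handling does (hypothetical
    helper name): it sets CUDA_VISIBLE_DEVICES and nothing else."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)

set_cuda_device(1)
# The assignment itself succeeds, so nothing shows up in the logs...
print(os.environ["CUDA_VISIBLE_DEVICES"])
# ...but on a HIP/ROCm build of torch, torch.cuda still enumerates all GPUs.
```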

Debug Logs

Nothing is visible in the logs: no error is raised, because setting the environment variable CUDA_VISIBLE_DEVICES succeeds as an operation; the variable is simply ignored later by pytorch+rocm. I can reproduce the behavior in plain python/torch. See the next field for a PoC and explanation.

Other

For reference, I have two AMD RX 7900 XT cards in my system. With --cuda-device 1, only the 2nd GPU is supposed to be exposed to torch, but since the switch has no effect (the environment variable is ignored by pytorch+rocm), torch falls back to the default CUDA device, cuda0, i.e. my 1st GPU instead of the 2nd. Below is an example showing that the correct environment variable yields the expected result.

Example of setting CUDA_VISIBLE_DEVICES to 1, which is what --cuda-device 1 does in https://github.com/comfyanonymous/ComfyUI/blob/2d28b0b4790e3f6c2287be49d9872419eadfe5bb/main.py#L73:

(comfy_env) axt@weilichskann ~/zeugs/AI/ComfyUI $ CUDA_VISIBLE_DEVICES=1 python
Python 3.11.10 (main, Sep 20 2024, 14:12:56) [GCC 13.3.1 20240614] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda # output is empty because no cuda backend exists for this torch version  
>>> torch.version.hip
'6.1.40091-a8dbc0c19'
>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_name(0)
'AMD Radeon RX 7900 XT'
>>> torch.cuda.get_device_name(1)
'AMD Radeon RX 7900 XT'

However, when setting HIP_VISIBLE_DEVICES instead, it actually works:

(comfy_env) axt@weilichskann ~/zeugs/AI/ComfyUI $ HIP_VISIBLE_DEVICES=1 python 
Python 3.11.10 (main, Sep 20 2024, 14:12:56) [GCC 13.3.1 20240614] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
1

Sadly, the pytorch and ROCm documentation is misleading in this regard: it suggests the two env vars are interchangeable, but despite the claim in the ROCm documentation (see links below), that is not the case for pytorch.

https://pytorch.org/docs/stable/notes/hip.html
https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html#cuda-visible-devices
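A possible fix could pick the variable the backend actually honors. This is only a sketch, not the actual ComfyUI code; the function name is hypothetical, and `is_hip` is passed in explicitly here to keep the snippet torch-free (in ComfyUI it could be derived from `torch.version.hip` being non-None, checked before any CUDA/HIP initialization):

```python
import os

def set_visible_device(device_id: int, is_hip: bool) -> None:
    """Sketch of a backend-aware --cuda-device: ROCm builds honor
    HIP_VISIBLE_DEVICES, CUDA builds honor CUDA_VISIBLE_DEVICES.
    Must run before torch initializes the device runtime."""
    var = "HIP_VISIBLE_DEVICES" if is_hip else "CUDA_VISIBLE_DEVICES"
    os.environ[var] = str(device_id)

# On a ROCm build (where torch.version.hip is a version string, not None):
set_visible_device(1, is_hip=True)
print(os.environ["HIP_VISIBLE_DEVICES"])
```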

(comfy_env) axt@weilichskann ~/zeugs/AI/ComfyUI $ rocm-smi -i

============================ ROCm System Management Interface ============================
=========================================== ID ===========================================
GPU[0]      : Device Name:      Navi 31 [Radeon RX 7900 XT/7900 XTX/7900M]
GPU[0]      : Device ID:        0x744c
GPU[0]      : Device Rev:       0xcc
GPU[0]      : Subsystem ID:     NITRO+ RX 7900 XT Vapor-X
GPU[0]      : GUID:         56961
GPU[1]      : Device Name:      Navi 31 [Radeon RX 7900 XT/7900 XTX/7900M]
GPU[1]      : Device ID:        0x744c
GPU[1]      : Device Rev:       0xcc
GPU[1]      : Subsystem ID:     0x5317
GPU[1]      : GUID:         28574
==========================================================================================
================================== End of ROCm SMI Log ===================================
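Until the flag handles ROCm, a workaround is to export the variable the HIP runtime honors before launching, bypassing --cuda-device entirely (launch line commented out here, since it needs a ComfyUI checkout):

```shell
# Expose only GPU 1 to the HIP runtime before torch is imported.
export HIP_VISIBLE_DEVICES=1
echo "$HIP_VISIBLE_DEVICES"
# python main.py   # launch ComfyUI with only the 2nd GPU visible
```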
Bratzmeister commented 1 day ago

I did change the code myself as suggested above. However, there seems to be an issue with my dual-GPU setup: I get a segfault after the model is loaded, right when inference is supposed to start. Maybe someone with a similar setup (integrated graphics might work too, but I don't have any that's supported by ROCm) can test this.