city96 / ComfyUI_NetDist

Run ComfyUI workflows on multiple local GPUs/networked machines.
Apache License 2.0

How can I set the CUDA device (i.e. --cuda-device 0 or 1) #16

Open Luxcium opened 7 months ago

Luxcium commented 7 months ago

Issue: Inconsistency in CUDA Device Setting (Device 1 vs. Device 0)

Technical Environment as per ComfyUI:

[Screenshot: Screenshot_20240222_062131 (ComfyUI startup log / environment details)]

Originally posted by @Luxcium in https://github.com/comfyanonymous/ComfyUI/issues/2396#issuecomment-1959245545

Issue Description:

I am experiencing an inconsistency when attempting to assign operations to a specific CUDA device. Despite explicitly setting the CUDA device to 1, the system reports that it is utilizing device 0, as indicated in the following output:

Set cuda device to: 1
...
Device: cuda:0 NVIDIA TITAN Xp COLLECTORS EDITION : cudaMallocAsync

This discrepancy is concerning, particularly because it has led to a complete system shutdown, similar to what one would experience during a power outage. Initially this made me question whether dual-GPU operation was feasible at all, but the underlying problem appears to be that processing is misdirected onto a single GPU.

The exact location in the code where this device assignment discrepancy occurs is unclear to me. My hypothesis was that it could be addressed in comfy/model_management.py#L73, but my attempts to resolve it have not been successful.

This issue has left me at an impasse, unable to determine a solution, either for a local fix or for a pull request. I am hoping that this description will bring to light an easily correctable configuration error for those more familiar with the intricacies of CUDA device management.


My assistant is not always perfect, but he helped me write this issue, delving into the realm of tapestry where a symphony of... well, you get the idea...

city96 commented 7 months ago

The ComfyUI --cuda-device argument works by setting CUDA_VISIBLE_DEVICES here in main.py.
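For reference, the existing block in main.py looks roughly like this (I'm reconstructing it from the commented-out line further down, so treat it as a sketch):

if __name__ == "__main__":
    if args.cuda_device is not None:
        # Hide every GPU except the requested one before torch is imported,
        # so the chosen physical card becomes cuda:0 inside the process.
        os.environ['CUDA_VISIBLE_DEVICES'] = str(args.cuda_device)
        print("Set cuda device to:", args.cuda_device)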

So if CUDA_VISIBLE_DEVICES isn't working, then the command line argument also won't work. I remember seeing some complaints about the January 2024 NVIDIA drivers being flaky, so you could try up/downgrading if you're on one of those versions.
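As a quick sanity check (a minimal standalone sketch, not part of ComfyUI), you could also see whether PyTorch itself respects the variable on your driver:

# check_visible.py -- hypothetical standalone script; run as:
#   CUDA_VISIBLE_DEVICES=1 python check_visible.py
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible device count =", torch.cuda.device_count())
if torch.cuda.is_available():
    # With CUDA_VISIBLE_DEVICES=1, cuda:0 inside this process should map to
    # the second physical GPU, so this should print that card's name.
    print("cuda:0 is:", torch.cuda.get_device_name(0))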

You could also try changing main.py to use torch.cuda.set_device instead, but I'm not sure if it would work; I'm not at my PC at the moment so I can only guess lol. Still, you could try modifying it like this:

if __name__ == "__main__":
    if args.cuda_device is not None:
        # Import torch early and select the requested device directly,
        # instead of hiding the other GPUs via CUDA_VISIBLE_DEVICES.
        import torch
        torch.cuda.set_device(f"cuda:{args.cuda_device}")
        # os.environ['CUDA_VISIBLE_DEVICES'] = str(args.cuda_device)
        print("Set cuda device to:", args.cuda_device)