city96 / ComfyUI_NetDist

Run ComfyUI workflows on multiple local GPUs/networked machines.
Apache License 2.0

Cancel does not work #5

Open picobyte opened 1 year ago

picobyte commented 1 year ago

I have a Tesla M10: 4 GPUs, passively cooled, overheating easily. At 95C a GPU becomes unusable until reboot. Controlling which GPU is active and which one cools down via your extension has several problems:

city96 commented 12 months ago

Huh, interesting idea to round-robin the GPUs. Might even work on my cards. They don't reach shutdown temp but they do thermal throttle (P40s).

Anyway, I don't know if custom nodes get notified when a workflow gets cancelled, but I'll try to figure something out. I can only realistically mess with my multi-GPU setup on the weekend so I'll try to get back to you on this.

(There's a "rewrite" branch but I'm not sure that fixes either of your issues.)

picobyte commented 12 months ago

For the temperature issue I'll try a workaround using temperature protection. If you're interested in the workflow I attempted for switching GPUs: multi_gpu_test.json (currently not working). Possibly this can be done better; I am just starting with ComfyUI.
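A temperature-based guard of the kind mentioned above could be sketched roughly as follows. This is only an illustration, not picobyte's actual workaround: `TEMP_LIMIT_C`, `parse_gpu_temps`, `read_gpu_temps` and `pick_coolest_gpu` are hypothetical names, and the 85C margin is an assumed value chosen to stay below the 95C failure point. The only external dependency is the `nvidia-smi` query interface.

```python
import subprocess

# Assumed safety margin below the 95C point where the GPU becomes unusable.
TEMP_LIMIT_C = 85


def parse_gpu_temps(csv_text):
    """Parse 'index, temperature' CSV lines into a {gpu_index: temp_c} dict."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def read_gpu_temps():
    """Ask nvidia-smi for the current temperature of every GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,temperature.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_temps(result.stdout)


def pick_coolest_gpu(temps, limit_c=TEMP_LIMIT_C):
    """Return the index of the coolest GPU below limit_c, or None if all are too hot."""
    usable = {index: temp for index, temp in temps.items() if temp < limit_c}
    if not usable:
        return None
    return min(usable, key=usable.get)
```

A dispatcher could call `pick_coolest_gpu(read_gpu_temps())` before each job and wait when it returns `None`, so that no job is sent to a GPU that is close to the limit.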

However, I also wonder what the benefit is of one workflow controlling the GPUs versus running ComfyUI multiple times. I think it would be better if dedicated tasks were dispatched to distinct GPUs, like one GPU for adding noise, another for the UNET, one for reconstruction and maybe one for preview image generation[1], or something like that. Alternatively, subsequent cycles could run on distinct GPUs. I mean this just as my (naive) concept of the ideal distribution of work. Or maybe averaging(?) of parallel run cycles or so. [1] https://huggingface.co/blog/stable_diffusion

city96 commented 12 months ago

Okay, so I tried making a round-robin node to switch the URLs but // is interpreted as a comment... I'll get back to this once I find out where the logic for it is in ComfyUI.
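For illustration, the round-robin idea could be sketched in plain Python like this. It is not the node from the screenshot: `make_url_picker` and the worker list are hypothetical, and it assumes the `//` problem only affects text entered into the UI, so it stores host/port pairs and assembles the URL in code.

```python
from itertools import cycle

# Hypothetical worker instances, one ComfyUI process per GPU.
WORKERS = [
    ("127.0.0.1", 8188),
    ("127.0.0.1", 8189),
    ("127.0.0.1", 8190),
    ("127.0.0.1", 8191),
]


def make_url_picker(workers, scheme="http"):
    """Return a function that yields worker URLs in round-robin order."""
    pool = cycle(workers)

    def next_url():
        host, port = next(pool)
        # The URL is assembled here, so '//' never appears in a text field.
        return f"{scheme}://{host}:{port}"

    return next_url
```

Each call to the returned function hands out the next worker, wrapping around after the last one, which gives every GPU a rest while the others run.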


As for the cancel, I added some simple logic to clear it before starting a new job. Now, this isn't optimal since the job keeps running even after you cancel it. I guess I could break it out into a separate "cancel all jobs" node but it'd be much cleaner if there was a way for custom nodes to be notified when a workflow is canceled/interrupted. I already asked comfy so I guess we'll just have to wait for now.
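A separate "cancel all jobs" step like the one mentioned above might look roughly like the sketch below. It is not the code that was added to the extension. It assumes the remote instance is a stock ComfyUI server, which as far as I know accepts `POST /interrupt` to stop the running job and `POST /queue` with `{"clear": true}` to empty the pending queue; `build_cancel_requests` and `cancel_remote_jobs` are hypothetical names.

```python
import json
import urllib.request


def build_cancel_requests(base_url):
    """Build the requests that interrupt the running job and clear the queue."""
    base = base_url.rstrip("/")
    interrupt = urllib.request.Request(
        f"{base}/interrupt", data=b"", method="POST"
    )
    clear_queue = urllib.request.Request(
        f"{base}/queue",
        data=json.dumps({"clear": True}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return [interrupt, clear_queue]


def cancel_remote_jobs(base_url, timeout=5):
    """Send the cancel requests to one remote instance (performs network I/O)."""
    for request in build_cancel_requests(base_url):
        with urllib.request.urlopen(request, timeout=timeout):
            pass  # Only the success of the request matters; the body is ignored.
```

Calling `cancel_remote_jobs` for every worker URL would stop the remote jobs instead of letting them run to completion, but it still has to be triggered manually until custom nodes can be notified of a cancel.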

(Sorry, the readme is still a mess, I'll try to clean it up and then I'll merge the rewrite branch into the main one if everything works.)