
Add a launch argument for non_blocking=True #5268

Open · mturnshek opened this issue 1 day ago

mturnshek commented 1 day ago

Feature Idea

In model_management.py

def device_should_use_non_blocking(device):
    if not device_supports_non_blocking(device):
        return False
    return False
    # return True #TODO: figure out why this causes memory issues on Nvidia and possibly others

https://github.com/comfyanonymous/ComfyUI/blob/7390ff3b1ec2e15017ba4a52d6eaabc4aa4636e3/comfy/model_management.py#L834
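Reverting it to the pre-TODO behavior (going by the commented-out line above) would just be:

def device_should_use_non_blocking(device):
    if not device_supports_non_blocking(device):
        return False
    return True  # pre-TODO behavior: allow non_blocking whenever the device supports it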

Changing this function back to its pre-TODO state results in a large speedup in model patching (19s -> 6s for Flux LoRAs on my machine), and it probably speeds up loading in other areas as well.

This is because the largest bottleneck is the one-by-one blocking transfer of each layer of the unet to the GPU, which is massively accelerated if non_blocking=True.
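For illustration, here's a minimal standalone PyTorch sketch of the difference (not ComfyUI's code; it assumes a CUDA device and uses pinned host memory, which is needed for the copies to actually run asynchronously):

import torch

# Pinned (page-locked) host memory lets host->GPU copies run asynchronously.
layers = [torch.randn(2048, 2048).pin_memory() for _ in range(16)]

# Blocking: each .to() waits for its copy to finish before the next one starts.
on_gpu_blocking = [t.to("cuda") for t in layers]

# Non-blocking: the copies are queued back-to-back and overlap with host work,
# so many small per-layer transfers no longer each stall the CPU.
on_gpu_async = [t.to("cuda", non_blocking=True) for t in layers]
torch.cuda.synchronize()  # wait for all queued copies before using the tensors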

Are there still memory issues? Changes made since that TODO was written, such as https://github.com/comfyanonymous/ComfyUI/commit/39f114c44bb99d4a221e8da451d4f2a20119c674, may mean the problems that originally caused them are now less relevant, or gone entirely.

Please consider re-adding support for non_blocking=True as a launch argument so users can start trying it out again.
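A hypothetical sketch of what the opt-in flag could look like (the flag name and wiring are my assumptions, not ComfyUI's actual CLI code; the support check is stubbed so the snippet stands alone):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--use-non-blocking", action="store_true",
                    help="Opt in to non_blocking=True device transfers.")
args, _ = parser.parse_known_args()

def device_supports_non_blocking(device):
    return True  # stub; ComfyUI has its own check in model_management.py

def device_should_use_non_blocking(device):
    if not device_supports_non_blocking(device):
        return False
    return args.use_non_blocking  # off by default; opt in via the launch argument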

Existing Solutions

No response

Other

No response

comfyanonymous commented 1 day ago

Can you check if it's better now?

mturnshek commented 1 day ago

I can confirm that as of your commit from 40 minutes ago (67158994a4356d0ec54aaf3bbc5619c6c119f540), ComfyUI is way faster for both loading and model patching for me.

Loading the base model has gone from 40s to about 5s for me. That's probably a 3-8x speedup overall for patching and loading, which were huge bottlenecks.

mturnshek commented 1 day ago

By the way, have you ever thought about or looked into combining similar layer patches into one tensor before casting them and transferring them to the device in this section of model_patcher.py?

        load_completely.sort(reverse=True)
        for x in load_completely:
            n = x[1]
            m = x[2]
            weight_key = "{}.weight".format(n)
            bias_key = "{}.bias".format(n)
            if hasattr(m, "comfy_patched_weights"):
                if m.comfy_patched_weights == True:
                    continue

            self.patch_weight_to_device(weight_key, device_to=device_to)
            self.patch_weight_to_device(bias_key, device_to=device_to)
            logging.debug("lowvram: loaded module regularly {} {}".format(n, m))
            m.comfy_patched_weights = True

They're all transferred individually because the layers have different shapes and that's the simplest approach. But if similarly shaped layers were combined and sent over as one stacked tensor, the number of transfer calls would drop from something like 429 to ~14 for Flux's unet, with much more data per call.

I'm not sure how efficient sending lots of small tensors is compared to grouping them into a larger block, but the difference could be significant.
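Just to make the idea concrete, here's a rough standalone sketch of grouping weights by shape/dtype before transfer. The names and structure are made up for illustration and aren't tied to ComfyUI's model_patcher internals:

import torch
from collections import defaultdict

def transfer_grouped(weights, device):
    # Illustrative sketch only: group CPU weights by (shape, dtype),
    # transfer each group as one stacked tensor, then split it back
    # into per-layer views on the target device.
    groups = defaultdict(list)
    for name, w in weights.items():
        groups[(w.shape, w.dtype)].append((name, w))

    out = {}
    for items in groups.values():
        names = [n for n, _ in items]
        stacked = torch.stack([w for _, w in items])  # one big host tensor per group
        stacked = stacked.to(device)                  # one transfer per group, not per layer
        for name, moved in zip(names, stacked.unbind(0)):
            out[name] = moved                         # views into the stacked device tensor
    return out

# Example: transfer_grouped({"a.weight": torch.randn(64, 64), "b.weight": torch.randn(64, 64)}, "cuda")

One caveat with this approach: the torch.stack makes an extra host-side copy of every weight, so whether the grouping wins in practice depends on how that copy cost compares to the per-call transfer overhead.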