comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

Speed up the first run. #1992

Open Robson1970 opened 11 months ago

Robson1970 commented 11 months ago

How to speed up the first run by up to 800%, especially with hard drives: just copy the checkpoint from your HDD to the SSD, or make a copy of it in the same folder and then delete the copy afterwards. Example: revAnimated_v122EOL.safetensors (5 GB) takes 4 minutes to load, compared to 30 sec. after copying the checkpoint from the HDD to the SSD and then deleting the file. ComfyUI loads the checkpoint 8 times faster.

NeedsMoar commented 11 months ago

This isn't an issue with ComfyUI. 4-minute load times from a hard drive do indicate an issue with the hard drive, though.

jn-jairo commented 11 months ago

Just copy the checkpoint from your HDD to the SDD, or make a copy itself in the same folder and then delete the "copy from the checkpoint".

When you access a file, the OS keeps a cache of it in RAM, so when you copy the file the OS loads it into RAM, and when you access it later in ComfyUI it is already in RAM.

There is nothing to fix; if you want a faster transfer to RAM, put the file on a faster drive.
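
A quick way to see the page-cache effect for yourself is to time the same read twice (a minimal sketch; the path is a placeholder):

    import time

    def timed_read(path, chunk_size=16 * 1024 * 1024):
        """Read the whole file in chunks and return the elapsed time in seconds."""
        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return time.perf_counter() - start

    ckpt = "models/checkpoints/example.safetensors"  # placeholder path
    print("cold read:", timed_read(ckpt))  # limited by the drive (if not already cached)
    print("warm read:", timed_read(ckpt))  # served from the OS cache, usually much faster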

Robson1970 commented 11 months ago

jn-jairo

My config: 13900KS, 2x Red HDD 22 TB, 3x Kingston KC3000 SSD 4 TB, 64 GB DDR5 7000 MHz, RTX 4090. I keep my checkpoints on the HDD because they total over 18 TB. That's what I mean: preload the checkpoint first and then load the file from RAM, instead of loading it straight from the drive.

NeedsMoar commented 11 months ago

Checkpoints should load in ~15s from those hard drives... I don't think they're quite 300MB/s but they're up there. That's kinda why I said something was wrong, 4 minutes is 1990s IDE drive speeds. The 18GB 15k Cheetahs I had in ~2001/2002 would have managed it in under a minute (if the computer had anywhere near enough ram to load a 5GB file back then anyway).

Copying a file to another drive will probably put it in cached memory as far as the opening program is concerned, but when you delete it, you're killing the file you loaded from, so Windows (and anything else) has to invalidate the standby memory / cache (meaning it's freed as soon as Comfy closes it rather than being kept around until you run out of RAM).

Are you saying you have 18TB of checkpoints (actual) on one 22TB (advertised) drive? How much free space is it showing? You could be hitting MFT issues depending on cluster size.

Robson1970 commented 11 months ago

Are you saying you have 18TB of checkpoints (actual) on one 22TB (advertised) drive? How much free space is it showing? You could be hitting MFT issues depending on cluster size.

One drive is only for ComfyUI, currently with 2.18 TB free on the HDD. I can copy the 5 GB checkpoint file from the HDD to the desktop SSD in around 30 sec. Once the file is copied, ComfyUI starts the checkpoint immediately. I made a little video of copying the file and running ComfyUI with the RAM cache "clear".

https://files.catbox.moe/9vze6v.mp4

jn-jairo commented 11 months ago

Thanks for the video, I misunderstood it. Yeah, there is something weird going on; even my potato laptop can load faster, and I also store the checkpoints on an HDD because it has more space.

Robson1970 commented 11 months ago

Test with Stable Diffusion (A1111), same HDD drive and the same checkpoint: model loaded in 305 sec., apply weights to model 298 sec.

Then with the option "Disable memmapping for loading .safetensors files." enabled:

New test: weights load in 37 sec., about 8 times faster.

NeedsMoar commented 11 months ago

Edit: I missed your post. That's the thing I always forget about; I hit it with Nod.ai Shark on an NVMe. Python has an absolutely horrid implementation of memory-mapped files on Windows because they sorta blindly tried to implement it as much like the Linux version as possible. Nod wasn't that bad, but it couldn't even use models from standby memory because of the way they were loaded. I'd completely forgotten this is sometimes an issue. They never quite figured out what triggered it over there.

Yeah, it kinda threw up an alarm. There aren't many possibilities here, so I'll just go from best news to worst news.
1) Firstly, the speeds Windows Explorer displays don't really represent how fast the drive is actually reading / writing at the time. If there's enough RAM to cache the file during the copy, it'll show you as fast a speed as possible and finish in the background to keep the UI responsive. You'll need to look at Task Manager or perfmon to really see what reads from that drive are doing. That drive may be slowly defragmenting in the background, or even worse compressing files if that's turned on, and delaying copies; it's hard to say. You can use Sysinternals RAMMap or other tools to try to view what's going on with disk caches, but I kinda don't think that's the issue, since 200 MB/s is about what I'd expect out of a mostly full current-gen drive.

2) I don't know the arrangement of your motherboard, but those Intels only have 20 PCIe lanes. It doesn't really matter how they're allocated; the three gen 4 NVMes will still eat up 12 lanes. Most of the time, to give some appearance of things actually working, motherboard manufacturers will throw a couple of onboard M.2 slots on the southbridge instead of putting them on the CPU and having them eat into the x16 GPU slot, but it's hard to say. Two gen 4 NVMes will probably saturate the southbridge if they're both trying to do a bunch of work too, and that's where the SATA bandwidth needs to be. I'm not familiar with modern Intel CPUs, either; they may just have trouble spreading bandwidth around properly because of some weirdness of the performance-core thing. Copying to the GPU from stuff on the other side of the southbridge (which is a PCIe switch) might be insanely more CPU-heavy than copying from an NVMe on the CPU side would be. It could also be AHCI turned off in the BIOS for some reason, or a loose cable. It's hard to say.

3) The most likely suspect to me would be early drive failure. I haven't had any of the newer generation helium drives die on me yet but I'm not using anything but datacenter drives and the majority of them are SAS 12Gb/s which tend to be made better. You can end up with slowdowns from an NTFS disk being that full (Windows has problems defragmenting it properly or expanding the MFT) but that generally doesn't happen unless you formatted with a small cluster size and have millions of files, and not to the level you're seeing.

I'd strongly suggest checking event viewer for controller errors, and downloading smartctl and running extended self tests on both drives. SATA SMART isn't all that useful IMO but you should see it throwing checksum errors / read re-attempts if it's failing and that's something else to help troubleshoot. Everything else is outside the scope of the comfy issues page because there's obviously some kind of hardware issue going on.

Robson1970 commented 11 months ago

@NeedsMoar

Thanks for the detailed post and for giving me the hints.

I have the same issue on my second drive with the safetensors. ComfyUI loads the .ckpt checkpoint around 10x faster than the .safetensors files.

It's very strange: some safetensors files load faster than other safetensors files. Maybe there's an option in safetensors for how to handle the file?

In AUTOMATIC1111 there was a patch for loading the safetensors files: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/11216 https://github.com/huggingface/safetensors/pull/140


Preloading the safetensors into RAM should fix it.

jn-jairo commented 11 months ago

So, let's try the A1111 method, change this line https://github.com/comfyanonymous/ComfyUI/blob/d9d8702d8dd2337c64610633f5df2dcd402379a8/comfy/utils.py#L13

to this

        sd = safetensors.torch.load(open(ckpt, 'rb').read())

and see if this solves your issue.
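
For context, the surrounding code in comfy/utils.py looks roughly like this (a simplified sketch; exact names and extra handling may differ between versions). The change swaps the memory-mapped load_file call for reading the whole file into RAM first:

    import torch
    import safetensors.torch

    def load_torch_file(ckpt, safe_load=False, device=None):
        # Simplified: the real function has extra handling for .ckpt / pickle safety.
        if device is None:
            device = torch.device("cpu")
        if ckpt.lower().endswith(".safetensors"):
            # original (memory-mapped; can be very slow on some HDD / Windows setups):
            # sd = safetensors.torch.load_file(ckpt, device=device.type)
            # proposed change: read the whole file into RAM, then parse it
            # (note: this always returns CPU tensors, unlike load_file(..., device=...))
            sd = safetensors.torch.load(open(ckpt, 'rb').read())
        else:
            sd = torch.load(ckpt, map_location=device)
        return sd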

Robson1970 commented 11 months ago

So, let's try the A1111 method, change this line

https://github.com/comfyanonymous/ComfyUI/blob/d9d8702d8dd2337c64610633f5df2dcd402379a8/comfy/utils.py#L13

to this

        sd = safetensors.torch.load(open(ckpt, 'rb').read())

and see if this solves your issue.

You are godlike! Master of scripts!

You saved me from buying 20 TB of SSD.

The safetensors files load faster than I expected. The same checkpoint that took 4 min. before now loads in 29 sec. The 16 GB safetensors file loads at 220-280 MB/s from the HDD, in 76 sec.

You are the time saver; safetensors load times are 8-10x faster now. It's like Christmas. Cheers, mate.


freeAhao commented 10 months ago

So, let's try the A1111 method, change this line

https://github.com/comfyanonymous/ComfyUI/blob/d9d8702d8dd2337c64610633f5df2dcd402379a8/comfy/utils.py#L13

to this

        sd = safetensors.torch.load(open(ckpt, 'rb').read())

and see if this solves your issue.

Huge Thanks!!!

cheadrian commented 9 months ago

So, let's try the A1111 method, change this line https://github.com/comfyanonymous/ComfyUI/blob/d9d8702d8dd2337c64610633f5df2dcd402379a8/comfy/utils.py#L13

to this

        sd = safetensors.torch.load(open(ckpt, 'rb').read())

I can confirm this increased the speed of reading from the HDD about 6x, from about 15 MB/s to 120 MB/s.

Luxcium commented 8 months ago

ChatGPT has a good way to clarify the solution mentioned and its implications for performance:

  1. safetensors.torch.load(open(ckpt, 'rb').read()):

    • This method first reads the entire file content into memory as a byte string by using open(ckpt, 'rb').read(), and then it passes this byte string to safetensors.torch.load to convert it into tensors.
    • This process is typically faster for loading data because it minimizes disk I/O operations by loading the file content into memory in one go, before converting it to tensors. It's a two-step process where data is first read into memory and then processed.
  2. safetensors.torch.load_file(ckpt, device=device.type):

    • This method directly loads tensors from a safetensors file into the specified device (e.g., CPU or GPU) without explicitly reading the file content into memory first.
    • It is designed to be convenient for directly loading data onto a specific device but may introduce additional overhead because it handles both file reading and data transferring in one step, potentially leading to slower performance compared to the first method.

In summary: the first method trades extra RAM for one large sequential read of the file before parsing it, while the second relies on memory mapping, which on some drives and OS configurations turns into many small scattered reads and ends up much slower.

ChatGPT Sources: Torch API and Speed Comparison.
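
To measure the difference between the two calls on your own machine, a minimal comparison sketch (the path is a placeholder):

    import time
    import safetensors.torch

    ckpt = "models/checkpoints/example.safetensors"  # placeholder path

    # Note: run each timing in a fresh process (or clear the OS cache in between),
    # otherwise the second load is served from cache and the comparison is unfair.
    start = time.perf_counter()
    sd = safetensors.torch.load_file(ckpt)                # memory-mapped load
    print("load_file:", time.perf_counter() - start)

    start = time.perf_counter()
    sd = safetensors.torch.load(open(ckpt, "rb").read())  # read whole file, then parse
    print("load:     ", time.perf_counter() - start)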

marduk191 commented 8 months ago

Can we get this in main so we don't have to modify utils.py on every update, by chance?

nonnonstop commented 4 months ago

I made an extension that monkeypatches this: https://github.com/nonnonstop/comfyui-faster-loading
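
For illustration only (this is not the extension's actual code, and the load_torch_file signature is assumed), a monkeypatch along these lines swaps out the memory-mapped loader at runtime:

    import comfy.utils
    import safetensors.torch

    _original_load = comfy.utils.load_torch_file

    def _patched_load(ckpt, safe_load=False, device=None):
        # For .safetensors, read the whole file into RAM and parse it, skipping mmap.
        if isinstance(ckpt, str) and ckpt.lower().endswith(".safetensors"):
            with open(ckpt, "rb") as f:
                return safetensors.torch.load(f.read())
        # Everything else falls through to the stock loader.
        return _original_load(ckpt, safe_load=safe_load, device=device)

    comfy.utils.load_torch_file = _patched_load
    # Note: modules that did `from comfy.utils import load_torch_file` before this runs
    # keep a reference to the original function and are not affected by the patch.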

cgrossi commented 4 months ago

I still don't understand exactly why this isn't an issue for most people, yet some of us need to have this fix or we suffer extremely long load times every time we switch a checkpoint.

Also, I found a case where the fix actually screws up loading something else. If you make the sd = safetensors.torch.load(open(ckpt, 'rb').read()) change, checkpoints will load faster, but if you try to use a control lora model, you will get an error. I'm not sure why this is...

freecoderwaifu commented 4 months ago

I still don't understand exactly why this isn't an issue for most people, yet some of us need to have this fix or we suffer extremely long load times every time we switch a checkpoint.

Also, I found a case where the fix actually screws up loading something else. If you make the sd = safetensors.torch.load(open(ckpt, 'rb').read()) change, checkpoints will load faster, but if you try to use a control lora model, you will get an error. I'm not sure why this is...

It's Python's/Windows' memory mapping, but more likely Python. This was fixed in A1111 with a different mapping/loading method, as mentioned in the thread. It might be worse the more RAM you have, too: with 64 GB it's very slow, as in 2-3 minutes to load an SDXL checkpoint vs. around 40 s with the fix.

One dumb way to fix it without changing code is to force Windows to cache the file in RAM. You can do this simply by hashing it. You can use 7-Zip's built-in hasher; use the fastest algo, which I think is CRC-32. You can also use the .reg in this link to add hashing to the right-click menu, and use MD5, which should also be fast.

https://www.tenforums.com/tutorials/78681-add-file-hash-context-menu-windows-8-10-a.html
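
The same trick in Python, if you prefer a script to the context menu (a minimal sketch; the path is a placeholder):

    import hashlib

    def warm_cache(path, chunk_size=16 * 1024 * 1024):
        """Read the whole file once so Windows keeps it in standby (cache) memory."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    warm_cache(r"D:\ComfyUI\models\checkpoints\example.safetensors")  # placeholder path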

brendanhoar commented 2 months ago

Looks like we have the same symptoms for gguf files, since the gguf_reader.py library also uses memmap. :(

richardm1 commented 2 months ago

I'm wondering if there's room for further optimization. During a safetensor checkpoint load I would expect my SATA SSD to be fully saturated (100% disk active time and disk queue length near or above 1.0) and a sustained throughput of around 520-540MB/s. Instead I'm seeing around 80-84% active and 380ish MB/s throughput. Sometimes a fair bit lower.

Procmon reveals mostly 32k I/O for Comfy's checkpoint reads, with a handful of larger I/Os up to 1 MB. The reads jump around to different parts of the file, skipping sections then backing up repeatedly. Most of the reads are discontiguous and likely screwing over Windows' read-ahead caching algo. IMHO loading checkpoints from spinning rust has got to suuuuuuck.

OTOH, A1111 saturates the disk regardless of mmap ON or OFF. Exploring its checkpoint file ingestion with Procmon I see 100% 1MB transfers perfectly contiguous from offset zero to EOF.

What is A1111 doing here that Comfy isn't?

brendanhoar commented 2 months ago

@richardm1: In A1111, do you have the optional "Disable memmapping for loading .safetensors files. (fixes very slow loading speed in some cases)" setting enabled?

In Comfy do you have the "https://github.com/nonnonstop/comfyui-faster-loading" node installed?

I do find the access patterns on the memmapped file interesting. I don't know why the small chunks are used and why they are nearly consistently read out of order; it might have to do with multi-threading in the DMA copy-RAM-to-VRAM code somewhere in CUDA?

richardm1 commented 2 months ago

I've tried A1111 both ways -- it fully saturates the disk with 520-540MB/s throughput regardless of memmap.

Regarding Comfy I've just installed comfyui-faster-loading and it is indeed significantly faster. I don't think it's a 100% fix as the throughput bounces around somewhat (unlike A1111 which holds the disk pegged to 100%). But this is much better!

I see some chatter regarding this workaround and WSL -- I am not running WSL here. I'm wondering if the reason why only some people experience slow safetensor loads is due to SATA vs low-end NVMe vs higher-end NVMe. Pure speculation here from a former storage admin: This might be an edge case where NVMe drives with their larger queue depth can leverage native command queueing (NCQ) and scatter-gather to better reorder the incoming dumpster fire of choppy I/Os into something more sane and sequential. I believe SATA drives are universally QD32; NVMe starts at QD64 and the sky is the limit (QD65536 max defined in the protocol). Just a SWAG.

Command queues aside, the other thing with small, blendery I/Os is they underscore the lower latency of NVMe technology. All that said there's a lot of SATA in the wild still...

Thanks for your help with this!

richardm1 commented 2 months ago

Wondering if this read access pattern rings any bells.

zhangxiying commented 2 months ago

This method is applicable to Linux. My ComfyUI is deployed on Linux, and every time I switch models, it is very slow. Some take 1 minute, while others take more than 2 minutes.

Amit30swgoh commented 2 months ago

This method is applicable to Linux. My ComfyUI is deployed on Linux, and every time I switch models, it is very slow. Some take 1 minute, while others take more than 2 minutes.

Me too, same issue. I use Colab via trycloudflare.com.

richardm1 commented 2 months ago

So I got tired of managing my AI stuff spread across four drives. It now lives on a server at the other end of 10Gb Ethernet. I'm doing iSCSI from my desktop and I've confirmed this setup will read/write at full wire speed (around 1.2GB/sec). The comfyui-faster-loading node still manages to cut checkpoint load times in half (or better) with this setup.

brendanhoar commented 1 month ago

Just as a note: Flux GGUF files have the same issues on some systems and unlike with the safetensors memmap (workaround here: https://github.com/nonnonstop/comfyui-faster-loading ), there is no current workaround for the excruciatingly slow loading of GGUF models on some systems (other than perhaps retool the computer's storage subsystem).

richardm1 commented 1 month ago

Just as a note: Flux GGUF files have the same issues on some systems and unlike with the safetensors memmap (workaround here: https://github.com/nonnonstop/comfyui-faster-loading ), there is no current workaround for the excruciatingly slow loading of GGUF models on some systems (other than perhaps retool the computer's storage subsystem).

I'm beginning to play with Flux and I've noticed the same thing. I'm checking if Windows disk caching has any pre-fetch (read-ahead) knobs I can twist. Nothing so far.

I see Python's mmap supports madvise(), where a read can be "pre-declared" to the kernel as sequential, thus hinting the storage prefetch to get jiggy wit' it. But the damn reads have to be memory mapped in the first place -- if they were, we wouldn't be having this conversation.
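
For reference, this is roughly what the madvise hint looks like when you control the mapping yourself (a sketch assuming Python 3.8+ on a POSIX system; the path is a placeholder):

    import mmap

    # Hint the kernel that a memory-mapped model file will be read sequentially.
    with open("/mnt/models/example.safetensors", "rb") as f:  # placeholder path
        mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        mm.madvise(mmap.MADV_SEQUENTIAL)  # tune kernel read-ahead for sequential access
        mm.madvise(mmap.MADV_WILLNEED)    # ask the kernel to start prefetching pages now
        data = mm[:]                      # this copy is largely served from prefetched pages
        mm.close()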

I think the following sets an upper limit to an adaptive value so it might not help. But... Linux users might try something like:

echo 13076 > /sys/block/sdx/queue/read_ahead_kb

...where sdx represents the block (disk) device in play.

BTRFS has:

/sys/fs/btrfs/<UUID>/bdi/read_ahead_kb

Ext4 has:

/sys/fs/ext4/<devname>/inode_readahead_blks -- I don't know how many bytes are in an inode and I'm too tired to look it up.

With ZFS I plan to try:

echo 13193216 > /sys/module/zfs/parameters/zfetch_max_distance

This is a stupid large read-ahead that's likely to screw with overall cache efficacy on a general purpose PC. However, based on my Windows procmon64 captures (and assuming Comfy's pathological disk reads on Linux mirror the Windows pathology) it's enough to bridge the largest of the three "mystery gaps" in Comfy's data reads with 256k to spare. Those gaps being 13,127,680, 9,838,592, and 3,276,800 bytes in repeating intervals.

In the end I wonder if the stupidest workaround is the correct one: copy the entire checkpoint to /dev/null immediately prior to safetensors.torch.load_file() thereby forcing the whole thing into the OS disk cache assuming one has RAM sufficient for such shenanigans.

brendanhoar commented 1 month ago

Just as a note: Flux GGUF files have the same issues on some systems and unlike with the safetensors memmap (workaround here: https://github.com/nonnonstop/comfyui-faster-loading ), there is no current workaround for the excruciatingly slow loading of GGUF models on some systems (other than perhaps retool the computer's storage subsystem).

I'm beginning to play with Flux and I've noticed the same thing. I'm checking if Windows disk caching has any pre-fetch (read-ahead) knobs I can twist. Nothing so far.

There are some ways to tell Windows to do it on a per-file level in the low-level WinAPI. I'd have to dig out prior resea^H^H^H^H^H^H search engine searches... but they would likely involve modifying the internal plumbing in both the gguf and safetensors readers and, well...

...

In the end I wonder if the stupidest workaround is the correct one: copy the entire checkpoint to /dev/null immediately prior to safetensors.torch.load_file() thereby forcing the whole thing into the OS disk cache assuming one has RAM sufficient for such shenanigans.

TIL that Windows has an equivalent to /dev/null, which is the virtual file "nul" (one 'l'), that works at any path. Copying a model to that filename using appropriate large buffer sizes should be sufficient to preload the cache on impacted systems (well, those with enough RAM anyway):

https://gcc.gnu.org/legacy-ml/gcc-patches/2005-05/msg01793.html

So, utilizing NUL, a new node with method overrides that pre-reads each model before loading it would be possible without actually using significant Python memory during the pre-read. That could help. It wouldn't be 100% efficient, but for impacted systems I'd wager total load time to the GPU would probably be within 10% of an unaffected system's load time, a big win particularly for larger models.
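
A minimal sketch of that pre-read idea (a hypothetical helper, not an existing node; in Python the NUL copy can simply be a chunked read that discards the data):

    import safetensors.torch

    def preread(path, chunk_size=64 * 1024 * 1024):
        """Read the file in large chunks and discard the data, warming the OS cache
        (the in-process equivalent of copying the file to NUL / /dev/null)."""
        with open(path, "rb") as f:
            while f.read(chunk_size):
                pass

    ckpt = "models/checkpoints/example.safetensors"  # placeholder path
    preread(ckpt)                            # warm the OS cache first
    sd = safetensors.torch.load_file(ckpt)   # the mmap'd load now hits cached pages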

I could also see running this on an async thread that looks at upcoming items in the queue and pre-loads the next big model into the OS cache ahead of the inference request that needs it; that could be a generalized throughput win, say, for grids.

-B

richardm1 commented 4 weeks ago

TIL that Windows has an equivalent to /dev/null, which is the virtual file "nul" (one 'l'), that works at any path

It goes back at least to DOS 3.21 (1987). Can't vouch for anything earlier.⌛