Acly / krita-ai-diffusion

Streamlined interface for generating images with AI in Krita. Inpaint and outpaint with optional text prompt, no tweaking required.

Support for comfyui-faster-loading node for faster loading checkpoints #1110

Open freecoderwaifu opened 3 weeks ago

freecoderwaifu commented 3 weeks ago

There's a long-standing issue with Python and/or Windows memmap that causes safetensors loading to be really slow in certain situations.

https://github.com/comfyanonymous/ComfyUI/issues/1992

A1111 fixed this a while ago with custom memmapping, and this custom ComfyUI node also seems to help substantially with loading speeds, with no noticeable downsides.

https://github.com/nonnonstop/comfyui-faster-loading

It would really help a lot if this loading method were supported by Krita AI Diffusion, as it cuts loading times from about 3 minutes to about 40 seconds for SDXL, and from about 1 min 15 s to about 12 seconds for SD 1.5, at least on an HDD.
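A rough sketch of the general idea behind such workarounds (not the node's actual code, and assuming the standard safetensors Python API): instead of letting safetensors memory-map the file and page tensors in on demand, read the whole file into RAM in one contiguous pass and deserialize from that buffer.

```python
import safetensors.torch

def load_checkpoint_default(path: str):
    # Default path: load_file() memory-maps the file and tensor data is paged
    # in on demand -- per the linked issue, this can turn into scattered small
    # reads that are very slow on HDDs.
    return safetensors.torch.load_file(path)

def load_checkpoint_buffered(path: str):
    # Workaround: one sequential read of the whole file (HDD-friendly),
    # at the cost of briefly holding the entire checkpoint in RAM.
    with open(path, "rb") as f:
        data = f.read()
    return safetensors.torch.load(data)
```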

Acly commented 2 weeks ago

As far as I can see that "custom node" works without changes in the Krita plugin, as long as you have it installed.

It's a somewhat hacky solution so I'm not sure about installing it by default. The tradeoff is probably that it needs more RAM (briefly, for loading). Not an issue for most people, but who knows, I also thought nobody was using HDDs anymore.

richardm1 commented 2 weeks ago

I hypothesize this can impact SATA SSDs along with people pulling checkpoints across a LAN connection.

Soapbox mode: Ever since SSDs went mainstream 12-15 years ago I've feared coders would generally stop caring about efficient storage I/O given that devices with sub-millisecond latency can cover for an enormous number of sins. ComfyUI's slow checkpoint loading for some users might be a case study. Rant over.

IMHO A1111 is a role model for checkpoint loading. On my system it uses 1MB block I/O and it reads the entire file in a perfectly contiguous manner, fully saturating my disk subsystem (SATA SSD). ComfyUI fragments checkpoint reads all over the place, skipping around to various offsets within the file and apparently ruining Windows' efforts at read-ahead caching. Most of the I/O is 32k.

Whose arm can we twist to perhaps bring Comfy's default checkpoint loads up to the A1111 performance level?
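For scale (assuming a roughly 7 GB SDXL checkpoint): 32 KiB requests mean on the order of 230,000 I/O operations per load, while 1 MiB requests mean about 7,000. A minimal sketch of the sequential, large-block access pattern described above:

```python
def read_file_sequential(path: str, block_size: int = 1024 * 1024) -> bytes:
    # Read the file front to back in large fixed-size blocks, so the OS
    # read-ahead cache sees one contiguous stream instead of scattered seeks.
    chunks = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```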

freecoderwaifu commented 2 weeks ago

As far as I can see that "custom node" works without changes in the Krita plugin, as long as you have it installed.

It's a somewhat hacky solution so I'm not sure about installing it by default. The tradeoff is probably that it needs more RAM (briefly, for loading). Not an issue for most people, but who knows, I also thought nobody was using HDDs anymore.

Man, I really should have installed and tested it before posting about it (yeah, I'm that dumb). It does work on its own by just installing it on a custom ComfyUI install.

I guess this GitHub issue could still serve as feedback or a suggestion for its possible inclusion in the local managed server, or even as a place to suggest better alternatives, but I agree with the considerations around it.

HDDs are slow but still cheaper for model hoarders (yeah, me).

Acly commented 2 weeks ago

Ever since SSDs went mainstream 12-15 years ago I've feared coders would generally stop caring about efficient storage I/O given that devices with sub-millisecond latency can cover for an enormous number of sins.

I guess what you call a sin, others would call a feature.

It's always a trade-off: mem-mapping and reading arbitrary positions require less RAM, skip the time needed to fully read the file, and potentially allow you to load only what you need. But it's terrible for HDDs, and I'm also not claiming it's the best fit for Comfy here. It just seems to be the default for the safetensors library (which does care, and probably measured load times, on some systems).

I once optimized pulling large files that people kept on 1) NVMe SSDs, 2) HDDs, 3) network shares, and there is no one method that is best for all. Then you switch to Linux and everything looks different again.

I'd agree that loading into RAM with modest block size is probably a safer default.
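To make the trade-off concrete, here is a minimal sketch of the memory-mapped side (illustrative only, not safetensors or ComfyUI internals): only the byte ranges that are actually touched get read, which is cheap on SSDs but can turn into a seek per range on an HDD. The alternative is the buffered whole-file read sketched earlier in the thread.

```python
import mmap

def read_ranges_mmapped(path: str, ranges: list[tuple[int, int]]) -> list[bytes]:
    # Map the file without reading it up front; slicing the map pages in
    # only the requested (offset, length) ranges.
    out = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset, length in ranges:
                out.append(mm[offset:offset + length])
    return out
```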

richardm1 commented 2 weeks ago

I finally got around to "sniffing" Comfy's filesystem I/O with Procmon during a checkpoint load -- this time with the comfyui-faster-loading module installed. What was a dumpster fire of 32k I/Os is now 2MB I/Os and perfectly sequential.

I also fed an LLM the list of Comfy's file read offsets (when in "slow mode") and asked if any patterns stood out (a small script that recomputes these interval statistics is sketched at the end of this comment):

First offset: 1,007,333,376
Last offset: 1,999,732,736
Number of offsets: 185

Most common intervals:

13,127,680 bytes (seen 31 times)
9,838,592 bytes (seen 30 times)
4,718,592 bytes (seen 29 times)

Less common intervals:

3,280,896 bytes (seen 6 times)
528,384 bytes (seen 8 times)
294,912 bytes (seen 11 times)

Anomalies:

36,864 bytes (seen once, between 1,637,011,456 and 1,637,048,320)
Several other unique intervals in the latter part of the sequence

Pattern Comparison:

Initial pattern (similar to previous sequence):

Increase by 13,127,680
Increase by 9,838,592
(Repeat)

Middle section (new pattern): Consistent increases of 4,718,592

Later section (new pattern):

More varied intervals, including 528,384 and 294,912

Similarities with Previous Sequence:

The initial pattern of alternating 13,127,680 and 9,838,592 intervals is present. The anomaly of 36,864 bytes appears once, similar to the previous sequence. The common intervals are still multiples of 4096 bytes (page size).

Differences from Previous Sequence:

New patterns emerge in the latter part of the sequence. More varied intervals appear, especially towards the end.

Observations:

The sequence seems to transition through different reading patterns. The latter part of the sequence shows more frequent, smaller jumps. The end of the sequence has some very small intervals (e.g., 819,200 bytes), suggesting possible fine-grained reading or metadata processing.

ComfyIO-blender.😩
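For reference, the interval statistics above can also be recomputed deterministically without an LLM. A small sketch, assuming the Procmon offsets were exported one number per line to a text file (the filename here is made up):

```python
from collections import Counter

def interval_stats(path: str = "comfy_read_offsets.txt") -> None:
    # Offsets are assumed one per line, possibly with thousands separators.
    with open(path) as f:
        offsets = [int(line.replace(",", "")) for line in f if line.strip()]
    intervals = [b - a for a, b in zip(offsets, offsets[1:])]
    print(f"First offset: {offsets[0]:,}  Last offset: {offsets[-1]:,}")
    print(f"Number of offsets: {len(offsets)}")
    print("Most common intervals:")
    for interval, count in Counter(intervals).most_common(5):
        print(f"  {interval:,} bytes (seen {count} times)")
```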