comfyanonymous / ComfyUI


RAM Allocation Problems after Update 1369 [968078b1] #1368

Open elphamale opened 1 year ago

elphamale commented 1 year ago

This update brought a lot of "DefaultCPUAllocator: not enough memory: you tried to allocate 37748736 bytes." errors, and most of my workflows became unusable. I have 16 GB of RAM and tried increasing the pagefile; it helped to an extent, but any workflow that involves much more than SDXL + refiner is still unusable.

If there's a workaround, please share it.

comfyanonymous commented 1 year ago

If you look at the memory usage in task manager what does the graph look like?

elphamale commented 1 year ago

> If you look at the memory usage in task manager what does the graph look like?

The graph shows 85-95% RAM in use and at least 50% VRAM in use through most of the workflow. At the moment of the error, RAM was at 93%.

```
!!! Exception during processing !!!
Traceback (most recent call last):
  File "C:\StableDiffusion\ComfyUI\execution.py", line 151, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
  File "C:\StableDiffusion\ComfyUI\execution.py", line 81, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
  File "C:\StableDiffusion\ComfyUI\execution.py", line 74, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
  File "C:\StableDiffusion\ComfyUI\custom_nodes\ComfyUI_UltimateSDUpscale\nodes.py", line 124, in upscale
    processed = script.run(p=sdprocessing, =None, tile_width=tile_width, tile_height=tile_height,
  File "C:\StableDiffusion\ComfyUI\custom_nodes\ComfyUI_UltimateSDUpscale\repositories\ultimate_sd_upscale\scripts\ultimate-upscale.py", line 553, in run
    upscaler.process()
  File "C:\StableDiffusion\ComfyUI\custom_nodes\ComfyUI_UltimateSDUpscale\repositories\ultimate_sd_upscale\scripts\ultimate-upscale.py", line 136, in process
    self.image = self.redraw.start(self.p, self.image, self.rows, self.cols)
  File "C:\StableDiffusion\ComfyUI\custom_nodes\ComfyUI_UltimateSDUpscale\repositories\ultimate_sd_upscale\scripts\ultimate-upscale.py", line 243, in start
    return self.linear_process(p, image, rows, cols)
  File "C:\StableDiffusion\ComfyUI\custom_nodes\ComfyUI_UltimateSDUpscale\repositories\ultimate_sd_upscale\scripts\ultimate-upscale.py", line 178, in linear_process
    processed = processing.process_images(p)
  File "C:\StableDiffusion\ComfyUI\custom_nodes\ComfyUI_UltimateSDUpscale\modules\processing.py", line 116, in process_images
    (latent,) = vae_encoder.encode(p.vae, batched_tiles)
  File "C:\StableDiffusion\ComfyUI\nodes.py", line 279, in encode
    t = vae.encode(pixels[:,:,:,:3])
  File "C:\StableDiffusion\ComfyUI\comfy\sd.py", line 222, in encode
    model_management.free_memory(memory_used, self.device)
  File "C:\StableDiffusion\ComfyUI\comfy\model_management.py", line 323, in free_memory
    m.model_unload()
  File "C:\StableDiffusion\ComfyUI\comfy\model_management.py", line 294, in model_unload
    self.model.unpatch_model(self.model.offload_device)
  File "C:\StableDiffusion\ComfyUI\comfy\model_patcher.py", line 269, in unpatch_model
    self.model.to(device_to)
  File "C:\StableDiffusion\venv\Lib\site-packages\torch\nn\modules\module.py", line 1145, in to
    return self._apply(convert)
  File "C:\StableDiffusion\venv\Lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "C:\StableDiffusion\venv\Lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "C:\StableDiffusion\venv\Lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 7 more times]
  File "C:\StableDiffusion\venv\Lib\site-packages\torch\nn\modules\module.py", line 820, in _apply
    param_applied = fn(param)
  File "C:\StableDiffusion\venv\Lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 37748736 bytes.
```

Prompt executed in 142.45 seconds
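
For scale, the allocation that failed in the traceback above is tiny compared to the model itself: roughly 36 MiB. A quick back-of-the-envelope check (the tensor shape below is only a guess, included to illustrate the arithmetic) suggests system RAM was already essentially exhausted by the time the unload tried to move weights back to the CPU.

```python
# The failed allocation is ~36 MiB; a float32 tensor of shape (1, 4, 1536, 1536)
# happens to be exactly this size (the shape is a guess, for illustration only).
failed_bytes = 37_748_736
print(failed_bytes / 2**20)      # 36.0 (MiB)
print(1 * 4 * 1536 * 1536 * 4)   # 37748736 bytes at 4 bytes per element
```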

NeedsMoar commented 1 year ago

What GPU are you using? Since the GPU memory usage isn't pegged at 100% I'll assume it's not AMD with DirectML, and since it isn't higher I'm guessing you don't have less system RAM than VRAM. A list of the hardware you've got would help. :D

I suspect you're massively underestimating how much RAM it takes to run these models... I was actually a bit surprised when I checked and got the numbers below. I'm not familiar enough with the internals of Python, but I know Windows tries hard to avoid paging anything out of an active process, because that can quickly become a shitshow of terrible performance as page faults build up, and it probably treats anything that has just been touched and copied to hardware as recently used.

Anyway, I wanted to get some real numbers out of this. I have more RAM than I need and the pagefile disabled except for the minimum required to write minidumps and keep the OS happy for Windows Store-style programs, so task managers can't lie too much about what's going on with a program's memory, especially a normal Windows program like the Python I'm running. Usually if I try to run SDXL using two KSampler nodes it OOMs on me, because DirectML has no way of telling Comfy how much memory it has left unless an allocation fails, but I just tried it with the Searge node pack and it maintained impressively low "real" GPU memory use the entire time and finished a 1536x1536 without complaints, so I gained something new to play with from your bug. Thanks!

System memory usage is still high though, higher than I expected. Python doesn't store its data structures efficiently, and safetensors files or checkpoints need to be loaded, converted into a big mess of weights and model layers, compiled to run on whatever your GPU is, etc., and all of that usually sticks around through the run because it might be needed again. So the ~12-13 GB of model files turns into 52 GB.
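
To see this blow-up directly, a rough sketch like the following (psutil and safetensors are assumed to be installed, and the checkpoint filename is only a placeholder) compares the raw size of the weights with how much the process's resident memory actually grows during loading:

```python
# Rough sketch: compare raw checkpoint size vs. how much host RAM loading it costs.
# The filename is a placeholder; psutil and safetensors are assumed to be installed.
import psutil
from safetensors.torch import load_file

proc = psutil.Process()
rss_before = proc.memory_info().rss

state_dict = load_file("sd_xl_base_1.0.safetensors")  # placeholder path

rss_after = proc.memory_info().rss
raw_bytes = sum(t.numel() * t.element_size() for t in state_dict.values())

print(f"raw weights:  {raw_bytes / 2**30:.1f} GiB")
print(f"RSS increase: {(rss_after - rss_before) / 2**30:.1f} GiB")
```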

Notice on the graphs that when the VAE decode completed and the GPU processing was over, memory immediately shot up; it needed another 3 GB for some reason. Not sure what that's about.

Here's what system memory usage looks like during SDXL:

[screenshot: memory_at_fail_point]

According to Process Explorer, Python is using 51.2 GB of private bytes and a 30 GB working set (measured at a slightly different point in the run, so it won't match exactly), although I think Task Manager combines them when it reports working set numbers. I'm not using very much memory outside of Python right now.
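
Those Process Explorer numbers can be cross-checked from inside Python itself. This is a hedged sketch assuming psutil is installed; the mapping to Process Explorer's columns is approximate.

```python
# On Windows, psutil's memory_info() exposes extra fields: 'rss' roughly
# corresponds to the working set and 'private' (when present) to private bytes.
import psutil

mem = psutil.Process().memory_info()
print(f"working set:   {mem.rss / 2**30:.2f} GiB")
print(f"private bytes: {getattr(mem, 'private', mem.vms) / 2**30:.2f} GiB")  # vms as a rough fallback
```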

As seen over at diffusers, where somebody hit the same error with the same 16 GB of RAM, your best option is to spend the small amount it costs to buy more memory. 16 GB is very low for a current computer. Windows will manage itself just fine with it; the OS isn't the resource hog people think it is. Unfortunately, 90% of software on Windows doesn't do memory management well, or at all, while it's running. Most of the people on that bug just upgraded; RAM is very cheap right now, at least to get up to a more reasonable amount for a modern computer. Based on the numbers I saw with SDXL, I'd get 64 GB. If you're running DDR5 and have 4 slots, you can't really populate the second bank without slowing everything down, so replacing the existing sticks might be better depending on how big the slowdown actually is, but you could get 80 GB that way. Also pay attention to that thread about letting Windows manage the swap file; it avoids having to guess the pagefile size.

Finally, if you have a GPU that supports it and resizable BAR is turned on, keep in mind that while it's in use the BAR area has to be mapped to a physical memory address, so the CPU has somewhere to copy (or instruct DMA from disk to load) data that the card can then pull as fast as it wants. That area isn't swappable, for obvious reasons, and can't be used for anything else.

NeedsMoar commented 1 year ago

Also, FYI, I just looked through the source of that custom upscale node; I've tried to use it before. It OOM'd on the GPU side almost instantly with an SD1.5 model attempting a 2x upscale to 1920x1080 with tiling turned on. It creates at least 2x the latent space of the final size plus >=2x the image space of the final size, along with a tensor of the final size as an intermediate for each tile on the CPU, so it will chew through system memory pretty fast as well. To make that a little worse, it overrides the torch garbage collector with an empty function containing a pass statement, which likely makes everything stick around until the node no longer exists.
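
For anyone curious, here is a stripped-down illustration of the pattern being described. It is not the node's actual code, just a sketch of why a no-op garbage-collection hook keeps per-tile intermediates alive for the whole run.

```python
# Sketch only: a cleanup hook replaced with a no-op means intermediates are
# never released early; they go away only when the owning objects do.
import gc
import torch

def torch_gc():
    pass  # the real hook would call gc.collect() and torch.cuda.empty_cache()

def process_tiles(num_tiles):
    kept = []
    for _ in range(num_tiles):
        tile_latent = torch.empty(1, 4, 128, 128)  # stand-in for a per-tile intermediate
        kept.append(tile_latent)
        torch_gc()  # does nothing, so nothing is freed between tiles
    return kept
```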

elphamale commented 1 year ago

> Also, FYI, I just looked through the source of that custom upscale node; I've tried to use it before. It OOM'd on the GPU side almost instantly with an SD1.5 model attempting a 2x upscale to 1920x1080 with tiling turned on. It creates at least 2x the latent space of the final size plus >=2x the image space of the final size, along with a tensor of the final size as an intermediate for each tile on the CPU, so it will chew through system memory pretty fast as well. To make that a little worse, it overrides the torch garbage collector with an empty function containing a pass statement, which likely makes everything stick around until the node no longer exists.

So, here's what I established: it is indeed a RAM problem, and 16 GB became a little too short for the workflow I'm using after some build. The workaround for now is to allow a 'system managed' pagefile, which grows uncontrollably to a ridiculous size over time. That lets me run SDXL + a 1.5 detailer + Ultimate SD Upscaler.
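
For watching the same thing on another machine, a tiny monitor along these lines (psutil assumed; on Windows, swap_memory() reflects the pagefile) shows how deep into the pagefile a run actually digs:

```python
# Minimal RAM / pagefile watcher (psutil assumed; run alongside the workflow).
import time
import psutil

while True:
    ram = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM {ram.percent:5.1f}%   pagefile used {swap.used / 2**30:6.1f} GiB")
    time.sleep(5)
```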

I may have been wrong to put that build number in the title: even though I try to update daily, the problem may have been caused by some other change prior to this build. I tried looking through the recent activity to see which change might have caused it, but I can't pinpoint one.

bash-j commented 1 year ago

I was looking into RAM usage last night. I tried creating a custom node to free up RAM, but I just couldn't get it to work. When a model is unloaded, it's moved to RAM, and I couldn't find a way to remove it from memory completely without causing other issues. If I tried deleting the model, the next time I ran the workflow I'd get an error like "None doesn't have...", which didn't seem to make sense: current_loaded_models starts off as an empty list, so why does it expect it to be populated on the second batch?
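
For reference, a rough sketch of the kind of node being described is below. It is not bash-j's actual code: the comfy.model_management names used (current_loaded_models, and entries with a model_unload() method) are the ones visible in the traceback earlier in the thread and may differ between builds, and as noted above this kind of cleanup can break the second run.

```python
# Hedged sketch of a "free RAM" passthrough node. It assumes it runs inside
# ComfyUI (so comfy.model_management is importable); the internals named here
# follow the traceback above and may not match the current codebase.
import gc
import torch
import comfy.model_management as mm

class FreeRAM:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"samples": ("LATENT",)}}

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "run"
    CATEGORY = "utils"

    def run(self, samples):
        # Unload everything comfy is tracking, then collect Python garbage.
        # Deleting the models outright is what led to the "None doesn't have..."
        # errors on the next batch.
        for loaded in list(mm.current_loaded_models):
            loaded.model_unload()
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        return (samples,)
```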