comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

Pixelated output or high VRAM usage with AMD GPU img2video workflow #3149

Open klaasvk opened 6 months ago

klaasvk commented 6 months ago

There seems to be a weird issue with the card / setup / ROCm version I'm using, ranging from allocating tremendous amounts of VRAM (40 GB) to other strange behavior at the KSampler or upscaler.

I'm trying to use this very popular workflow, which a lot of people have gotten running even on 6-8 GB VRAM cards: https://civitai.com/models/335070/simple-lcm-img2vid-workflow-or-comfyui


This is the Comfy workflow generating my result: workflow (5).json

The issue is this: when I lower the frame count or the resolution, I get really weird / off outputs. Sometimes it's just pixels, and other times it's a spacious-looking version of my init with some pixels. Then when I use a slightly higher frame count or resolution, which even my 8 GB card should be able to handle, it tries to allocate huge amounts of VRAM, looking like some kind of memory leak (which might be because of the old ROCm version?).

I'm using the RX 5700 XT, which is only unofficially supported by ROCm. I used this method to get Comfy running: https://github.com/comfyanonymous/ComfyUI/discussions/1119 and I launch it on Linux with: `HSA_OVERRIDE_GFX_VERSION=10.3.0 python main.py --force-fp32 --disable-smart-memory --novram --use-split-cross-attention`. It runs some workflows, even SVD ones, but once it comes to something with LCM, IPAdapters and LoRAs like the one I want to run, it doesn't work.

An example of one such error log: Error occurred when executing KSampler (Efficient):

HIP out of memory. Tried to allocate 9.66 GiB (GPU 0; 7.98 GiB total capacity; 2.91 GiB already allocated; 4.90 GiB free; 3.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

File "/home/barryp/ComfyUIF/execution.py", line 151, in recursive_execute output_data, output_ui = get_output_data(obj, input_data_all)
File "/home/barryp/ComfyUIF/execution.py", line 81, in get_output_data return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
File "/home/barryp/ComfyUIF/execution.py", line 74, in map_node_over_list results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
File "/home/barryp/ComfyUIF/custom_nodes/efficiency-nodes-comfyui/efficiency_nodes.py", line 713, in sample samples, images, gifs, preview = process_latent_image(model, seed, steps, cfg, sampler_name, scheduler,
File "/home/barryp/ComfyUIF/custom_nodes/efficiency-nodes-comfyui/efficiency_nodes.py", line 601, in process_latent_image samples = KSampler().sample(latent_upscale_model, hires_seed, hires_steps, cfg, sampler_name, scheduler,
File "/home/barryp/ComfyUIF/nodes.py", line 1369, in sample return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
File "/home/barryp/ComfyUIF/nodes.py", line 1339, in common_ksampler samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
File "/home/barryp/ComfyUIF/custom_nodes/ComfyUI-AnimateDiff-Evolved/animatediff/sampling.py", line 365, in motion_sample latents = wrap_function_to_inject_xformers_bug_info(orig_comfy_sample)(model, noise, *args, **kwargs)
File "/home/barryp/ComfyUIF/custom_nodes/ComfyUI-AnimateDiff-Evolved/animatediff/utils_model.py", line 377, in wrapped_function return function_to_wrap(*args, **kwargs)
File "/home/barryp/ComfyUIF/comfy/sample.py", line 100, in sample samples = sampler.sample(noise, positive_copy, negative_copy, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
File "/home/barryp/ComfyUIF/comfy/samplers.py", line 705, in sample return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
File "/home/barryp/ComfyUIF/comfy/samplers.py", line 610, in sample samples = sampler.sample(model_wrap, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
File "/home/barryp/ComfyUIF/comfy/samplers.py", line 548, in sample samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
File "/home/barryp/ComfyUIF/sdxl/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context return func(*args, **kwargs)
File "/home/barryp/ComfyUIF/comfy/k_diffusion/sampling.py", line 745, in sample_lcm denoised = model(x, sigmas[i] * s_in, **extra_args)
File "/home/barryp/ComfyUIF/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs)
File "/home/barryp/ComfyUIF/comfy/samplers.py", line 286, in forward out = self.inner_model(x, sigma, cond=cond, uncond=uncond, cond_scale=cond_scale, model_options=model_options, seed=seed)
File "/home/barryp/ComfyUIF/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs)
File "/home/barryp/ComfyUIF/comfy/samplers.py", line 273, in forward return self.apply_model(*args, **kwargs)
File "/home/barryp/ComfyUIF/comfy/samplers.py", line 270, in apply_model out = sampling_function(self.inner_model, x, timestep, uncond, cond, cond_scale, model_options=model_options, seed=seed)
File "/home/barryp/ComfyUIF/custom_nodes/ComfyUI-AnimateDiff-Evolved/animatediff/sampling.py", line 407, in evolved_sampling_function cond_pred, uncond_pred = sliding_calc_cond_uncond_batch(model, cond, uncond, x, timestep, model_options)
File "/home/barryp/ComfyUIF/custom_nodes/ComfyUI-AnimateDiff-Evolved/animatediff/sampling.py", line 519, in sliding_calc_cond_uncond_batch sub_cond_out, sub_uncond_out = comfy.samplers.calc_cond_uncond_batch(model, sub_cond, sub_uncond, sub_x, sub_timestep, model_options)
File "/home/barryp/ComfyUIF/comfy/samplers.py", line 224, in calc_cond_uncond_batch output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
File "/home/barryp/ComfyUIF/comfy/model_base.py", line 96, in apply_model model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds).float()
File "/home/barryp/ComfyUIF/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs)
File "/home/barryp/ComfyUIF/custom_nodes/SeargeSDXL/modules/custom_sdxl_ksampler.py", line 70, in new_unet_forward x0 = old_unet_forward(self, x, timesteps, context, y, control, transformer_options, **kwargs)
File "/home/barryp/ComfyUIF/comfy/ldm/modules/diffusionmodules/openaimodel.py", line 850, in forward h = forward_timestep_embed(module, h, emb, context, transformer_options, time_context=time_context, num_video_frames=num_video_frames, image_only_indicator=image_only_indicator)
File "/home/barryp/ComfyUIF/custom_nodes/ComfyUI-AnimateDiff-Evolved/animatediff/sampling.py", line 104, in forward_timestep_embed x = layer(x, context, transformer_options)
File "/home/barryp/ComfyUIF/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs)
File "/home/barryp/ComfyUIF/comfy/ldm/modules/attention.py", line 633, in forward x = block(x, context=context[i], transformer_options=transformer_options)
File "/home/barryp/ComfyUIF/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs)
File "/home/barryp/ComfyUIF/comfy/ldm/modules/attention.py", line 460, in forward return checkpoint(self._forward, (x, context, transformer_options), self.parameters(), self.checkpoint)
File "/home/barryp/ComfyUIF/comfy/ldm/modules/diffusionmodules/util.py", line 191, in checkpoint return func(*inputs)
File "/home/barryp/ComfyUIF/comfy/ldm/modules/attention.py", line 520, in _forward n = self.attn1(n, context=context_attn1, value=value_attn1)
File "/home/barryp/ComfyUIF/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl return forward_call(*input, **kwargs)
File "/home/barryp/ComfyUIF/comfy/ldm/modules/attention.py", line 412, in forward out = optimized_attention(q, k, v, self.heads)
File "/home/barryp/ComfyUIF/comfy/ldm/modules/attention.py", line 282, in attention_split raise e
File "/home/barryp/ComfyUIF/comfy/ldm/modules/attention.py", line 256, in attention_split s1 = einsum('b i d, b j d -> b i j', q[:, i:end].float(), k.float()) * scale
File "/home/barryp/ComfyUIF/sdxl/lib/python3.10/site-packages/torch/functional.py", line 378, in einsum return _VF.einsum(equation, operands) # type: ignore[attr-defined]

I've found some other people with similar AMD cards having this exact issue. I'm curious whether there is a quick fix for it, or whether it's a harder hardware/software problem, like the old ROCm version causing memory leaks or something? Thanks <3
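For reference, the allocator tuning the OOM message above suggests can be set as an environment variable on the same command line; the 512 MB split size below is just an illustrative value, not one I've verified helps here:

```
# Allocator tuning suggested in the HIP OOM message; 512 is an example value only
PYTORCH_HIP_ALLOC_CONF=max_split_size_mb:512 HSA_OVERRIDE_GFX_VERSION=10.3.0 python main.py --force-fp32 --disable-smart-memory --novram --use-split-cross-attention
```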

NeedsMoar commented 6 months ago

Ditch `--force-fp32` and `--use-split-cross-attention`. AMD cards since at least Vega run fp16 at double speed and half the memory using packed vector instructions, and LCM models are fine without fp32 forced on. I think `--force-fp32` is only really needed for AMD cards on the Windows DirectML backend, since nobody ever bothered implementing fp16 there.
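Concretely, that just means dropping those two flags from the command in the original post; a minimal sketch, keeping the rest of the flags as they were:

```
# Same launch as before, minus --force-fp32 and --use-split-cross-attention,
# so Comfy runs fp16 and picks its own attention implementation
HSA_OVERRIDE_GFX_VERSION=10.3.0 python main.py --disable-smart-memory --novram
```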

Also, AFAICT torch has made autocast into a generic thing that works on any backend, since the floating-point behavior / list of functions that require fp32 precision was never anything NVidia-specific; they had just implemented autocast in a way that would intentionally crash on everything else.

Since xformers' memory-efficient attention isn't an option without building it yourself and hoping the code AMD upstreamed for flash / memory-efficient attention on the MI250X actually does anything on a GPU without matmul cores, you want to be running PyTorch's SDP attention (its built-in memory-efficient / flash attention), which should be available in v2.2+ on Linux and from at least v2.2.2 on Windows. Comfy will auto-select it if you leave it alone.

A prebuilt version of xformers might still be worth trying; not all of it is written for CUDA.

The advanced attention methods probably don't help much on hardware without real matrix cores, but memory-efficient attention should: it's the difference between attention memory that grows quadratically with sequence length (and, in these video workflows, the frame batch feeds into that) and memory that grows roughly linearly. It's pretty much the only reason anyone with a consumer card can run an SVD model or lots of other animation / large models. I couldn't reliably generate and upscale a single image past 1536x1536 without the risk of OOM on a 7900XTX on the DirectML backend.
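For a rough sense of scale (illustrative numbers only, not taken from the workflow above): at 512x512 the latent is 64x64 = 4096 tokens per frame, and a single naive fp32 attention matrix across 8 heads and a 16-frame batch already comes to about 8 GiB, the same order of magnitude as the failed 9.66 GiB allocation in the log:

```
# Illustrative arithmetic only: tokens^2 * 4 bytes (fp32) * heads * frames
python -c "t = 64 * 64; print(t * t * 4 * 8 * 16 / 2**30, 'GiB')"
```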

That particular workflow requires a silly number of loaded models between the base LCM model, the CLIP text encoder, the CLIP vision model, the motion model, and a motion LoRA... It also uses the Efficiency Nodes, which are designed to keep the models they use in VRAM more or less forever if possible; by the time you hit the KSampler you've loaded (and, with fp32 forced, doubled the size of) every model up to that point and will exceed memory from that alone.

Split attention was one of the worst ways to run on DirectML IIRC, for different reasons, and I don't think it's particularly good in general. I'd either manually select PyTorch cross attention or just let Comfy auto-select as described above (see the sketch below).
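If you'd rather pin it explicitly than trust auto-selection, ComfyUI has a launch flag for that; a sketch built on the command above:

```
# Explicitly select PyTorch's scaled-dot-product attention instead of split attention
HSA_OVERRIDE_GFX_VERSION=10.3.0 python main.py --disable-smart-memory --novram --use-pytorch-cross-attention
```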

Finally, I noticed in the workflow you uploaded that you're outputting the latents to a ProRes .mov file... at 4:2:2 subsampled 10bpc HQ quality. I have no idea what the reasoning behind this is; without a Mac, playing it back in anything will be a pain. In any case, I'd suggest enabling the CPU VAE decode option to see if that helps too. The latent decoding process isn't terribly slow on any modern CPU since it only needs to run once per latent.
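Assuming your ComfyUI build includes it, the simplest way to try that globally is the --cpu-vae launch flag; a sketch:

```
# Decode latents on the CPU so the GPU only has to hold the sampling models
HSA_OVERRIDE_GFX_VERSION=10.3.0 python main.py --disable-smart-memory --novram --use-pytorch-cross-attention --cpu-vae
```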

In the end, though, you're probably going to reach the same conclusion I did: unless you're just gaming and running OpenCL / Vulkan software like the command-line versions of ESRGAN, AMD cards that humans can afford are total crap at running ML models, outside of Shark (which is far too limited to do anything cool with) or manually optimizing torch models into ONNX with Olive and running them from the command line. To make either of those methods fast, the model has to be built with fixed input sizes for every resolution and batch-size combination you plan on running, which will eat your disk space within a couple of months if you're just trying new LoRAs.

If I could afford it, I'd search eBay for the guy who mods old NVidia 2080 Ti cards from 11GB to 22GB of VRAM and sells them for $500. That's the only cheap way to get a known-working card with enough RAM to run all the common workflows. AMD cards are good at running games, so you should still be able to sell one as old as the 5700XT to someone looking for a cheap card; if you're lucky it will pay for most of the 2080 Ti. Gamers are mostly morons, which is why the 3090 still sells for the same price as the 4090 and why people still build systems with 16GB of RAM. I haven't checked the dumb urban legends about AMD cards, but if that one has some silly special properties attributed to it, it may sell for more than you'd expect. Or you might just find somebody who paid for a hard-pipe custom water loop built around a waterblock that only fits that card, and who would rather pay the $500 for a working one they can hopefully install themselves than get the entire loop reworked to fit a larger model that needs a new $350 waterblock, and then spend 3000 hours carefully adjusting their RGB LEDs to show off the newer card, so they can stay competitive in Counter-Strike.