Stability-AI / stablediffusion

High-Resolution Image Synthesis with Latent Diffusion Models
MIT License

Streamlit SD-Upscale x4, CUDA out of memory. Tried to allocate 400.00 GiB #5

Open ryakr opened 1 year ago

ryakr commented 1 year ago

Normally a CUDA OOM is expected with smaller GPUs, but... 400 GiB? I don't think a GPU with that much memory exists, so this is obviously a bug. 512x512 input. It goes through every DDIM step before kaboom. Using a conda env made from the environment yaml. Running on a 4090 machine.

Full log:

Traceback (most recent call last):
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 556, in _run_script
    exec(code, module.__dict__)
  File "Z:\SD\SD_2.0\stablediffusion\scripts\streamlit\superresolution.py", line 170, in <module>
    run()
  File "Z:\SD\SD_2.0\stablediffusion\scripts\streamlit\superresolution.py", line 152, in run
    result = paint(
  File "Z:\SD\SD_2.0\stablediffusion\scripts\streamlit\superresolution.py", line 109, in paint
    x_samples_ddim = model.decode_first_stage(samples)
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "z:\sd\sd_2.0\stablediffusion\ldm\models\diffusion\ddpm.py", line 826, in decode_first_stage
    return self.first_stage_model.decode(z)
  File "z:\sd\sd_2.0\stablediffusion\ldm\models\autoencoder.py", line 90, in decode
    dec = self.decoder(z)
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "z:\sd\sd_2.0\stablediffusion\ldm\modules\diffusionmodules\model.py", line 631, in forward
    h = self.mid.attn_1(h)
  File "c:\users\------\miniconda3\envs\ldm\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "z:\sd\sd_2.0\stablediffusion\ldm\modules\diffusionmodules\model.py", line 191, in forward
    w_ = torch.bmm(q,k)  # b,hw,hw  w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
RuntimeError: CUDA out of memory. Tried to allocate 400.00 GiB (GPU 0; 23.99 GiB total capacity; 6.47 GiB already allocated; 0 bytes free; 17.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

kenrox commented 1 year ago

Same here; trying to allocate 2304 GiB (I checked my pockets, I don't have any to spare :P) when trying to upscale.

RuntimeError: CUDA out of memory. Tried to allocate 2304.00 GiB (GPU 0; 47.54 GiB total capacity; 10.90 GiB already allocated; 32.78 GiB free; 11.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

ewrfcas commented 1 year ago

How can I get the upscale model? This address is not usable now: https://huggingface.co/stabilityai/stable-diffusion-2-depth/resolve/main/x4-upscaler-ema.ckpt

0xdevalias commented 1 year ago

> How can I get the upscale model? This address is not usable now.

Presumably you want this repo?

With this file potentially?

ewrfcas commented 1 year ago

> How can I get the upscale model? This address is not usable now.
>
> Presumably you want this repo?
>
> With this file potentially?

Thanks!

ThibaultLSDC commented 1 year ago

Coming back to the original issue, it actually isn't surprising to me that you get such memory issues, especially if you don't have xformers (I'm assuming so). The traceback says the error occurs in "<>\stablediffusion\ldm\modules\diffusionmodules\model.py" at line 631, which is in the decoder. You're upsampling from 512x512, which means the decoder gets a 512x512 input and applies attention to it (at least once). The attention matrix thus computed is of size 512² x 512²; multiply that by 4 bytes per float32 and you get a theoretical ~275 GB tensor. Not quite the values mentioned above, but roughly the same order of magnitude. Hope this helps, and please tell me if I said something wrong :)
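
For reference, a quick back-of-the-envelope check of that figure (a standalone sketch, nothing repo-specific):

```python
# Memory for one vanilla attention matrix over an H x W feature map:
# the decoder materialises an (H*W) x (H*W) tensor of float32 scores.
def attention_matrix_gib(height: int, width: int, bytes_per_elem: int = 4) -> float:
    tokens = height * width
    return tokens * tokens * bytes_per_elem / 1024**3

print(f"{attention_matrix_gib(512, 512):.0f} GiB")   # ~256 GiB for a 512x512 map
print(f"{attention_matrix_gib(128, 128):.2f} GiB")   # ~1 GiB for a 128x128 map
```

That is the same order of magnitude as the allocations reported in the logs above.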

ryakr commented 1 year ago

> Coming back to the original issue, it actually isn't surprising to me that you get such memory issues, especially if you don't have xformers (I'm assuming so). The traceback says the error occurs in "<>\stablediffusion\ldm\modules\diffusionmodules\model.py" at line 631, which is in the decoder. You're upsampling from 512x512, which means the decoder gets a 512x512 input and applies attention to it (at least once). The attention matrix thus computed is of size 512² x 512²; multiply that by 4 bytes per float32 and you get a theoretical ~275 GB tensor. Not quite the values mentioned above, but roughly the same order of magnitude. Hope this helps, and please tell me if I said something wrong :)

I was using xformers! Also, isn't this based on similar model info to LDSR? I have been using that a lot to upscale my SD gens, 512x512 at least, without xformers. So I have two thoughts on why this is happening:

  1. The code isn't really functioning right, since this only happens after all the generation steps, so when decoding the image it's doing something funky.
  2. The model is basically useless and was only made to go from 128->512 for some odd reason.

igorperic17 commented 1 year ago

Try adding this to your .bashrc file:

export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128

And then source it by running:

source ~/.bashrc

ThibaultLSDC commented 1 year ago

Well, for some reason I couldn't find the config file for LDSR, so I downloaded the model ckpt and checked the 'state_dict', and there does not seem to be attention in the decoder. In the Decoder class from "<>\stablediffusion\ldm\modules\diffusionmodules\model.py", there is the attn_type kwarg, which I guess is set to 'none' in LDSR? But given the traceback you got, there is definitely some attention going on. Btw, two things I did not say in my first message:

  1. I did manage to infer on higher res thanks to xformers. On a 3090 I was able to do 256->1024. 512->2048 was way too much for me.
  2. I'm running the gradio script, no idea about the differences with the streamlit one.

Looking into SDv2's config file, there is this comment: `# attn_type: "vanilla-xformers" this model needs efficient attention to be feasible on HR data, also the decoder seems to break in half precision (UNet is fine though)`

Your xformers setup might not work? Maybe try checking the value of XFORMERS_IS_AVAILABLE in "<>\stablediffusion\ldm\modules\diffusionmodules\model.py". Hope this helps.

Edit: checking SDv2's 'state_dict', I do find attention layers in the decoder this time.
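
As a basic sanity check (generic Python, not part of the repo's scripts), one can also verify that xformers is even importable in the conda env that Streamlit runs in:

```python
# Minimal check that xformers is installed and importable in the active env.
# If this import fails, the decoder falls back to the vanilla,
# quadratic-memory attention discussed above.
try:
    import xformers
    import xformers.ops
    print("xformers available:", getattr(xformers, "__version__", "unknown version"))
except ImportError as exc:
    print("xformers NOT available:", exc)
```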

ryakr commented 1 year ago

Xformers does work; I was using it with the main sampler since it is required for 768 on a 4090, and it also lists the xformers overwrites in the console. Can't do any .bashrc changes since this is a Windows machine; not everyone runs Linux.

thomasf1 commented 1 year ago

Mine tried to allocate 900.00 GiB after reaching 100%. No difference with or without Xformers for me. Running it on Colab with the following command:

`!python scripts/gradio/superresolution.py configs/stable-diffusion/x4-upscaling.yaml '/content/drive/MyDrive/AI/models/x4-upscaler-ema.ckpt'`

PS: txt2img works fine with the setup

richservo commented 1 year ago

This is actually an issue with decode_first_stage. Any high-res image uses an excessive amount of VRAM. I can actually encode a tensor of 960x704 but can't decode the result since I run out of VRAM. This is also an issue with img2img: after about 2048 it can still encode, but decoding needs WAY too much VRAM. I was considering testing using the LDSR ckpt/decode to see if I can get an image out that way. It just sucks that it would have to load two models to do it.
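
In case it's useful, here is a rough sketch of one possible stopgap (my own assumption, not something from the repo's scripts): catch the OOM from decode_first_stage and retry the first-stage decode on the CPU. It only trades GPU memory for system RAM and time, so it may help for moderately large img2img cases but not when the attention matrix itself runs to hundreds of GiB.

```python
import torch

def decode_with_cpu_fallback(model, samples):
    # Assumes `model` is the loaded LatentDiffusion-style model from this repo
    # (exposing decode_first_stage and first_stage_model) and that the
    # autoencoder weights are float32 and fit in system RAM.
    try:
        return model.decode_first_stage(samples)
    except RuntimeError:  # CUDA OOM surfaces as a RuntimeError
        torch.cuda.empty_cache()
        model.first_stage_model.to("cpu")
        decoded = model.decode_first_stage(samples.float().cpu())
        model.first_stage_model.to("cuda")
        return decoded
```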

8ke8 commented 1 year ago

It also attempted to allocate 2034 GB when upscaling a 768x768 image. My reading of the above discussion is that this is not surprising given the model architecture.

I guess I'm punting on superresolution until I have time to dig a little deeper into it.

alexandercommon commented 1 year ago

As a professional developer, I don't release something if it doesn't work. Unless your intended users are people running massive server farms.