XavierXiao / Dreambooth-Stable-Diffusion

Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) with Stable Diffusion
MIT License

RuntimeError: CUDA out of memory with RTX 3090 (24 GB VRAM) #67

Open Tuxius opened 2 years ago

Tuxius commented 2 years ago

Following the instructions exactly, I get an out-of-memory error despite having 24 GB of VRAM available:

  File "Y:\221009_dreambooth\ldm\modules\attention.py", line 180, in forward
    sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
  File "C:\Users\frank\anaconda3\envs\dreambooth\lib\site-packages\torch\functional.py", line 327, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 24.00 GiB total capacity; 22.74 GiB already allocated; 0 bytes free; 23.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
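The error text itself suggests one mitigation: capping the allocator's split size to reduce fragmentation via `PYTORCH_CUDA_ALLOC_CONF`. On Windows (cmd) that would look something like the following; the value 128 is illustrative, not a recommendation from this repo:

```shell
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

Set this in the same shell before launching training; under Linux/WSL the equivalent is `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128`.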

I tried some changes in v1-finetune_unfrozen.yaml (e.g. reducing num_workers from 2 to 1), but saw no improvement.

Has anybody successfully run this under Windows with 24 GB VRAM?
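For context, the allocation above fails inside the attention einsum, which materializes the full query-key score matrix in one shot. Forks that fit in less VRAM typically slice this computation over the query dimension ("attention slicing"). A minimal NumPy sketch of the idea, not this repo's actual code; shapes and names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliced_attention(q, k, v, scale, slice_size=64):
    # Process query rows in slices: only a (b, slice_size, j) score
    # matrix is live at any moment instead of the full (b, i, j),
    # which is where the big allocation in attention.py happens.
    b, i, d = q.shape
    out = np.empty((b, i, v.shape[2]), dtype=v.dtype)
    for start in range(0, i, slice_size):
        sl = slice(start, start + slice_size)
        sim = np.einsum('bid,bjd->bij', q[:, sl], k) * scale
        out[:, sl] = softmax(sim, axis=-1) @ v
    return out
```

The peak memory of the score matrix drops by roughly a factor of `i / slice_size`, at the cost of a Python-level loop.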

mengen-li commented 2 years ago

You can run it on Windows via WSL2: https://www.youtube.com/watch?v=w6PTviOCYQY&t=15s

wyang22 commented 2 years ago

Is it possible to run the training on 11 GB VRAM?

dminGod commented 2 years ago

I was getting a lot of out-of-memory errors on a 24 GB 3090, so I ended up using a bigger server. I saw it consume up to 28 GB of RAM, and it went up to 30 GB at one point.

It's possible some config needs to be tweaked while running, but I'm not sure 🤷🏼‍♂️

Tuxius commented 2 years ago

@ChinaArvin: Thank you, yes, following these instructions it now works nicely, well below 24 GB. However, having to use WSL feels like a workaround, even though I enjoy the Linux command line. Shouldn't it be possible to get this running under native Windows?

@wyang22: If you sacrifice some settings, even less than 11 GB under WSL is possible; just follow the instructions in the video.

dminGod commented 2 years ago

I finally ended up using this:

https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth

fr34kyn01535 commented 1 year ago

[screenshot] It does indeed not work with a 3090 on Windows 11, but it runs fine under WSL on the same machine with the same (default) config. It must be a bug on Windows, then.

titusfx commented 1 year ago

@wyang22 see https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth — you have several configurations there:

Use the table below to choose the best flags based on your memory and speed requirements. Tested on Tesla T4 GPU.

| fp16 | train_batch_size | gradient_accumulation_steps | gradient_checkpointing | use_8bit_adam | GB VRAM usage | Speed (it/s) |
| ---- | ---------------- | --------------------------- | ---------------------- | ------------- | ------------- | ------------ |
| fp16 | 1 | 1 | TRUE  | TRUE  | 9.92  | 0.93 |
| no   | 1 | 1 | TRUE  | TRUE  | 10.08 | 0.42 |
| fp16 | 2 | 1 | TRUE  | TRUE  | 10.4  | 0.66 |
| fp16 | 1 | 1 | FALSE | TRUE  | 11.17 | 1.14 |
| no   | 1 | 1 | FALSE | TRUE  | 11.17 | 0.49 |
| fp16 | 1 | 2 | TRUE  | TRUE  | 11.56 | 1    |
| fp16 | 2 | 1 | FALSE | TRUE  | 13.67 | 0.82 |
| fp16 | 1 | 2 | FALSE | TRUE  | 13.7  | 0.83 |
| fp16 | 1 | 1 | TRUE  | FALSE | 15.79 | 0.77 |
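As a concrete illustration, the lowest-memory row of the table (fp16, batch size 1, gradient checkpointing, 8-bit Adam, ~9.92 GB) corresponds to a launch command along these lines for that repo's train_dreambooth.py. The model name, prompt, paths, and step counts below are placeholders, not values from this thread:

```shell
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --instance_data_dir="./instance_images" \
  --output_dir="./dreambooth_out" \
  --instance_prompt="a photo of sks person" \
  --resolution=512 \
  --mixed_precision=fp16 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=5e-6 \
  --max_train_steps=800
```

Note that `--use_8bit_adam` additionally requires the bitsandbytes package to be installed.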
AntouanK commented 1 year ago

Whatever flags I use, I always get the CUDA out-of-memory error. How are you all running this? Can anyone post an example? I'm trying it on a 4090 with 24 GB.

jbohnslav commented 1 year ago

I'm also having OOM errors with a 24 GB 3090. Batch size is set to 1, and I even set the precision flag on the Trainer to 16.

htsh commented 1 year ago

Did anyone ever find a solution? I am also getting this error on a 3090 Ti.

dminGod commented 1 year ago

> any flag I use, I always get the CUDA out of memory error. How are you all using this? Can anyone post an example? I'm trying it on a 4090 with 24GB

I haven't tried with this repo, but if you are trying to train a 768 model and don't have xformers installed correctly, it will go OOM. Training the 768 model hovers around 21 GB of VRAM. I think the 512 models should train fine.

schematical commented 1 year ago

> I finally ended up using this:
>
> https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth

Yesss!! This finally worked. I am running on 8 GB and finally got it to train using the info in this section: https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth#training-on-a-8-gb-gpu

DeepSpeed was the final piece that I needed.

Best of luck!

nhatItsforce commented 1 year ago

> @wyang22 see https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth — you have several configurations there:
>
> Use the table below to choose the best flags based on your memory and speed requirements. Tested on Tesla T4 GPU.

@titusfx Where do I edit these configurations, or where do I pass them in?