Closed arandomgoodguy closed 2 years ago
The description Lambda provided is actually incorrect when it says 16GB of VRAM is enough; even 24GB of VRAM isn't enough to run the model. I can personally confirm this from attempts on a 24GB A10G, where I got OOMs even at batch size 1 with 64x64 images once validation kicked in. When I used an RTX 6000 the same way they did, everything worked fine.
V100s typically only have 16GB of VRAM, so you wouldn't be able to finetune unless you get a 32GB V100 SKU, which I don't know if Colab has.
If you wanted to start training totally from scratch, you could technically modify the parameters of the pokemon.yaml, but starting from scratch won't get you anywhere - as the original checkpoint was extremely expensive to train, and you would need a much larger dataset than the one provided and a very, very long amount of training time. Unfortunately, when it comes to finetuning you are pretty much stuck with the model they give you.
The only memory optimization that could maybe work is moving the model to FP16/mixed precision, but my lazy/naive attempts at doing this through the PyTorch Lightning Trainer configuration fail, as there are many custom modules here that assume FP32 during their forward pass and would need fixing.
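For what it's worth, the naive attempt amounts to something like this. Sketch only: `trainer_kwargs` here stands in for whatever dict `main.py` actually builds from the lightning config, which is an assumption on my part.

```python
# Hedged sketch of passing mixed precision to the PyTorch Lightning Trainer.
# `trainer_kwargs` is a stand-in for the kwargs the training script assembles;
# the exact plumbing in this repo may differ.
trainer_kwargs = {
    "gpus": 1,
    # precision=16 asks Lightning to run forward/backward under automatic
    # mixed precision; custom modules that hard-code FP32 in their forward
    # pass will still break and need per-module fixes, as noted above.
    "precision": 16,
}

# In the real script this would then be: trainer = Trainer(**trainer_kwargs)
```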
Hi, thanks for your experience.
I can actually get an A100 on Colab sometimes when I use standard RAM, but if I adjust the setting to high RAM, Colab switches me to a V100 or P100.
That is to say, if I can get an A100, training and validation should be fine, but the notebook would crash before that since downloading the models needs high RAM. So now the problem is how to get an A100 and a high-RAM instance simultaneously in order to do the whole thing on Colab. Is my understanding right?
I could barely fit it into my 24GB GPU with batch size of 1 AND image logging disabled.
@my-other-github-account sorry my description wasn't very good. All I knew was that 16GB wasn't enough; if you can believe it, I didn't have a GPU with less than 40GB to test on! 😅
I'll update the description to be >24GB!
To update: yeah, I can also get it to work with 24GB (barely), but only if image logging and validation are disabled at BS1.
@my-other-github-account Thanks for sharing! I'd like to know how you disabled image logging and validation. And what does "BS1" mean?
@my-other-github-account +1, I am also having a problem training a custom dataset from scratch on a 3090 with 24GB of VRAM.
It keeps giving me an OOM error at param.clone().
I am also wondering how to disable image logging, and what "BS1" is.
Any other suggestions would be appreciated.
File "/root/anaconda3/envs/ldm/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/home/user99/stable-diffusion/ldm/models/diffusion/ddpm.py", line 174, in ema_scope
self.model_ema.store(self.model.parameters())
File "/home/user99/stable-diffusion/ldm/modules/ema.py", line 62, in store
self.collected_params = [param.clone() for param in parameters]
File "/home/user99/stable-diffusion/ldm/modules/ema.py", line 62, in <listcomp>
self.collected_params = [param.clone() for param in parameters]
RuntimeError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 23.70 GiB total capacity; 21.72 GiB already allocated; 34.81 MiB free; 21.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
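The error message above itself suggests one thing to try: setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF to reduce allocator fragmentation. A minimal sketch (128 is an arbitrary example value, not a recommendation from this thread):

```python
import os

# Must be set before the CUDA caching allocator is first used, i.e. before
# any tensor touches the GPU; setting it before importing torch is the
# safe way to guarantee that.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# import torch  # the torch import (or first CUDA use) comes after this
```

This only helps when reserved memory far exceeds allocated memory, as the error text says; it won't shrink the model itself.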
Me too. Did you get reasonable results with batch_size 1?
I could barely fit it into my 24GB GPU with batch size of 1 AND image logging disabled.
It didn't help even though I set the config to batch_size 1. I found a slower but larger GPU machine to run SD. 😢
I also disable the image logger to get it to fit in my 24GB 3090.
I could barely fit it into my 24GB GPU with batch size of 1 AND image logging disabled.
Hi, excuse me, how do I modify the code to disable image logging and validation?
Comment out the logger section in the config file.
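Something like this, in the yaml you pass to training. The key names below are from memory, so treat them as assumptions and match them against your actual config:

```yaml
lightning:
  callbacks:
    # Commenting out this whole callback disables image logging:
    # image_logger:
    #   target: main.ImageLogger
    #   params:
    #     batch_frequency: 750
    #     max_images: 4
```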
Got it, thanks
Hi, very great work!
I just followed the instructions in pokemon_finetune.ipynb and tried to run it on Colab with one Tesla V100 and high RAM,
with the settings
BATCH_SIZE = 1
N_GPUS = 1
ACCUMULATE_BATCHES = 1
It did start and print the output below, so I think the setup itself is okay:
Epoch 0: 0% 0/833 [00:00<00:00, 5637.51it/s] Summoning checkpoint. tcmalloc: large alloc 1258086400 bytes == 0x7fa9e73c0000 @ ...
until RuntimeError: CUDA out of memory occurred.
Since I am already using the best hardware Colab offers, I wonder whether there is a way or trick to make it executable on Colab Pro+?
Say, a memory-saving trick, or using another smaller but similar model instead?
Many thanks !!
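One memory-related knob already in that notebook is gradient accumulation. A minimal sketch of how those three variables combine, assuming ACCUMULATE_BATCHES maps to Lightning's accumulate_grad_batches (an assumption about the notebook's wiring):

```python
# Sketch: effective batch size under gradient accumulation. Raising
# ACCUMULATE_BATCHES grows the batch the optimizer effectively sees
# while the per-step VRAM cost stays that of BATCH_SIZE.
BATCH_SIZE = 1          # per-step batch; keep at 1 to minimize VRAM
N_GPUS = 1
ACCUMULATE_BATCHES = 4  # gradients summed over 4 steps per optimizer step

effective_batch = BATCH_SIZE * N_GPUS * ACCUMULATE_BATCHES
print(effective_batch)  # 4
```

This trades training speed for memory; it does not reduce the footprint of the model weights themselves, so it won't fix an OOM that happens at batch size 1.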