Closed arandomgoodguy closed 2 years ago
The description Lambda provided is actually incorrect when it says 16GB of VRAM is enough; even 24GB of VRAM isn't enough to run the model. I can personally confirm this from attempts on a 24GB A10G, where I got OOMs even at batch size 1 with 64x64 images once validation kicked in. When I used an RTX 6000 the same way they did, everything worked fine.
V100s typically only have 16GB of VRAM, so you wouldn't be able to finetune unless you get a 32GB V100 SKU, which I don't know if Colab has.
If you wanted to start training totally from scratch, you could technically modify the parameters of the pokemon.yaml, but starting from scratch won't get you anywhere - as the original checkpoint was extremely expensive to train, and you would need a much larger dataset than the one provided and a very, very long amount of training time. Unfortunately, when it comes to finetuning you are pretty much stuck with the model they give you.
The only memory optimization that could maybe work is moving the model to FP16/mixed precision, but my lazy/naive attempts at doing this through the PyTorch Lightning Trainer configuration fail, as there are many custom modules here that assume FP32 during their forward pass and would need fixing.
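For what it's worth, the naive attempt amounts to something like this. Sketch only: `trainer_kwargs` here stands in for whatever dict `main.py` actually builds from the lightning config, which is an assumption on my part.

```python
# Hedged sketch of passing mixed precision to the PyTorch Lightning Trainer.
# `trainer_kwargs` is a stand-in for the kwargs the training script assembles;
# the exact plumbing in this repo may differ.
trainer_kwargs = {
    "gpus": 1,
    # precision=16 asks Lightning to run forward/backward under automatic
    # mixed precision; custom modules that hard-code FP32 in their forward
    # pass will still break and need per-module fixes, as noted above.
    "precision": 16,
}

# In the real script this would then be: trainer = Trainer(**trainer_kwargs)
```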
Hi, thanks for your experience.
I can actually get an A100 on Colab sometimes when I use standard RAM, but if I adjust the setting to high RAM, Colab switches me to a V100 or P100.
That is to say, if I can get an A100, training and validation should be fine, but the notebook would crash before that since downloading the models needs high RAM. So now the problem is how to get an A100 and a high-RAM instance simultaneously in order to do the whole thing on Colab. Is my understanding right?
I could barely fit it into my 24GB GPU with batch size of 1 AND image logging disabled.
@my-other-github-account sorry my description wasn't very good. All I knew was that 16GB wasn't enough; if you can believe it, I didn't have a GPU with less than 40GB to test on! 😅
I'll update the description to be >24GB!
To update: yeah, I can also get it to work with 24GB (barely), but only if image logging and validation are disabled at BS1.
@my-other-github-account Thanks for sharing! I'd like to know how you disabled image logging and validation. And what does "BS1" mean?
@my-other-github-account +1, I am also having a problem training a custom dataset from scratch on a 3090 with 24GB of VRAM.
It keeps giving me an OOM error at param.clone().
I am also wondering how to disable image logging, and what "BS1" is.
Any other suggestions would be appreciated.
File "/root/anaconda3/envs/ldm/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/home/user99/stable-diffusion/ldm/models/diffusion/ddpm.py", line 174, in ema_scope
self.model_ema.store(self.model.parameters())
File "/home/user99/stable-diffusion/ldm/modules/ema.py", line 62, in store
self.collected_params = [param.clone() for param in parameters]
File "/home/user99/stable-diffusion/ldm/modules/ema.py", line 62, in <listcomp>
self.collected_params = [param.clone() for param in parameters]
RuntimeError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 23.70 GiB total capacity; 21.72 GiB already allocated; 34.81 MiB free; 21.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
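The error message above itself suggests one thing to try: setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF to reduce allocator fragmentation. A minimal sketch (128 is an arbitrary example value, not a recommendation from this thread):

```python
import os

# Must be set before the CUDA caching allocator is first used, i.e. before
# any tensor touches the GPU; setting it before importing torch is the
# safe way to guarantee that.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# import torch  # the torch import (or first CUDA use) comes after this
```

This only helps when reserved memory far exceeds allocated memory, as the error text says; it won't shrink the model itself.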
Me too. Did you get reasonable results with batch_size 1?
I could barely fit it into my 24GB GPU with batch size of 1 AND image logging disabled.
It didn't help even though I set the config to batch_size 1. I found a slower but larger GPU machine to run SD. 😢
I also disable the image logger to get it to fit in my 24GB 3090.
I could barely fit it into my 24GB GPU with batch size of 1 AND image logging disabled.
Hi, excuse me, how do I modify the code to disable image logging and validation?
Comment out the logger section in the config file.
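Something like this, in the yaml you pass to training. The key names below are from memory, so treat them as assumptions and match them against your actual config:

```yaml
lightning:
  callbacks:
    # Commenting out this whole callback disables image logging:
    # image_logger:
    #   target: main.ImageLogger
    #   params:
    #     batch_frequency: 750
    #     max_images: 4
```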
Got it, thanks
Hi, very great work!
I just followed the instructions in pokemon_finetune.ipynb and tried to run it on Colab with one Tesla V100 and high RAM,
with the settings
BATCH_SIZE = 1
N_GPUS = 1
ACCUMULATE_BATCHES = 1
It did start and print the output below, so I think the setup itself is okay:
Epoch 0: 0% 0/833 [00:00<00:00, 5637.51it/s] Summoning checkpoint. tcmalloc: large alloc 1258086400 bytes == 0x7fa9e73c0000 @ ...
until RuntimeError: CUDA out of memory occurred.
Since I am already using the best hardware Colab offers, I wonder whether there is a way or trick to make it executable on Colab Pro+?
Say, a memory-saving trick, or using another smaller but similar model instead?
Many thanks !!
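One memory-related knob already in that notebook is gradient accumulation. A minimal sketch of how those three variables combine, assuming ACCUMULATE_BATCHES maps to Lightning's accumulate_grad_batches (an assumption about the notebook's wiring):

```python
# Sketch: effective batch size under gradient accumulation. Raising
# ACCUMULATE_BATCHES grows the batch the optimizer effectively sees
# while the per-step VRAM cost stays that of BATCH_SIZE.
BATCH_SIZE = 1          # per-step batch; keep at 1 to minimize VRAM
N_GPUS = 1
ACCUMULATE_BATCHES = 4  # gradients summed over 4 steps per optimizer step

effective_batch = BATCH_SIZE * N_GPUS * ACCUMULATE_BATCHES
print(effective_batch)  # 4
```

This trades training speed for memory; it does not reduce the footprint of the model weights themselves, so it won't fix an OOM that happens at batch size 1.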