csslc / CCSR

Official code for CCSR: Improving the Stability of Diffusion Models for Content Consistent Super-Resolution
https://csslc.github.io/project-CCSR/

Train model: CUDA out of memory #12

Open aoyang-hd opened 5 months ago

aoyang-hd commented 5 months ago

Is there any way to train on a 24 GB RTX 3090, even with a batch size of one?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 3; 23.69 GiB total capacity; 23.03 GiB already allocated; 21.69 MiB free; 23.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Epoch 0: 0%| | 2/35135 [00:29<144:16:07, 14.78s/it, loss=0.389, v_num=0, train/loss_simple_step=0.131, train/loss_vlb_step=0.000475, train/loss_step=0.131, global_step=0.000, train/loss_x0_step=0.335, train/loss_x0_from_tao_step=0.366, train/loss_noise_from_tao_step=0.00291, train/loss_net_step=0.704]
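The error text itself points at one mitigation: setting `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF` to reduce allocator fragmentation. A minimal sketch of doing that from Python; the 128 MB value is an arbitrary starting point, not something from this thread:

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator is
# first used, so set it before any CUDA work happens.
# 128 is a hypothetical starting value; tune it for your workload.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # safe to import here; the allocator reads the env var lazily
```

The same variable can be exported in the shell before launching the training script. Note this only helps when the failure is fragmentation (reserved >> allocated, as in the log above), not when the model genuinely needs more than 24 GB.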

cswry commented 5 months ago

Hello, you can try fp16 for training.
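For context, the progress-bar format in the log above (`v_num`, `*_step` metrics) suggests the training loop is PyTorch Lightning. A minimal sketch of what enabling fp16 looks like there; CCSR constructs its Trainer inside its own training script, so this illustrates the flag involved rather than the project's exact entry point:

```python
import pytorch_lightning as pl

# Mixed-precision (fp16) roughly halves activation memory, which is
# often enough to fit a small batch on a 24 GB card.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,  # spelled "16-mixed" on Lightning >= 2.0
)
# trainer.fit(model, datamodule)  # model/data come from the CCSR config
```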

jfischoff commented 5 months ago

Reduce the batch size. It is hardcoded to 16, but you can reduce it (see the sketch below).
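A minimal sketch of overriding that default, assuming an OmegaConf-style YAML config like the latent-diffusion codebases CCSR builds on; the file name and key path below are hypothetical, so verify them against the repo's actual training config:

```python
from omegaconf import OmegaConf

# Hypothetical path and keys -- check the real config file in the repo.
cfg = OmegaConf.load("configs/train_ccsr.yaml")
cfg.data.params.batch_size = 1  # down from the hardcoded 16
OmegaConf.save(cfg, "configs/train_ccsr_bs1.yaml")
```

Alternatively, just edit the batch size directly in the YAML (or wherever the training script hardcodes it).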

zhouyizhuo commented 4 months ago

@aoyang-hd @cswry @jfischoff I wanted to ask whether you ran it successfully on a single GPU. I'd appreciate it if you could reply.

jfischoff commented 4 months ago

Yes, I just had to reduce the batch size.

zhouyizhuo commented 4 months ago

@jfischoff How long did it take you to complete the training? (●'◡'●)

jfischoff commented 4 months ago

I didn't run the complete training, just a test. I think the full run took about 2 days on 8x A100s.

zhouyizhuo commented 4 months ago

Thank you for responding. 😊