Algolzw / EBSR

PyTorch code for "EBSR: Feature Enhanced Burst Super-Resolution with Deformable Alignment", CVPRW 2021, 1st place in NTIRE 2021 (real data track).

What is the burst_size that can be supported by 1 GPU? #11

Open pokaaa opened 2 years ago

pokaaa commented 2 years ago

I can only set burst_size=4. If burst_size is set to 8 or 16, it runs out of memory, right? Is that because the model is too big?

Algolzw commented 2 years ago

Yes, the default model is too big to train on a single GPU. If you want to retrain the model on your own datasets, we advise that you 1) load the pretrained model, 2) freeze the reconstruction module, and 3) fine-tune the remaining layers.
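A minimal sketch of that recipe in PyTorch. The checkpoint filename, the `make_ebsr_model()` constructor, and the `reconstruction` attribute are placeholders for illustration; match them to the actual names in the repo:

```python
import torch

# 1) Load the pretrained weights (constructor and path are hypothetical).
model = make_ebsr_model()
state = torch.load("pretrained_ebsr.pth", map_location="cpu")
model.load_state_dict(state)

# 2) Freeze the reconstruction module ("reconstruction" is an assumed
#    attribute name; use the real submodule name from the repo).
for p in model.reconstruction.parameters():
    p.requires_grad = False

# 3) Fine-tune only the remaining layers by handing the optimizer just
#    the parameters that still require gradients.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```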

Or maybe you can add the "--fp16" option and use a smaller batch size and patch size.
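For reference, PyTorch's native mixed-precision API gives the same kind of memory savings as an fp16 flag; this is a generic sketch, not the repo's actual `--fp16` code path (`model`, `optimizer`, `loader`, and `criterion` come from the surrounding training setup):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for burst, gt in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass in mixed fp16/fp32
        loss = criterion(model(burst), gt)
    scaler.scale(loss).backward()        # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)               # unscale gradients, then step
    scaler.update()
```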

pokaaa commented 2 years ago

Got it. That helps a lot. Thank you. And I wonder how long it takes to train one epoch with 4 GPUs? I use gradient accumulation with burst_size=4 and accumulation=4 so that the effective burst_size can be 16. But it seems that one epoch needs nearly 5 hours with 1 GPU...
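For anyone reading along, the gradient-accumulation loop described here looks roughly like this; it trades extra steps for memory by updating once every `accum_steps` micro-batches (names are placeholders from a generic training loop):

```python
accum_steps = 4                          # 4 micro-batches before each update

optimizer.zero_grad()
for step, (burst, gt) in enumerate(loader):
    loss = criterion(model(burst), gt) / accum_steps   # average over micro-batches
    loss.backward()                                    # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                               # one update per accumulation cycle
        optimizer.zero_grad()
```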

Algolzw commented 2 years ago

It takes about 3200 s per epoch on 4 2080Ti GPUs. What is your patch_size? Maybe you should reduce the model size to accelerate training, e.g., n_feats=64, n_resblocks=5, n_resgroups=4.
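To get a feel for how much those settings shrink the network, here is a rough, self-contained sketch of an RCAN-style trunk built from residual groups of residual blocks; this is a toy stand-in for illustration, not EBSR's actual reconstruction module:

```python
import torch.nn as nn

def trunk(n_feats, n_resblocks, n_resgroups):
    """Toy trunk: n_resgroups groups, each with n_resblocks two-conv blocks."""
    def block():
        return nn.Sequential(
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, 3, padding=1))
    groups = [nn.Sequential(*[block() for _ in range(n_resblocks)])
              for _ in range(n_resgroups)]
    return nn.Sequential(*groups)

params = sum(p.numel() for p in trunk(64, 5, 4).parameters())
print(f"{params / 1e6:.1f}M trunk parameters")   # ~1.5M with the reduced settings
```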

sfxz035 commented 2 years ago

Hi, I have some questions and hope you can answer them. I want to retrain this model on a new dataset, so I'd like to know more about the training. Did you train for 602 epochs at about 3200 s per epoch on 4 2080Ti GPUs? If so, did the training take about 22 days to complete? Have you tried reducing the number of epochs, and is there a performance hit?

Algolzw commented 2 years ago

@sfxz035 In the NTIRE 2021 challenge, we used 8 GPUs and training took about 10 days. I think 300 epochs are enough to train the model; the performance gain after 300 epochs would be small. You may also set the learning rate decay to '100-200'.
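A decay schedule like '100-200' usually maps to milestone-based learning rate steps; a hedged PyTorch equivalent is below (the decay factor gamma=0.5 is an assumed value, and the repo may parse the string differently):

```python
from torch.optim.lr_scheduler import MultiStepLR

# optimizer comes from the training setup; gamma=0.5 is an assumed decay factor.
scheduler = MultiStepLR(optimizer, milestones=[100, 200], gamma=0.5)

for epoch in range(300):
    # ... run one training epoch here ...
    scheduler.step()   # LR drops after epochs 100 and 200
```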

Moreover, in the NTIRE 2022 challenge, our winning method, BSRT, also set the total number of epochs to 300.