ckkelvinchan / RealBasicVSR

Official repository of "Investigating Tradeoffs in Real-World Video Super-Resolution"
Apache License 2.0
906 stars 136 forks

Issue with training #28

Closed. Kadakol closed this issue 2 years ago.

Kadakol commented 2 years ago

I have followed the instructions listed in the README and completed the training. However, I ran into a few issues during the training.

  1. During training of the RealBasicVSR Net model, the loss became NaN after iteration 220000. I have attached the training logs: RealBasicVSR Net.log, RealBasicVSR.log.
  2. I used the iter_220000 checkpoint as the initialization for the RealBasicVSR GAN model. After training completed, the images generated during inference are of very poor quality. I have included a few sample generated images from the VideoLQ dataset; the same issue is reproducible on images from the Vid4 dataset.

No changes have been made to the source code. The code commit ID used is fa3d3284664b05341867f51149c12e10a002fc0f from Jan 17, 2022.

Could you please let me know how this can be resolved? Also please let me know if any more information is required from my side in order to debug this.
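For reference, the initialization step described above is typically done by pointing the GAN-stage config at the net-stage checkpoint via `load_from`. This is only a minimal sketch assuming an MMEditing/MMCV-style config; the work-dir path below is illustrative, not the exact one from this setup:

```python
# GAN-stage config (sketch): initialize the generator from the net-stage checkpoint.
# The path below is illustrative; point it at your own iter_220000.pth.
load_from = 'work_dirs/realbasicvsr_wogan/iter_220000.pth'
resume_from = None  # start a fresh GAN-stage run rather than resuming its own state
```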

[Attachments: sample generated images]

ckkelvinchan commented 2 years ago

From the log it seems that you are using 1 GPU? I think the problem may come from unstable gradients. In our experiments we used 8 GPUs with a batch size of 1 per GPU; the effective batch size could be too small if only one GPU is used.

Maybe you can try increasing the batch size to 2 and reducing the sequence length from 15 to 8 to see if that helps.
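As a concrete illustration, a single-GPU adjustment along those lines might look like the following in the data section of the training config. This is only a sketch assuming MMEditing 0.x config conventions (`samples_per_gpu`, `num_input_frames`); the dataset type, paths, and surrounding keys are placeholders, so compare against the actual config shipped with this repo before applying it.

```python
# Sketch only: assumed MMEditing 0.x config keys, placeholder dataset settings.
data = dict(
    workers_per_gpu=8,
    # Per-GPU batch size: raise from 1 to 2 to compensate for training on a single GPU.
    train_dataloader=dict(samples_per_gpu=2, drop_last=True),
    val_dataloader=dict(samples_per_gpu=1),
    train=dict(
        type='RepeatDataset',
        times=1000,
        dataset=dict(
            type='SRFolderMultipleGTDataset',
            lq_folder='data/train/LQ',      # placeholder paths
            gt_folder='data/train/GT',
            num_input_frames=8,             # sequence length: 15 -> 8 to keep memory in check
            pipeline=train_pipeline,        # assumed to be defined earlier in the config
            scale=4,
            test_mode=False)),
)
```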

Kadakol commented 2 years ago

Hi @ckkelvinchan . You're right. I am using 1 GPU for my training.

Let me try out your suggestion and see if it helps. Thank you!

Feynman1999 commented 7 months ago

> Hi @ckkelvinchan . You're right. I am using 1 GPU for my training.
>
> Let me try out your suggestion and see if it helps. Thank you!

Does the larger batch size help?