Closed huaqlili closed 10 months ago
Your effective batch size is only 4, which is too small for the model to converge well. You can compensate by increasing the number of training epochs by a factor of 5. As for the drop in performance, I suspect it is caused by the combination of a small batch size and an unchanged learning rate. You can resume training from the last epoch before the performance drop.
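The advice above suggests compensating for the smaller batch size. One common way to do this is the linear scaling rule: shrink the learning rate in proportion to the batch size, and restart from the last good checkpoint. A minimal PyTorch sketch, where the base batch size, base learning rate, and checkpoint keys are all assumptions for illustration and not taken from this repository:

```python
import torch
import torch.nn as nn

# Assumed reference config (hypothetical, not from the repo).
BASE_BATCH, BASE_LR = 16, 2e-4
ACTUAL_BATCH = 4  # 2 GPUs x batch_size_pergpu=2, as in this thread

# Linear scaling rule: scale the LR down with the batch size.
scaled_lr = BASE_LR * ACTUAL_BATCH / BASE_BATCH

model = nn.Linear(8, 8)  # stand-in for the deblur network
optimizer = torch.optim.Adam(model.parameters(), lr=scaled_lr)

# Resume from the last checkpoint saved before the PSNR collapse
# (file name and dict keys are hypothetical):
# ckpt = torch.load("last_good.pth")
# model.load_state_dict(ckpt["model"])
# optimizer.load_state_dict(ckpt["optimizer"])
# start_iter = ckpt["iter"]

print(scaled_lr)  # 5e-05
```

With the batch size cut by 4x, the gradient estimates are noisier, so an unchanged learning rate can push the weights into a divergent regime, which matches the sudden loss spike described below.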
I have the same question.
Hello, due to limited hardware resources I trained the deblur network in the S1 stage on two 3090 GPUs. To avoid running out of memory, I reduced batch_size_pergpu to 2. During training, the PSNR reached around 26 over the first 20,000 iterations, but then the loss increased sharply and the PSNR dropped to around 5. Could you please help me understand the possible causes of this issue? I would greatly appreciate your response!