Zj-BinXia / DiffIR

This project is the official implementation of "DiffIR: Efficient Diffusion Model for Image Restoration", ICCV 2023
Apache License 2.0

During the deblur reproduction, the trainS1 does not converge. #21

Closed huaqlili closed 10 months ago

huaqlili commented 10 months ago

Hello, due to limited hardware resources, I trained the deblurring network in the S1 stage on two 3090 GPUs. To avoid running out of memory, I reduced `batch_size_per_gpu` to 2. During training, the PSNR reached around 26 within the first 20,000 iterations, but then the loss increased sharply and the PSNR dropped to around 5. Could you please help me understand the possible causes of this issue? I would greatly appreciate your response!

Zj-BinXia commented 10 months ago

Your effective batch size is only 4, which is too small for restoration training; you should increase the number of training iterations by roughly a factor of 5 to compensate. As for the drop in performance, I believe it is likely caused by the combination of a small batch size and an unchanged learning rate. You can resume training from the last checkpoint saved before the performance dropped.
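To make the suggestion concrete, here is a minimal sketch of the linear scaling heuristic often used when the effective batch size changes: scale the learning rate down by the batch-size ratio and stretch the iteration budget by the inverse ratio. The function name and the baseline numbers (learning rate, iteration count, reference batch size) are hypothetical placeholders, not values from the DiffIR configs; the author's advice of "about 5x more iterations" is a similar, slightly milder adjustment.

```python
# Hypothetical helper illustrating the linear scaling heuristic for a
# reduced effective batch size. None of the baseline numbers below are
# taken from the DiffIR configs; they are placeholders for illustration.
def adjust_schedule(base_lr, base_iters, base_batch, actual_batch):
    """Scale LR down and iteration budget up by the batch-size ratio."""
    scale = actual_batch / base_batch          # e.g. 4 / 32 = 0.125
    return base_lr * scale, int(base_iters / scale)

# Example: a schedule tuned for batch 32, run with effective batch 4
# (2 GPUs x batch_size_per_gpu=2).
lr, iters = adjust_schedule(base_lr=2e-4, base_iters=300_000,
                            base_batch=32, actual_batch=4)
```

In practice you would also pass the checkpoint saved before the divergence as the resume state when restarting, so training continues from healthy weights with the adjusted schedule.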

Wxsxulin commented 2 months ago

I have the same issue.