Redaimao / DRPL

Implementation of DRPL: Deep Regression Pair Learning for Multi-Focus Image Fusion

Training issue #5

Open 920703 opened 1 year ago

920703 commented 1 year ago

I was training your model. I ran it for 20 epochs and set the training batch size to 5. During training I noticed that in the 6th epoch, at iteration 4660 out of 5495, the learning rate becomes 0.0 and remains there until training finishes, i.e., through the 20th epoch.

[screenshot of the training log]

and the last epoch's results are:

[screenshot]

What is the reason behind this? I used all the default values; nothing was changed.

Any help will be appreciated. Thanks

Redaimao commented 1 year ago

Hi,

Can you please check the format used to print the lr? Maybe the lr is just too small to show at that precision.

920703 commented 1 year ago

@Redaimao I have run the code again with the following changes.

I have changed the following in your train_net.py file (according to the training details mentioned in the paper):

  1. Changed the learning rate from 0.05 to 0.001.
  2. Added weight_decay=0.5, because it was not specified there.
  3. Increased the step size in the scheduler to 300, because I am using train_bs=5 and 20 epochs. With these settings there are 5495 iterations per epoch, so the learning rate is reduced every 300th iteration.

See below:

```python
parser.add_argument('--lr_init', type=float, default=0.001, help='learning rate for generator')

optimizer = optim.Adam(net.parameters(), lr=opt.lr_init, weight_decay=0.5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.5)
```
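For reference, here is a minimal runnable sketch of how these pieces interact when the scheduler is stepped once per iteration. The toy network, data, and loop below are stand-ins to keep the sketch self-contained, not the actual DRPL code:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Toy stand-ins (assumptions, not the real DRPL objects): a tiny network
# and a fake "loader" with 5495 batches of batch size 5.
net = nn.Linear(8, 1)
train_loader = [(torch.randn(5, 8), torch.randn(5, 1)) for _ in range(5495)]
criterion = nn.MSELoss()

optimizer = optim.Adam(net.parameters(), lr=0.001, weight_decay=0.5)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.5)

for epoch in range(20):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(x), y)
        loss.backward()
        optimizer.step()
        # Stepping the scheduler once per *iteration* halves the lr
        # every 300 iterations, i.e. roughly 18 times per epoch.
        scheduler.step()
```

If the intent were to halve the lr every 300 epochs rather than every 300 iterations, scheduler.step() would instead be called once per epoch, outside the inner loop.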

It is running now; let's see if that problem comes again.

But another problem is that I am getting loss_bt = 0 from the very start of training. Why is that? Is the model overfitting, or is it something else?

Redaimao commented 1 year ago

Hi, I am not really sure why it is 0, as we didn't encounter such an issue. It may come from the configuration changes you made. Also, as I mentioned, you should check the printing format and how many decimal places are printed. You can tune the lr to see whether the performance improves. Thanks.
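For instance, a fixed-point format hides a tiny but nonzero lr, while scientific notation reveals it. A stand-alone sketch with a toy optimizer (not the train_net.py code):

```python
import torch
import torch.optim as optim

# A single dummy parameter, just to be able to build an optimizer.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.Adam([param], lr=1e-120)

lr = optimizer.param_groups[0]['lr']
print('lr: {:.4f}'.format(lr))  # -> lr: 0.0000 (looks like zero)
print('lr: {:.6e}'.format(lr))  # -> lr: 1.000000e-120 (actual value)
```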

920703 commented 1 year ago

The learning rate (lr) is printed as a string with the literal value "0.001".

See below:

"Training: Epoch[{:0>3}/{:0>3}] Iteration[{:0>3}/{:0>3}] lr:{} Loss: {:.4f} Loss_pair: {:.4f} Loss_bt: {:.4f} Loss_grads: {:.4f} Loss_ssim: {:.4f} ".format(epoch + 1, opt.max_epoch, i + 1, len(train_loader), '0.001', loss_avg, loss_pair_lable_avg, loss_between_pair_avg, loss_gradients_avg, loss_ssim_avg))

Both before and after changing the configuration, loss_bt is consistently zero, but this time the lr does not become exactly 0.

See the log below (showing the last iterations of the last epoch):

```
Training: Epoch[020/020] Iteration[5370/5495] lr:1.330612450002547e-113 Loss: 0.3635 Loss_pair: 0.4194 Loss_bt: 0.0000 Loss_grads: 0.2198 Loss_ssim: 0.0601
Training: Epoch[020/020] Iteration[5380/5495] lr:1.330612450002547e-113 Loss: 0.3662 Loss_pair: 0.4213 Loss_bt: 0.0000 Loss_grads: 0.2278 Loss_ssim: 0.0636
Training: Epoch[020/020] Iteration[5390/5495] lr:1.330612450002547e-113 Loss: 0.3615 Loss_pair: 0.4209 Loss_bt: 0.0000 Loss_grads: 0.1901 Loss_ssim: 0.0573
Training: Epoch[020/020] Iteration[5400/5495] lr:6.653062250012736e-114 Loss: 0.3672 Loss_pair: 0.4200 Loss_bt: 0.0000 Loss_grads: 0.2468 Loss_ssim: 0.0653
Training: Epoch[020/020] Iteration[5410/5495] lr:6.653062250012736e-114 Loss: 0.3699 Loss_pair: 0.4189 Loss_bt: 0.0000 Loss_grads: 0.2767 Loss_ssim: 0.0707
Training: Epoch[020/020] Iteration[5420/5495] lr:6.653062250012736e-114 Loss: 0.3660 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2425 Loss_ssim: 0.0638
Training: Epoch[020/020] Iteration[5430/5495] lr:6.653062250012736e-114 Loss: 0.3621 Loss_pair: 0.4200 Loss_bt: 0.0000 Loss_grads: 0.2011 Loss_ssim: 0.0604
Training: Epoch[020/020] Iteration[5440/5495] lr:6.653062250012736e-114 Loss: 0.3663 Loss_pair: 0.4196 Loss_bt: 0.0000 Loss_grads: 0.2405 Loss_ssim: 0.0656
Training: Epoch[020/020] Iteration[5450/5495] lr:6.653062250012736e-114 Loss: 0.3648 Loss_pair: 0.4203 Loss_bt: 0.0000 Loss_grads: 0.2228 Loss_ssim: 0.0635
Training: Epoch[020/020] Iteration[5460/5495] lr:6.653062250012736e-114 Loss: 0.3642 Loss_pair: 0.4193 Loss_bt: 0.0000 Loss_grads: 0.2256 Loss_ssim: 0.0618
Training: Epoch[020/020] Iteration[5470/5495] lr:6.653062250012736e-114 Loss: 0.3638 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2235 Loss_ssim: 0.0609
Training: Epoch[020/020] Iteration[5480/5495] lr:6.653062250012736e-114 Loss: 0.3655 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2383 Loss_ssim: 0.0625
Training: Epoch[020/020] Iteration[5490/5495] lr:6.653062250012736e-114 Loss: 0.3623 Loss_pair: 0.4197 Loss_bt: 0.0000 Loss_grads: 0.2063 Loss_ssim: 0.0587
Model saved
Finished Training
net_save_path= ./Result/Latest/01-30_08-07-43/20_net_params.pkl
```

Zhaohaojie4598 commented 1 year ago

This is because your lr decays too quickly, so it gets infinitesimally close to 0.
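The numbers in the log bear this out: with scheduler.step() called once per iteration, step_size=300, and gamma=0.5, the lr is halved 366 times over 20 epochs of 5495 iterations. A quick check in plain Python:

```python
lr_init = 0.001
total_steps = 20 * 5495             # scheduler.step() runs once per iteration
halvings = total_steps // 300       # StepLR: one halving every 300 steps
print(halvings)                     # 366
print(lr_init * 0.5 ** halvings)    # 6.653062250012736e-114, as in the log
```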

920703 commented 1 year ago

@Redaimao @Zhaohaojie4598 I am running the model for 20 epochs, but after a few iterations in the very first epoch I am getting loss_bt=0. I am not able to understand the reason behind this. Please help.

[screenshot]

And the second problem: I have set the step size to 300 in the scheduler, as my batch size is now 8. See above: at the 300th iteration the printed lr becomes 0.00025, and immediately in the next iteration it is 0.0005 (0.001 multiplied by 0.5). Where does the 0.00025 come from?
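One possible explanation, assuming the printed lr is read with scheduler.get_lr() (the printing code isn't shown, so this is a guess): when get_lr() is called outside of step() exactly at a step boundary, StepLR returns the current lr with an extra gamma factor applied, while get_last_lr() returns the value actually in use. That would show 0.0005 × 0.5 = 0.00025 at iteration 300 and 0.0005 immediately after. A minimal reproduction:

```python
import torch
import torch.optim as optim

param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.Adam([param], lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.5)

for i in range(1, 302):
    optimizer.step()
    scheduler.step()
    if i in (299, 300, 301):
        # get_lr() applies an extra gamma when queried at a step boundary
        # (PyTorch warns about this); get_last_lr() gives the lr in use.
        print(i, scheduler.get_lr()[0], scheduler.get_last_lr()[0])
# 299 0.001   0.001
# 300 0.00025 0.0005
# 301 0.0005  0.0005
```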

Please reply. I am waiting for your response. Thank you