Hi,
Can you please check the print format for the lr? Maybe the lr is just too small to display.
@Redaimao I have run the code again with the following changes to your train_net.py file (changed according to the training details mentioned in the paper).
See below:
```python
parser.add_argument('--lr_init', type=float, default=0.001, help='learning rate for generator')

optimizer = optim.Adam(net.parameters(), lr=opt.lr_init, weight_decay=0.5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.5)
```
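For reference, a minimal sketch (using an illustrative `nn.Linear` stand-in for the repo's `net`) of how `StepLR` counts steps: `step_size` counts calls to `scheduler.step()`, so whether 300 means 300 epochs or 300 iterations depends entirely on where that call sits in train_net.py:

```python
from torch import nn, optim

net = nn.Linear(10, 1)  # illustrative stand-in for the repo's `net`
optimizer = optim.Adam(net.parameters(), lr=0.001, weight_decay=0.5)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.5)

# StepLR decays the lr by gamma once per `step_size` calls to scheduler.step():
#   lr = lr_init * gamma ** (num_step_calls // step_size)
for epoch in range(20):
    optimizer.step()    # placeholder for a real training epoch
    scheduler.step()    # stepping once per epoch: 20 // 300 == 0 decays so far

print(optimizer.param_groups[0]['lr'])  # still 0.001 after 20 epochs
```

If `scheduler.step()` is instead called inside the inner iteration loop, every 300 iterations triggers a decay.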
It is running now; let's see if that problem comes up again.
But another problem is that I am getting loss_bt = 0 from the very start of training. Why is that? Is the model overfitting, or is it something else?
Hi, I am not really sure why it is 0, as we didn't encounter such an issue; it may come from the configuration you made. Also, as I mentioned, you should check the print format and how many decimal places are printed. You can tune the lr to see whether the performance improves. Thanks.
The learning rate (lr) is printed as a hard-coded string with the value "0.001".
See below:
"Training: Epoch[{:0>3}/{:0>3}] Iteration[{:0>3}/{:0>3}] lr:{} Loss: {:.4f} Loss_pair: {:.4f} Loss_bt: {:.4f} Loss_grads: {:.4f} Loss_ssim: {:.4f} ".format(epoch + 1, opt.max_epoch, i + 1, len(train_loader), '0.001', loss_avg, loss_pair_lable_avg, loss_between_pair_avg, loss_gradients_avg, loss_ssim_avg))
Both before and after changing the configuration, loss_bt is consistently zero, but this time the lr does not become 0.
See the log below (showing the last iterations of the last epoch).
```
Training: Epoch[020/020] Iteration[5370/5495] lr:1.330612450002547e-113 Loss: 0.3635 Loss_pair: 0.4194 Loss_bt: 0.0000 Loss_grads: 0.2198 Loss_ssim: 0.0601
Training: Epoch[020/020] Iteration[5380/5495] lr:1.330612450002547e-113 Loss: 0.3662 Loss_pair: 0.4213 Loss_bt: 0.0000 Loss_grads: 0.2278 Loss_ssim: 0.0636
Training: Epoch[020/020] Iteration[5390/5495] lr:1.330612450002547e-113 Loss: 0.3615 Loss_pair: 0.4209 Loss_bt: 0.0000 Loss_grads: 0.1901 Loss_ssim: 0.0573
Training: Epoch[020/020] Iteration[5400/5495] lr:6.653062250012736e-114 Loss: 0.3672 Loss_pair: 0.4200 Loss_bt: 0.0000 Loss_grads: 0.2468 Loss_ssim: 0.0653
Training: Epoch[020/020] Iteration[5410/5495] lr:6.653062250012736e-114 Loss: 0.3699 Loss_pair: 0.4189 Loss_bt: 0.0000 Loss_grads: 0.2767 Loss_ssim: 0.0707
Training: Epoch[020/020] Iteration[5420/5495] lr:6.653062250012736e-114 Loss: 0.3660 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2425 Loss_ssim: 0.0638
Training: Epoch[020/020] Iteration[5430/5495] lr:6.653062250012736e-114 Loss: 0.3621 Loss_pair: 0.4200 Loss_bt: 0.0000 Loss_grads: 0.2011 Loss_ssim: 0.0604
Training: Epoch[020/020] Iteration[5440/5495] lr:6.653062250012736e-114 Loss: 0.3663 Loss_pair: 0.4196 Loss_bt: 0.0000 Loss_grads: 0.2405 Loss_ssim: 0.0656
Training: Epoch[020/020] Iteration[5450/5495] lr:6.653062250012736e-114 Loss: 0.3648 Loss_pair: 0.4203 Loss_bt: 0.0000 Loss_grads: 0.2228 Loss_ssim: 0.0635
Training: Epoch[020/020] Iteration[5460/5495] lr:6.653062250012736e-114 Loss: 0.3642 Loss_pair: 0.4193 Loss_bt: 0.0000 Loss_grads: 0.2256 Loss_ssim: 0.0618
Training: Epoch[020/020] Iteration[5470/5495] lr:6.653062250012736e-114 Loss: 0.3638 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2235 Loss_ssim: 0.0609
Training: Epoch[020/020] Iteration[5480/5495] lr:6.653062250012736e-114 Loss: 0.3655 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2383 Loss_ssim: 0.0625
Training: Epoch[020/020] Iteration[5490/5495] lr:6.653062250012736e-114 Loss: 0.3623 Loss_pair: 0.4197 Loss_bt: 0.0000 Loss_grads: 0.2063 Loss_ssim: 0.0587
Model saved
Finished Training
net_save_path= ./Result/Latest/01-30_08-07-43/20_net_params.pkl
```
This is because your lr decays too fast, which causes it to get infinitely close to 0.
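Indeed, that decay rate is consistent with `scheduler.step()` being called once per training iteration rather than once per epoch, so with `step_size=300` the lr halves every 300 iterations. A back-of-the-envelope check (a sketch of the arithmetic only, assuming ~5495 iterations per epoch as shown in the log above):

```python
# If scheduler.step() runs once per training iteration, StepLR(step_size=300,
# gamma=0.5) halves the lr every 300 iterations. Around iteration 5400 of
# epoch 20, with 5495 iterations per epoch:
total_iters = 19 * 5495 + 5400   # iterations completed before this printout
halvings = total_iters // 300    # decay boundaries crossed so far
print(halvings, 0.001 * 0.5 ** halvings)  # 366 -> ~6.653e-114, matching the log
```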
@Redaimao @Zhaohaojie4598 I am running the model for 20 epochs, but after a few iterations in the very first epoch, I am getting loss_bt=0. I am not able to understand the reason behind this. Please help.
And a second problem: I have set the step size to 300 in the learning rate scheduler, as my batch size is 8. See above: at the 300th iteration, how does the lr become 0.00025? And immediately in the next iteration it is 0.0005, i.e. 0.001 multiplied by 0.5. Where does 0.00025 come from?
Please reply. I am waiting for your response. Thank you.
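One observation on the numbers themselves (a sketch of the arithmetic only; which of these actually happens would depend on where scheduler.step() is called in train_net.py): 0.0005 is one halving of the initial lr and 0.00025 is two, so seeing 0.00025 at iteration 300 would mean the scheduler had crossed two decay boundaries by that point, e.g. if scheduler.step() is called more than once per iteration or both per iteration and per epoch.

```python
# StepLR(gamma=0.5) multiplies the lr by 0.5 at every decay boundary:
lr_init = 0.001
print(lr_init * 0.5)       # 0.0005  -> one halving (expected at scheduler step 300)
print(lr_init * 0.5 ** 2)  # 0.00025 -> two halvings, i.e. two boundaries crossed
```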
I was training your model. I ran it for 20 epochs and set the training batch size to 5. During training, I noticed that in the 6th epoch, at iteration 4660 out of 5495, the learning rate becomes 0.0, and it remains 0.0 until training finishes, i.e. until the 20th epoch.
And the last epoch's results are:
What is the reason behind this? I used all the default values; nothing was changed.
Any help will be appreciated. Thanks.