Paper99 / SRFBN_CVPR19

Pytorch code for our paper "Feedback Network for Image Super-Resolution" (CVPR2019)
MIT License

Gradient explosion #2

Closed urbaneman closed 5 years ago

urbaneman commented 5 years ago

Hi! Thanks for your work! When I try to train the original SRFBN x2 on my own dataset, the gradient explodes. The log looks like this:

===> Training Epoch: [1/1000]... Learning Rate: 0.000100 Epoch: [1/1000]: 100%|███| 985/985 [02:24<00:00, 7.63it/s, Batch Loss: 18.7957]

Epoch: [1/1000] Avg Train Loss: 18.989669 ===> Validating... [MyImage] PSNR: 32.84 SSIM: 0.9147 Loss: 4.540801 Best PSNR: 32.84 in Epoch: [1] ===> Saving last checkpoint to [experiments/SRFBN_in3f32_x2/epochs/last_ckp.pth] ...] ===> Saving best checkpoint to [experiments/SRFBN_in3f32_x2/epochs/best_ckp.pth] ...]

===> Training Epoch: [2/1000]... Learning Rate: 0.000100 Epoch: [2/1000]: 54%|██████████████████████████▎ | 530/985 [01:18<01:07, 6.78it/s, Batch Loss: 3829046.5000][Warning] Skip this batch! (Loss: 11705110.0) Epoch: [2/1000]: 56%|██████████████████████████ | 547/985 [01:20<01:03, 6.84it/s, Batch Loss: 930627584.0000][Warning] Skip this batch! (Loss: 2895044864.0) Epoch: [2/1000]: 57%|█████████████████████████▋ | 562/985 [01:22<01:02, 6.80it/s, Batch Loss: 24168931328.0000]

I only changed the dataset and set a larger dataset repeat. By the way, my dataset's HR images all have widths and heights divisible by 4 (w % 4 == 0 and h % 4 == 0). I trained SRFBN x4 successfully and got PSNR = 29.18 on this dataset. What do you think the problem is? Thanks!
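As an aside, a tiny helper like the sketch below (hypothetical code, not from this repo, assuming OpenCV is available) is enough to crop HR images so that both sides are divisible by the scale factor:

```python
# Hypothetical helper (not from the repo): crop an HR image so that both
# sides become exactly divisible by the SR scale factor.
import cv2

def crop_to_multiple(hr_path, scale, out_path):
    img = cv2.imread(hr_path)            # H x W x C (BGR)
    h, w = img.shape[:2]
    h, w = h - h % scale, w - w % scale  # largest sizes divisible by `scale`
    cv2.imwrite(out_path, img[:h, :w])

# e.g. crop_to_multiple("0001.png", 4, "0001_mod4.png")
```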

Paper99 commented 5 years ago

Sorry for the late reply. Could you please give me more details about your environment?

urbaneman commented 5 years ago

Sorry for the late reply, too. ^_^ I rebuilt my dataset so that every image satisfies w % 12 == 0 and h % 12 == 0, and trained the x3 model successfully (PSNR = 31.54). However, the x2 model's gradient still explodes in epoch 2, at around step 1100 with batch size 32. I tried changing the learning rate to 0.0005 in epoch 2, but the problem still occurred. The x3 and x4 models trained well, so I have no idea why this happens. By the way, my environment is:

python 3.5.2
torch 1.0.0
torchvision 0.2.1
numpy 1.15.0
opencv-python 3.4.2
imageio 2.4.1

Anything else you need? Looking forward to your reply.
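(For completeness, the versions above can be printed with a small snippet like this, using only standard version attributes:)

```python
# Print the library versions relevant to this issue (standard attributes only).
import sys, torch, torchvision, numpy, cv2, imageio

print("python      ", sys.version.split()[0])
print("torch       ", torch.__version__)
print("torchvision ", torchvision.__version__)
print("numpy       ", numpy.__version__)
print("opencv      ", cv2.__version__)
print("imageio     ", imageio.__version__)
print("CUDA (torch)", torch.version.cuda)
```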

Paper99 commented 5 years ago

I re-trained our SRFBN on NVIDIA 1080Ti/1060 GPUs under Ubuntu 16.04 and haven't encountered this problem so far. Could you please tell me the versions of CUDA and Ubuntu on your machine? Your feedback will help us solve this problem.

urbaneman commented 5 years ago

Ubuntu 16.04 and CUDA 9.0 on a Titan V.

Paper99 commented 5 years ago

OK, I will look into the problem you mentioned carefully.

urbaneman commented 5 years ago

Thanks so much.

urbaneman commented 5 years ago

I changed the optimizer from ADAM to SGD. Sadly, the gradient explosion happened again. I trained the same dataset with the MDSR model and got PSNR = 35.05 at x2, so the dataset itself probably isn't the problem. The SRFBN x2 log with SGD looks like this:

===> Start Train

Method: SRFBN || Scale: 2 || Epoch Range: (1 ~ 1000)

===> Training Epoch: [1/1000]... Learning Rate: 0.000100 Epoch: [1/1000]: 100%|████████████████████████████| 493/493 [01:10<00:00, 7.01it/s, Batch Loss: 16.1996]

Epoch: [1/1000] Avg Train Loss: 19.128814 ===> Validating... [MyImage] PSNR: 31.42 SSIM: 0.8953 Loss: 5.031117 Best PSNR: 31.42 in Epoch: [1] ===> Saving last checkpoint to [experiments/SRFBN_in3f32_x2/epochs/last_ckp.pth] ...] ===> Saving best checkpoint to [experiments/SRFBN_in3f32_x2/epochs/best_ckp.pth] ...]

===> Training Epoch: [2/1000]... Learning Rate: 0.000100 Epoch: [2/1000]: 100%|████████████████████████████| 493/493 [01:10<00:00, 6.99it/s, Batch Loss: 25.9972]

Epoch: [2/1000] Avg Train Loss: 18.650895 ===> Validating... [MyImage] PSNR: 31.70 SSIM: 0.9007 Loss: 4.926435 Best PSNR: 31.70 in Epoch: [2] ===> Saving last checkpoint to [experiments/SRFBN_in3f32_x2/epochs/last_ckp.pth] ...] ===> Saving best checkpoint to [experiments/SRFBN_in3f32_x2/epochs/best_ckp.pth] ...]

===> Training Epoch: [3/1000]... Learning Rate: 0.000100 Epoch: [3/1000]: 100%|████████████████████████████| 493/493 [01:10<00:00, 7.00it/s, Batch Loss: 13.6300]

Epoch: [3/1000] Avg Train Loss: 18.457663 ===> Validating... [MyImage] PSNR: 31.80 SSIM: 0.9021 Loss: 4.823419 Best PSNR: 31.80 in Epoch: [3] ===> Saving last checkpoint to [experiments/SRFBN_in3f32_x2/epochs/last_ckp.pth] ...] ===> Saving best checkpoint to [experiments/SRFBN_in3f32_x2/epochs/best_ckp.pth] ...]

===> Training Epoch: [4/1000]... Learning Rate: 0.000100 Epoch: [4/1000]: 100%|████████████████████████████| 493/493 [01:10<00:00, 6.97it/s, Batch Loss: 14.3672]

Epoch: [4/1000] Avg Train Loss: 18.468359 ===> Validating... [MyImage] PSNR: 32.07 SSIM: 0.9050 Loss: 4.720620 Best PSNR: 32.07 in Epoch: [4] ===> Saving last checkpoint to [experiments/SRFBN_in3f32_x2/epochs/last_ckp.pth] ...] ===> Saving best checkpoint to [experiments/SRFBN_in3f32_x2/epochs/best_ckp.pth] ...]

===> Training Epoch: [5/1000]... Learning Rate: 0.000100 Epoch: [5/1000]: 100%|████████████████████████████| 493/493 [01:10<00:00, 7.00it/s, Batch Loss: 20.2245]

Epoch: [5/1000] Avg Train Loss: 18.269129 ===> Validating... [MyImage] PSNR: 32.15 SSIM: 0.9071 Loss: 4.810636 Best PSNR: 32.15 in Epoch: [5] ===> Saving last checkpoint to [experiments/SRFBN_in3f32_x2/epochs/last_ckp.pth] ...] ===> Saving best checkpoint to [experiments/SRFBN_in3f32_x2/epochs/best_ckp.pth] ...]

===> Training Epoch: [6/1000]... Learning Rate: 0.000100 Epoch: [6/1000]: 100%|████████████████████████████| 493/493 [01:10<00:00, 6.99it/s, Batch Loss: 22.2137]

Epoch: [6/1000] Avg Train Loss: 18.185644 ===> Validating... [MyImage] PSNR: 31.82 SSIM: 0.9033 Loss: 5.109773 Best PSNR: 32.15 in Epoch: [5] ===> Saving last checkpoint to [experiments/SRFBN_in3f32_x2/epochs/last_ckp.pth] ...]

===> Training Epoch: [7/1000]... Learning Rate: 0.000100 Epoch: [7/1000]: 100%|████████████████████████████| 493/493 [01:10<00:00, 7.00it/s, Batch Loss: 15.9942]

Epoch: [7/1000] Avg Train Loss: 18.422250 ===> Validating... [MyImage] PSNR: 31.96 SSIM: 0.9046 Loss: 4.937292 Best PSNR: 32.15 in Epoch: [5] ===> Saving last checkpoint to [experiments/SRFBN_in3f32_x2/epochs/last_ckp.pth] ...]

===> Training Epoch: [8/1000]... Learning Rate: 0.000100 Epoch: [8/1000]: 100%|████████████████████████████| 493/493 [01:10<00:00, 6.98it/s, Batch Loss: 15.5571]

Epoch: [8/1000] Avg Train Loss: 19.417537 ===> Validating... [MyImage] PSNR: 30.29 SSIM: 0.8728 Loss: 5.510315 Best PSNR: 32.15 in Epoch: [5] ===> Saving last checkpoint to [experiments/SRFBN_in3f32_x2/epochs/last_ckp.pth] ...]

===> Training Epoch: [9/1000]... Learning Rate: 0.000100 Epoch: [9/1000]: 91%|█████████████████████████▍ | 447/493 [01:03<00:06, 7.08it/s, Batch Loss: 27.9309][Warning] Skip this batch! (Loss: 101.48338317871094) Epoch: [9/1000]: 91%|████████████████████████▌ | 448/493 [01:04<00:06, 7.13it/s, Batch Loss: 101.4834][Warning] Skip this batch! (Loss: 108.13626098632812) Epoch: [9/1000]: 91%|████████████████████████▌ | 449/493 [01:04<00:06, 7.19it/s, Batch Loss: 108.1363][Warning] Skip this batch! (Loss: 108.7495346069336) Epoch: [9/1000]: 91%|████████████████████████▋ | 450/493 [01:04<00:05, 7.19it/s, Batch Loss: 108.7495][Warning] Skip this batch! (Loss: 120.21204376220703) Epoch: [9/1000]: 91%|████████████████████████▋ | 451/493 [01:04<00:05, 7.15it/s, Batch Loss: 120.2120][Warning] Skip this batch! (Loss: 114.43475341796875) Epoch: [9/1000]: 92%|████████████████████████▊ | 452/493 [01:04<00:05, 7.17it/s, Batch Loss: 114.4348][Warning] Skip this batch! (Loss: 102.55879211425781) Epoch: [9/1000]: 92%|████████████████████████▊ | 453/493 [01:04<00:05, 7.15it/s, Batch Loss: 102.5588][Warning] Skip this batch! (Loss: 105.85001373291016) Epoch: [9/1000]: 92%|████████████████████████▊ | 454/493 [01:04<00:05, 7.16it/s, Batch Loss: 105.8500][Warning] Skip this batch! (Loss: 104.69125366210938) Epoch: [9/1000]: 92%|████████████████████████▉ | 455/493 [01:05<00:05, 7.18it/s, Batch Loss: 104.6913][Warning] Skip this batch! (Loss: 112.86515808105469) Epoch: [9/1000]: 92%|████████████████████████▉ | 456/493 [01:05<00:05, 7.19it/s, Batch Loss: 112.8652][Warning] Skip this batch! (Loss: 103.87828826904297) Epoch: [9/1000]: 93%|█████████████████████████ | 457/493 [01:05<00:05, 7.19it/s, Batch Loss: 103.8783][Warning] Skip this batch! (Loss: 109.6042709350586) Epoch: [9/1000]: 93%|█████████████████████████ | 458/493 [01:05<00:04, 7.20it/s, Batch Loss: 109.6043][Warning] Skip this batch! (Loss: 108.07820129394531) Epoch: [9/1000]: 93%|█████████████████████████▏ | 459/493 [01:05<00:04, 7.20it/s, Batch Loss: 108.0782][Warning] Skip this batch! (Loss: 99.94971466064453) Epoch: [9/1000]: 93%|██████████████████████████▏ | 460/493 [01:05<00:04, 7.19it/s, Batch Loss: 99.9497][Warning] Skip this batch! (Loss: 105.28050231933594) Epoch: [9/1000]: 94%|█████████████████████████▏ | 461/493 [01:05<00:04, 7.25it/s, Batch Loss: 105.2805][Warning] Skip this batch! (Loss: 103.20718383789062) Epoch: [9/1000]: 94%|█████████████████████████▎ | 462/493 [01:06<00:04, 7.27it/s, Batch Loss: 103.2072][Warning] Skip this batch! (Loss: 115.111572265625) Epoch: [9/1000]: 94%|█████████████████████████▎ | 463/493 [01:06<00:04, 7.23it/s, Batch Loss: 115.1116][Warning] Skip this batch! (Loss: 114.798095703125) Epoch: [9/1000]: 94%|█████████████████████████▍ | 464/493 [01:06<00:04, 7.21it/s, Batch Loss: 114.7981][Warning] Skip this batch! (Loss: 99.00669860839844) Epoch: [9/1000]: 94%|██████████████████████████▍ | 465/493 [01:06<00:03, 7.19it/s, Batch Loss: 99.0067][Warning] Skip this batch! (Loss: 114.76979064941406) Epoch: [9/1000]: 95%|█████████████████████████▌ | 466/493 [01:06<00:03, 7.20it/s, Batch Loss: 114.7698][Warning] Skip this batch! (Loss: 101.20458221435547) Epoch: [9/1000]: 95%|█████████████████████████▌ | 467/493 [01:06<00:03, 7.20it/s, Batch Loss: 101.2046][Warning] Skip this batch! (Loss: 106.661376953125) Epoch: [9/1000]: 95%|█████████████████████████▋ | 468/493 [01:06<00:03, 7.23it/s, Batch Loss: 106.6614][Warning] Skip this batch! 
(Loss: 103.37936401367188) Epoch: [9/1000]: 95%|█████████████████████████▋ | 469/493 [01:06<00:03, 7.20it/s, Batch Loss: 103.3794][Warning] Skip this batch! (Loss: 106.76753997802734) Epoch: [9/1000]: 95%|█████████████████████████▋ | 470/493 [01:07<00:03, 7.19it/s, Batch Loss: 106.7675][Warning] Skip this batch! (Loss: 109.10973358154297) Epoch: [9/1000]: 96%|█████████████████████████▊ | 471/493 [01:07<00:03, 7.19it/s, Batch Loss: 109.1097][Warning] Skip this batch! (Loss: 107.09650421142578) Epoch: [9/1000]: 96%|█████████████████████████▊ | 472/493 [01:07<00:02, 7.17it/s, Batch Loss: 107.0965][Warning] Skip this batch! (Loss: 110.01425170898438) Epoch: [9/1000]: 96%|█████████████████████████▉ | 473/493 [01:07<00:02, 7.18it/s, Batch Loss: 110.0143][Warning] Skip this batch! (Loss: 100.29669189453125) Epoch: [9/1000]: 96%|█████████████████████████▉ | 474/493 [01:07<00:02, 7.16it/s, Batch Loss: 100.2967][Warning] Skip this batch! (Loss: 105.13536071777344) Epoch: [9/1000]: 96%|██████████████████████████ | 475/493 [01:07<00:02, 7.21it/s, Batch Loss: 105.1354][Warning] Skip this batch! (Loss: 100.22975158691406) Epoch: [9/1000]: 97%|██████████████████████████ | 476/493 [01:07<00:02, 7.25it/s, Batch Loss: 100.2298][Warning] Skip this batch! (Loss: 104.96470642089844) Epoch: [9/1000]: 97%|██████████████████████████ | 477/493 [01:08<00:02, 7.26it/s, Batch Loss: 104.9647][Warning] Skip this batch! (Loss: 108.01797485351562) Epoch: [9/1000]: 97%|██████████████████████████▏| 478/493 [01:08<00:02, 7.26it/s, Batch Loss: 108.0180][Warning] Skip this batch! (Loss: 104.71043395996094) Epoch: [9/1000]: 97%|██████████████████████████▏| 479/493 [01:08<00:01, 7.29it/s, Batch Loss: 104.7104][Warning] Skip this batch! (Loss: 107.51206970214844) Epoch: [9/1000]: 97%|██████████████████████████▎| 480/493 [01:08<00:01, 7.25it/s, Batch Loss: 107.5121]

Paper99 commented 5 years ago

This problem doesn't occur on my machines. My guess is that the x2 image SR task generally produces smaller errors than the x3 and x4 tasks, so occasional larger batch losses exceed the relative skip threshold more easily and more batches get skipped during your training process.

Thus, one possible solution is to increase the skip_threshold value to 10 or even much larger. The skip_threshold setting is in this line: https://github.com/Paper99/SRFBN_CVPR19/blob/0b2f7f4d418f6580fd009a397762021d2deeaf1e/options/train/train_SRFBN_example.json#L53
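For reference, the guard behaves roughly like the sketch below (an approximation of the idea, not the repo's exact solver code): a batch is skipped, instead of backpropagated, whenever its loss is much larger than a running reference loss.

```python
# Approximate sketch of a skip-threshold guard (not the repo's exact code):
# a batch whose loss blows up relative to a running reference is skipped
# instead of being allowed to corrupt the weights.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)          # toy stand-in for SRFBN
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

skip_threshold = 10.0                          # raise this if too many batches get skipped
reference_loss = 1e8                           # large init so early batches are never skipped

for step in range(100):                        # stand-in for iterating a DataLoader
    lr_patch = torch.rand(4, 3, 48, 48)        # fake LR patches
    hr_patch = torch.rand(4, 3, 48, 48)        # fake HR targets (same size for this toy model)

    optimizer.zero_grad()
    loss = criterion(model(lr_patch), hr_patch)

    if loss.item() < skip_threshold * reference_loss:
        loss.backward()
        optimizer.step()
        reference_loss = loss.item()           # track the last accepted batch loss
    else:
        print('[Warning] Skip this batch! (Loss: {})'.format(loss.item()))
```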

urbaneman commented 5 years ago

I tried that, but it didn't work. So I changed the dataset, and the model trained well. However, I still have no idea what made the gradient explode; maybe the dataset has some issues. Fine, I will run experiments on the new dataset, so let it go. 😂 Thanks for your work and your reply, again. This issue can be closed.

urbaneman commented 5 years ago

The original HR images contain a lot of noise. My guess is that this varying noise makes the gradient explode during training, and the x2 model is more sensitive to it than the x3 and x4 models.

Paper99 commented 5 years ago

If you have any further questions, please feel free to contact me.

urbaneman commented 5 years ago

Hi, @Paper99. Eventually I found out that the problem might be in PyTorch 1.0.0. After updating to PyTorch 1.0.1, the problem has not appeared so far. I'm now trying to reproduce the results of your paper, but it is hard to train. I noticed that your dataset-preparation file contains many data augmentation operations; are all of them necessary? The dataset after augmentation is huge. If you use all of it, how long did training take?
Also, do you use any other tricks when training the model, such as changing the loss function, the learning-rate schedule, or the optimizer? If you can share them, I would be very happy. Thanks so much.

Paper99 commented 5 years ago

To train the final model, we directly downsample each image in DIV2K+Flickr2K. Each epoch has about 1000 iterations. The learning rate of the ADAM optimizer is initially set to 0.0001 and is multiplied by 0.5 every 200 epochs.
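For illustration, that schedule maps onto a standard PyTorch scheduler as in the minimal sketch below (not the repo's actual training loop):

```python
# Minimal sketch of the described schedule (not the repo's exact code):
# Adam at lr = 1e-4, multiplied by 0.5 every 200 epochs.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)                     # toy stand-in for SRFBN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)

for epoch in range(1, 1001):
    # ... run ~1000 training iterations for this epoch ...
    scheduler.step()                                      # halves the lr every 200 epochs
```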

urbaneman commented 5 years ago

I have seen all of that in your code. Is that all? Are there any other tricks?

Sadly, after data augmentation the dataset is very large and my SSD doesn't have enough space. I will retrain the model when I get a new SSD (maybe within a week) and report the result afterwards. THANKS, again.

Paper99 commented 5 years ago

All configurations are written in the *.json files.

No data augmentation (except random rotation and flip in LRHR_dataset.py) is used for training final models.
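For reference, this kind of paired flip/rotation augmentation is typically implemented like the sketch below (hypothetical code, not copied from LRHR_dataset.py); the same random transform must be applied to both the LR patch and its HR counterpart:

```python
# Hypothetical sketch of paired random flip / 90-degree rotation augmentation
# (not copied from LRHR_dataset.py).
import random
import numpy as np

def augment(lr, hr):
    """lr, hr: numpy arrays of shape (H, W, C); returns augmented copies."""
    if random.random() < 0.5:                        # horizontal flip
        lr, hr = lr[:, ::-1, :], hr[:, ::-1, :]
    if random.random() < 0.5:                        # vertical flip
        lr, hr = lr[::-1, :, :], hr[::-1, :, :]
    if random.random() < 0.5:                        # rotate 90 degrees
        lr, hr = np.rot90(lr), np.rot90(hr)
    return np.ascontiguousarray(lr), np.ascontiguousarray(hr)
```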

urbaneman commented 5 years ago

OK, got it! Thanks again.