Training problem - Githubissues

aiff22 / PyNET-PyTorch

Generating RGB photos from RAW image files with PyNET (PyTorch)

http://www.vision.ee.ethz.ch/~ihnatova/pynet.html

Training problem #18

phexic closed this issue 4 years ago

phexic commented 4 years ago

I ran into a problem where PSNR and MSE stop changing during level 3 training.

The following parameters will be applied for CNN training:
Training level: 3
Batch size: 4
Learning rate: 5e-05
Training epochs: 17
Restore epoch: 7
CUDA visible devices: 1
CUDA Device Name: Tesla T4
Epoch 0, mse: 0.0751, psnr: 11.9902, vgg: 0.5996
Epoch 1, mse: 0.3079, psnr: 6.6111, vgg: 0.5348
Epoch 2, mse: 0.3079, psnr: 6.6111, vgg: 0.5348
Epoch 3, mse: 0.3079, psnr: 6.6111, vgg: 0.5348
Epoch 4, mse: 0.3079, psnr: 6.6111, vgg: 0.5348
Epoch 5, mse: 0.3079, psnr: 6.6111, vgg: 0.5348
Epoch 6, mse: 0.3079, psnr: 6.6111, vgg: 0.5348
Epoch 7, mse: 0.3079, psnr: 6.6111, vgg: 0.5348
Epoch 8, mse: 0.3079, psnr: 6.6111, vgg: 0.5348
Epoch 9, mse: 0.3079, psnr: 6.6111, vgg: 0.5348

The visual results are also completely black. How could this happen?
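One possible reading of the frozen metrics and black output described above: if the network output collapses to all zeros, the MSE reduces to the mean squared intensity of the target images, which is the same in every epoch. The sketch below illustrates this, assuming pixel values normalized to [0, 1] and the standard relation PSNR = 10 * log10(1 / MSE). The pixel values and helper names are hypothetical, and the repository's own metric code is not shown in this thread.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length sequences of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def psnr_from_mse(mse_value, max_value=1.0):
    """PSNR in dB for a signal whose peak value is max_value."""
    if mse_value == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse_value)

# Hypothetical target pixels in [0, 1]; not taken from the actual dataset.
target = [0.2, 0.5, 0.8, 0.6]
black = [0.0] * len(target)  # an all-black prediction

# With an all-zero prediction, MSE equals the mean of the squared targets,
# so it cannot change from epoch to epoch while the output stays black.
collapsed_mse = mse(black, target)
mean_sq_target = sum(t * t for t in target) / len(target)

print(collapsed_mse, mean_sq_target, psnr_from_mse(collapsed_mse))
```

Note that a logged PSNR that is averaged per batch or per image will not in general equal the PSNR computed from the averaged MSE, so the two columns of a training log need not satisfy this relation exactly.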

aiff22 commented 4 years ago

Hi @phexic,

Based on the logs, it seems that your model is diverging, most likely because of the very small batch size. You can try to increase it by reducing the size of the training patches.

phexic commented 4 years ago

Thank you for your prompt reply! I did try a batch size of 24 on 2 V100 GPUs (32 GB), but the same problem occurred: MSE (0.3079) did not change and PSNR remained at 6.6111. Also, during training epoch 1, MSE suddenly increases and the training output suddenly turns completely black (starting at around batch index 700+). Printing the hidden-layer gradients suggests that the gradients explode (NaN). I have also checked my custom training data: the MSE of every training pair is below 0.07, which is within the normal range. What do you think might be the cause? Btw, line 58 in load_data.py:

dslr_image = np.float32(misc.imresize(dslr_image, self.scale / 2.0)) / 255.0

Is the self.scale / 2.0 factor unnecessary here, given that the network in the paper "Replacing Mobile Camera ISP with a Single Deep Learning Model" has only 4 downsampling layers, unlike the one in "Bokeh.."?
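Regarding the NaN gradients mentioned earlier in this comment, a common mitigation is to clip the global gradient norm and skip any update whose gradients are not finite. The following is a framework-independent sketch of that idea, not the repository's training code: gradients are modeled as plain lists of floats, and the names global_grad_norm, clip_or_skip and max_norm are hypothetical. Whether this would resolve the divergence reported here is not established in this thread.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient values, treating every tensor as one flat vector."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def clip_or_skip(grads, max_norm):
    """Return (clipped_grads, norm), or (None, norm) if any gradient is NaN or Inf."""
    norm = global_grad_norm(grads)
    if not math.isfinite(norm):
        # A NaN or Inf anywhere makes the norm non-finite: signal "skip this step".
        return None, norm
    # Scale all gradients down uniformly when the norm exceeds max_norm.
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [[g * scale for g in tensor] for tensor in grads], norm

# Hypothetical gradients for two parameter tensors.
clipped, norm = clip_or_skip([[3.0, 4.0], [0.0]], max_norm=1.0)
skipped, bad_norm = clip_or_skip([[float("nan"), 1.0]], max_norm=1.0)
print(norm, clipped, skipped)
```

In a PyTorch training loop, the clipping part is available as torch.nn.utils.clip_grad_norm_, typically called between loss.backward() and optimizer.step(); logging the norm it returns also shows at which batch the gradients start to grow.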

tkopetz commented 2 years ago

Hi @phexic, did you fix this problem? I got similar results:

Epoch 0, mse: 0.2329, psnr: 6.3279, ms-ssim: 0.5204
Epoch 1, mse: 0.0000, psnr: 54.1944, ms-ssim: 0.9943
Epoch 2, mse: 0.0000, psnr: 54.1944, ms-ssim: 0.9943
Epoch 3, mse: 0.0000, psnr: 54.1944, ms-ssim: 0.9943
Epoch 4, mse: 0.0000, psnr: 54.1944, ms-ssim: 0.9943
Epoch 5, mse: 0.0000, psnr: 54.1944, ms-ssim: 0.9943
Epoch 6, mse: 0.0000, psnr: 54.1944, ms-ssim: 0.9943