JiahuiYu / wdsr_ntire2018

Code of our winning entry to NTIRE super-resolution challenge, CVPR 2018
http://www.vision.ee.ethz.ch/ntire18/

What's your WDSR Baseline's configuration (in Table 1)? #12

Closed splinter21 closed 5 years ago

splinter21 commented 5 years ago

What's your WDSR Baseline's configuration in Table 1? n_res_blocks? n_feats? WDSR-A or WDSR-B? If the configuration is not the same, I don't know what causes the better performance...

Also, in Table 2, I don't know whether the better performance comes from the difference in blocks or from the global residual pathway and upsampler.

JiahuiYu commented 5 years ago

@splinter21 In Table 1, we use n_res_blocks=16, n_feats=32, n_block_feats=128, i.e. WDSR-A with 16 residual blocks (WDSR-B is more efficient, especially with fewer residual blocks), compared with EDSR with n_feats=64 and 16 residual blocks. They all have similar parameters/computation.

The same applies to Table 2: the better performance is NOT caused by a different number of blocks, since the models have the same number, as indicated in the table row "Number of Residual Blocks".

So the performance gain comes either from the difference in block design or from the global residual pathway/upsampler. The factsheet may help you understand where the gain comes from.
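For illustration, here is a minimal sketch of the two block styles being compared (my own paraphrase, not the repository code; EDSRBlock and WDSRABlock are hypothetical names). With two 3x3 convolutions per block, WDSR-A at n_feats=32 / n_block_feats=128 has the same per-block weight count as EDSR at n_feats=64, since 32*128 = 64*64:

```python
import torch.nn as nn

class EDSRBlock(nn.Module):
    """EDSR-style residual block: conv -> ReLU -> conv at constant width."""
    def __init__(self, n_feats=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class WDSRABlock(nn.Module):
    """WDSR-A-style block: widen features before the ReLU, then contract."""
    def __init__(self, n_feats=32, n_block_feats=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_block_feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_block_feats, n_feats, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

def n_weights(m):
    # Count conv weights only; the bias counts differ slightly between the two.
    return sum(p.numel() for n, p in m.named_parameters() if n.endswith('weight'))

print(n_weights(EDSRBlock(64)))        # 2 * 3*3*64*64  = 73728
print(n_weights(WDSRABlock(32, 128)))  # 2 * 3*3*32*128 = 73728
```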

splinter21 commented 5 years ago

Thank you, I see. 32x128 = 64x64, so the parameters and computation are the same, and the number of blocks is also the same as the EDSR baseline. I think you could run an experiment that only changes the residual blocks to the WDSR-A version, without the new global residual pathway and with the same upsampler. If the performance is 0.1+ PSNR higher, that confirms the WDSR-A blocks are better than the EDSR blocks at the baseline setting; otherwise, the gain may come from the changes to the global residual pathway and upsampler.

JiahuiYu commented 5 years ago

@splinter21 I got your point now. We certainly did run many experiments covering the case you describe, and other cases you have not mentioned. The result is: if we use the same global pathway and/or up-sampler, the performance gain is even more than 0.1 PSNR in many cases.

Please give it a try if it seems counterintuitive to you. The original up-sampler has many more parameters than ours, and it does help performance a little; we removed it to keep the model efficient.

Thanks for your interest in our work. I can tell you have read our report carefully and tried to understand the reasons behind it.

splinter21 commented 5 years ago

I changed the EDSR blocks to WDSR-A blocks (n_feats=96, expanding to 96x4) with n_blocks=32 in the EDSR network, without changing any other module, and compared it with EDSR at n_feats=192 and n_blocks=32. It got a better result (even better than EDSR at the paper's setting of n_feats=256 with n_blocks=32)! So it confirms that WDSR-A also works in a very deep network! Fantastic work!
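(As a quick sanity check of that pairing, under the same two-3x3-convs-per-block assumption as in the sketch above, 96 features expanding 4x matches 192 features in per-block weights:)

```python
# Per-block conv weight counts, assuming two 3x3 convs per residual block.
wdsr_a_96x4 = 2 * 3 * 3 * 96 * (96 * 4)   # expand 96 -> 384, contract back
edsr_192    = 2 * 3 * 3 * 192 * 192
assert wdsr_a_96x4 == edsr_192            # both 663,552
```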

splinter21 commented 5 years ago

But it is only a little better, by less than 0.1 PSNR... Maybe I can't say which is better; it is within the margin of error. The gain is much larger when the network is not so deep.

splinter21 commented 5 years ago

Another conclusion is that n_feats=128 is enough when n_blocks is less than 32: the difference is only about 0.05 PSNR compared with an EDSR-block-style network with the same remaining settings.

JiahuiYu commented 5 years ago

Weight normalization is also an important component of our work. Please give it a try if you are interested, and remember to use a 10x higher learning rate when using weight normalization.
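A minimal sketch of attaching weight normalization to a convolution with torch.nn.utils.weight_norm; the 1e-4 baseline learning rate below is an assumed example, not a value from this thread:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# Wrap each convolution with weight normalization.
conv = weight_norm(nn.Conv2d(32, 128, 3, padding=1))

# If the same model without weight norm trains at lr=1e-4 (an assumed
# baseline), use a 10x higher rate once weight norm is attached.
optimizer = torch.optim.Adam(conv.parameters(), lr=1e-3)
```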

splinter21 commented 5 years ago

I have tried weight normalization with the 10x learning rate. I found it works better in a more "baseline" model: convergence is faster, and the best PSNR at convergence is a little higher. But with the 32-256 option, the best PSNR at convergence is almost the same as the model without weight normalization.

splinter21 commented 5 years ago

What does this line in your code mean?

torch.nn.ReplicationPad2d(5//2) (when testing)

JiahuiYu commented 5 years ago

Because we do not pad the last convolution, we need to pad at test time. At training time, please also crop the HR patch to the same size as the network output.

https://github.com/JiahuiYu/wdsr_ntire2018/blob/master/wdsr_a.py#L61-L67
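A rough sketch of the shape arithmetic, ignoring the super-resolution scale factor:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 48, 48)

# An unpadded 5x5 conv shrinks each spatial dim by 4 (2 pixels per side).
last_conv = nn.Conv2d(3, 3, 5, padding=0)
print(last_conv(x).shape)           # torch.Size([1, 3, 44, 44])

# Test time: replication-pad the input first, so output size matches input.
pad = nn.ReplicationPad2d(5 // 2)   # pad 2 pixels on every side
print(last_conv(pad(x)).shape)      # torch.Size([1, 3, 48, 48])

# Train time alternative: leave the input unpadded and crop the HR target
# by the same 2-pixel border so the loss compares equal-sized tensors.
hr = torch.randn(1, 3, 48, 48)
hr_cropped = hr[..., 2:-2, 2:-2]    # 44x44, matches the unpadded output
```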

JiahuiYu commented 5 years ago

We have removed explicit padding.