Closed: xiayizhan2017 closed this issue 5 years ago
@xiayizhan2017 For 1: I think otherwise the number of parameters in the network would become unnecessarily large, and thus harder to train. For 2: I do not fully understand the question, but nn.ConvTranspose2d() is a standard way of doing de-convolution in an encoder-decoder network.
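As a minimal sketch of what that layer looks like (the channel counts and kernel size here are illustrative, not necessarily the repo's actual configuration), a learnable 2x upsampling of a 2-channel flow field via transposed convolution:

```python
import torch
import torch.nn as nn

# Learnable 2x "deconvolution" upsampling of a 2-channel flow field.
# kernel_size=4, stride=2, padding=1 is a common choice that exactly
# doubles the spatial resolution; the weights are trained end to end.
deconv = nn.ConvTranspose2d(in_channels=2, out_channels=2,
                            kernel_size=4, stride=2, padding=1)

flow = torch.randn(1, 2, 32, 32)   # coarse flow: (batch, 2, H, W)
up_flow = deconv(flow)
print(up_flow.shape)               # torch.Size([1, 2, 64, 64])
```

The output size follows (H - 1) * stride - 2 * padding + kernel_size = 31 * 2 - 2 + 4 = 64.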
As for 2, different types of upsampling can be implemented using transposed convolutions, one just needs to initialize the kernel appropriately and fix its weights. For a possible kernel that performs bilinear upsampling, you can have a look at: http://caffe.berkeleyvision.org/doxygen/classcaffe_1_1BilinearFiller.html
For 2, we tested both and found that with fixed bilinear upsampling, training converges faster early on. However, given enough iterations, ConvTranspose2d catches up and ends up slightly better. My guess is that making the weights learnable yields a slightly higher-capacity model.
This is very elegant work. I have some questions and hope you can help. 1. The highest-resolution features used in the code are at 1/4 scale; why not use features at the original resolution? 2. Why does the optical-flow upsampling use nn.ConvTranspose2d rather than bilinear interpolation? Since that layer has learnable weights, won't passing the flow through it change the flow values?
Thanks!