YapengTian / TDAN-VSR-CVPR-2020

TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution, CVPR 2020
MIT License

Questions about training #32

Closed · mosquitobite closed this issue 3 years ago

mosquitobite commented 3 years ago

Hi!

I've been working on another low-level vision task and also trying to use deformable convolution, yet I cannot get good results. I think a large portion of the performance improvement in your paper comes from deformable convolution, so I am curious how you made it work: the deformable alignment module design, training details, etc.

I also noticed that the weight initialization you use for the ConvOffset2d class is a uniform distribution, which differs from the original paper (which uses all-zero initialization). Why? Does initialization matter here, and have you tried other initialization methods?

Actually, I found that others online are running into the same problems, so it would be great if you could share some details.

Thanks a lot!

YapengTian commented 3 years ago

Hi!

Since you did not mention which project you are working on, I do not know how you used deformable convolution. First, I want to note that deformable convolution might not be helpful for all low-level vision problems, especially single-image restoration. Here, we use deformable convolution to perform temporal alignment so that temporal information can be exploited. It has been demonstrated that deformable alignment helps several video restoration problems as well as reference-based image super-resolution.

For the network design, one important trick is to use more deformable groups (we use 8). More deformable groups let the network explore more sampling positions, enlarging the model capacity; the same trick is adopted by EDVR. The deformable conv-based model can be trained end-to-end, so the network is fairly easy to train. One thing to watch is the learning rate: an overly large lr can make learning unstable.
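A minimal sketch of this alignment design, assuming `torchvision.ops.DeformConv2d` rather than the ConvOffset2d implementation used in this repo; channel sizes and module names are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformAlign(nn.Module):
    """Aligns supporting-frame features to the reference frame."""
    def __init__(self, channels=64, deformable_groups=8, k=3):
        super().__init__()
        # Offsets are predicted from the concatenated supporting and
        # reference features: 2 coordinates (x, y) per sampling point,
        # per deformable group.
        self.offset_conv = nn.Conv2d(
            channels * 2, 2 * deformable_groups * k * k, k, padding=k // 2)
        # torchvision infers the number of offset groups from the
        # channel count of the offset tensor passed to forward().
        self.deform_conv = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, feat_supp, feat_ref):
        offset = self.offset_conv(torch.cat([feat_supp, feat_ref], dim=1))
        # Resample the supporting features at the learned positions so
        # they align with the reference frame.
        return self.deform_conv(feat_supp, offset)

# Usage: feat_supp and feat_ref are (B, 64, H, W) feature maps.
# aligned = DeformAlign()(feat_supp, feat_ref)
```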

I used a public implementation of deformable conv and did not explore different initialization methods. My intuition is that neighboring pixels usually share the same context, so first, uniform initialization will not weaken the model compared to zero initialization, and second, nonzero initialization pushes the model to learn deformations from the start.
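For concreteness, a hypothetical sketch of the two schemes discussed here, applied to the conv that predicts offsets (the function name and the uniform range are my own guesses, not the repo's exact values):

```python
import torch.nn as nn

def init_offset_weights(conv: nn.Conv2d, scheme: str = "uniform") -> None:
    if scheme == "zero":
        # Original-paper style: zero weights mean zero offsets at the
        # start, so the layer initially behaves like a regular conv.
        nn.init.zeros_(conv.weight)
    else:
        # Uniform style: small nonzero offsets from the first step,
        # which pushes the model to learn deformations.
        nn.init.uniform_(conv.weight, -1e-4, 1e-4)
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)
```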

I hope these answers address your problems. If you have other questions, please let me know. Thanks!

mosquitobite commented 3 years ago

Thanks!

As you mentioned, deformable convolution seems helpful for deformable alignment, where the network input is two or more images. So do you mean single-image vision tasks can't benefit from deformable convolution? Could you give some explanation?

To my understanding, deformable convolution can help learn object shapes or edges within an image, so it may help any low-level single-image vision task that cares about edges, such as single-image restoration.

YapengTian commented 3 years ago

For video or reference-based restoration, deformable conv is used to sample the supporting frames so that they align with the reference frame, using offsets generated from frame_supp + frame_ref (as in the sketch above). That is why we use deformable conv rather than a regular conv: a regular conv has no capacity to handle the alignment.

For a single-image task, deformable conv is used much like a regular convolution layer (if there is another way, please correct me). So one benefit of deformable conv is that it can probably use a larger receptive field. But it might not really help, since stacked deep 2D conv models can already exploit local contexts well.
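To make the contrast concrete, a sketch of the single-image usage (again assuming `torchvision.ops.DeformConv2d`, one offset group): the offsets are predicted from the input feature itself, so the layer is essentially a regular conv with a content-dependent sampling grid:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class SelfDeformConv(nn.Module):
    """Deformable conv whose offsets come from the input itself."""
    def __init__(self, channels=64, k=3):
        super().__init__()
        self.offset_conv = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.deform_conv = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, x):
        # With no second frame to align against, the learned offsets
        # only reshape the receptive field around each pixel.
        return self.deform_conv(x, self.offset_conv(x))
```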

You mention that "deformable convolution can help learn object shapes or edges within the image, so it may help any low-level single image vision tasks which care about edges, such as single image restoration". However, that conclusion comes from research on object detection and semantic segmentation, which might not be applicable to pixel-wise image restoration. Our TDAN paper (CVPR 2020 version) shows that the sampling usually explores visual regions containing content similar to the target pixels, rather than spanning the whole object shape or just the edges as in high-level recognition tasks. Why can they reach that conclusion? (1) They use models pre-trained on ImageNet, which capture high-level semantics; (2) theirs are high-level recognition problems, in which object structure is an important clue, so their training losses will push the dconv to capture objects. But we do pixel-wise reconstruction without pre-trained visual models, and I do not think an L2/L1 reconstruction loss will force the model to capture the objects inside images. These are just my thoughts based on my experience.

mosquitobite commented 3 years ago

Thanks for sharing! That really enlightened me!

YapengTian commented 3 years ago

No problem!