XinyiYing / D3Dnet

Repository for "Deformable 3D Convolution for Video Super-Resolution", SPL, 2020
Apache License 2.0

Why was 3D deformable conv performed along "HW" instead of "THW" in your model? #10

Open · jianpengz opened this issue 3 years ago

jianpengz commented 3 years ago

Thanks for the repo. I wonder why you employed 3D deformable convolution along only the "HW" dimensions (`self.dcn0 = DeformConvPack_d(nf, nf, kernel_size=3, stride=1, padding=1, dimension='HW')` in `model.py`) instead of "THW" in your model?

XinyiYing commented 3 years ago

Thanks for this comment. We conducted several experiments to investigate the influence of deformation in the temporal dimension. Specifically, we relaxed the temporal deformation constraint to allow deformations in all three dimensions and retrained D3Dnet with temporal deformation (i.e., D3Dnet-T) from scratch. The training settings were kept identical to those in our paper. The quantitative results (PSNR/SSIM and MOVIE/T-MOVIE) are reported in Tables 1 and 2, respectively.
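For context, the only architectural change needed for this variant is the `dimension` argument of the deformable convolution. A minimal sketch, assuming the repo's `DeformConvPack_d` accepts `dimension='THW'` (the import path below may need adjusting to this repo's actual dcn module):

```python
from dcn.modules.deform_conv import DeformConvPack_d  # adjust to the repo's actual module path

nf = 64  # number of feature channels, as in model.py

# D3Dnet: offsets are learned along H and W only (temporal deformation constrained)
dcn_hw = DeformConvPack_d(nf, nf, kernel_size=3, stride=1, padding=1, dimension='HW')

# D3Dnet-T (assumed): relax the constraint so offsets are learned along T, H, and W
dcn_thw = DeformConvPack_d(nf, nf, kernel_size=3, stride=1, padding=1, dimension='THW')
```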

As illustrated in Tables 1 and 2, the SR performance and temporal consistency of D3Dnet-T and D3Dnet are comparable. The performance gap is minor (e.g., 0.01/0.001 in PSNR/SSIM and 0.10/0.02 in MOVIE/T-MOVIE). This is because deformation in the temporal dimension can be considered a frame selection operation. Since the temporal sliding-window scheme used in D3Dnet already incorporates the temporal prior (i.e., frames temporally closer to the reference frame are more important [R1], [R2]), frame selection along the temporal dimension cannot introduce a significant performance improvement.

We also conducted additional experiments to investigate the temporal offsets in D3Dnet-T both quantitatively and qualitatively. Specifically, we first summed the temporal offsets of each D3D layer in D3Dnet-T to generate the global temporal offset ∈ R^(27×T×H×W), and then calculated the mean and standard deviation along its channel dimension to generate T mean offsets ∈ R^(H×W) and T standard offsets ∈ R^(H×W). Here, T=7 is the number of input frames. We chose “Calendar 12” and “Walk 12” in the Vid4 dataset as reference frames to investigate the influence of motion degree (small motion in “Calendar 9-15” and large motion in “Walk 9-15”). The mean and standard offsets are shown in Fig. I. We also list the mean and standard values of each mean offset and standard offset in Tables 3 and 4.
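A minimal sketch of this statistic computation, assuming the per-layer temporal offsets have already been extracted (e.g., via forward hooks, omitted here) as tensors of shape (27, T, H, W); all tensors below are placeholders:

```python
import torch

# Placeholder: temporal offsets of each D3D layer in D3Dnet-T, shape (27, T, H, W),
# where 27 = 3x3x3 kernel sampling positions and T = 7 input frames
offsets_per_layer = [torch.randn(27, 7, 64, 64) for _ in range(5)]

# Sum over layers to obtain the global temporal offset of shape (27, T, H, W)
global_offset = torch.stack(offsets_per_layer).sum(dim=0)

# Mean and standard deviation over the 27 kernel positions (the channel dimension),
# yielding T mean-offset maps and T standard-offset maps of shape (H, W)
mean_offset = global_offset.mean(dim=0)  # (T, H, W)
std_offset = global_offset.std(dim=0)    # (T, H, W)

# Scalar summaries per frame, analogous to the values in Tables 3 and 4
print(mean_offset.flatten(1).mean(dim=1))  # mean value of each mean offset
print(std_offset.flatten(1).mean(dim=1))   # mean value of each standard offset
```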

As illustrated in Fig. I, the mean and standard offsets in both “Calendar” and “Walk” are small (lower than 1 and 3, respectively), which means that the centroid shift of each 3×3×3 convolution kernel is small and the sampling positions of each kernel remain densely distributed. That is, D3Dnet-T behaves like D3Dnet by estimating temporal offsets close to 0. The values listed in Tables 3 and 4 also support this observation. Note that the differences in mean and standard values between “Calendar” and “Walk” are small, which demonstrates that the motion degree has little impact on the temporal offsets.

In addition, we compared the computational efficiency (number of parameters and FLOPs) of D3Dnet-T and D3Dnet. The results are shown in Table 5.
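As a rough way to reproduce such a comparison, parameters can be counted directly and FLOPs estimated with a profiler; the sketch below uses `thop` as an assumed tooling choice (it reports multiply-accumulate operations, which are commonly quoted as FLOPs), not necessarily what was used for Table 5:

```python
import torch
from thop import profile  # pip install thop; an assumed profiling tool

def count_params(model: torch.nn.Module) -> int:
    # Total number of trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical usage with the two variants (model construction omitted):
# model = D3Dnet(...)                 # or the D3Dnet-T variant
# x = torch.randn(1, 1, 7, 32, 32)    # dummy (batch, channel, T, H, W) input
# macs, _ = profile(model, inputs=(x,))
# print(count_params(model), macs)
```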

As illustrated in Table 5, the number of parameters and the FLOPs of D3Dnet-T are 1.18 and 1.46 times those of D3Dnet, respectively. In summary, deformation in the temporal dimension does not introduce a significant performance improvement but reduces computational efficiency. Consequently, we decided not to perform temporal deformation in D3Dnet.

[R1] T. Isobe, S. Li, X. Jia, S. Yuan, G. Slabaugh, C. Xu, Y.-L. Li, S. Wang, and Q. Tian, “Video super-resolution with temporal group attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.

[R2] X. Wang, K. C. K. Chan, K. Yu, C. Dong, and C. C. Loy, “EDVR: Video restoration with enhanced deformable convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.

jianpengz commented 3 years ago

Thanks for your detailed reply.

SQMah commented 6 months ago

Not to reopen an old discussion, but would you have the code for D3Dnet-T? It may not be useful for video super-resolution, but I think it would be useful for 3D image segmentation, for example in medical image segmentation.