5ofwind / RDVR

[2024 TCSVT] Video Rescaling with Recurrent Diffusion
Apache License 2.0
5 stars 0 forks source link

How to enable backpropagation for H265 #1

Closed hahazh closed 4 months ago

hahazh commented 4 months ago

Hello, excellent work! I have a small question. In your paper, under the section "EXPERIMENTS D Application to Video Compression," you mention, "At the second stage, all the components in Fig. 12 are jointly trained for 50000 iterations (about 12.4 epochs)." Since H.265 is not differentiable, I would like to know how you managed to implement joint optimization with backpropagation. Did you directly adopt the method from SelfC? Additionally, now that the paper has been accepted, when can the code be made open source?

5ofwind commented 4 months ago

Hello. Yes, we follow the settings in SelfC. We adopt the surrogate network and codec loss ($L_{codec}$) from SelfC (to approximate H.265), and use PyTorch's automatic differentiation during training. In our experiments, at the second stage we employ the surrogate network for the first 25000 iterations. We find that removing the surrogate network and codec loss for the next 25000 iterations slightly improves performance.
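Since H.265 itself is not differentiable, the surrogate network stands in for the codec during backpropagation. A minimal PyTorch sketch of the idea (the `SurrogateCodec` architecture, layer sizes, and loss below are illustrative assumptions, not the actual SelfC or RDVR code):

```python
import torch
import torch.nn as nn

class SurrogateCodec(nn.Module):
    """Small CNN trained to mimic H.265 compression artifacts, so that
    gradients can flow "through the codec" during end-to-end training."""
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, x):
        # Predict the codec's residual on top of the input frame.
        return x + self.body(x)

def codec_loss(surrogate, lr_frames, h265_decoded):
    """L_codec: push the surrogate's output toward the frames actually
    decoded by H.265 (which carry no gradient, hence the detach)."""
    return nn.functional.l1_loss(surrogate(lr_frames), h265_decoded.detach())
```

During Stage 2's first 25000 iterations the downstream losses would be computed on `surrogate(lr_frames)`; afterwards the surrogate and `codec_loss` are dropped, as described above.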

We are now carefully checking the code. The testing code, training code and trained models will be released as soon as possible.

5ofwind commented 4 months ago

Hi hahazh. We have uploaded our codes.

Inspired by your comments, we ran many experiments in recent days and found that removing the codec loss for the surrogate network in SelfC gives almost the same or better performance. In our final code, we set the coefficient of the codec loss for the surrogate network to zero. Perhaps connecting the outputs of the surrogate network to the decoding system would activate the surrogate network.

The whole system can still be trained because the loss on the low-resolution frames supervises the downsampling network. What's more, the parameters of the downsampling network and the pre-upsampling network are shared: when the parameters of the pre-upsampling network change, the parameters of the downsampling network change with them.
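One possible way to realize the parameter sharing described here (a hypothetical sketch; the module names, and the choice of sharing a common feature extractor between the two paths, are assumptions for illustration):

```python
import torch
import torch.nn as nn

class SharedRescaler(nn.Module):
    """The downsampling path and the pre-upsampling path reference the same
    convolution, so a gradient reaching either path updates both."""
    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.shared = nn.Conv2d(channels, hidden, 3, padding=1)   # shared weights
        self.down_head = nn.Conv2d(hidden, channels, 3, stride=2, padding=1)
        self.up_head = nn.ConvTranspose2d(hidden, channels, 4, stride=2, padding=1)

    def downsample(self, hr):
        return self.down_head(torch.relu(self.shared(hr)))

    def pre_upsample(self, lr):
        return self.up_head(torch.relu(self.shared(lr)))
```

Because `self.shared` appears in both paths, the low-resolution supervision on `downsample` also moves the parameters used by `pre_upsample`.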

For a fast implementation, we use only 50000 iterations for Stage 2. The performance may be further improved with more iterations.

Best regards.

hahazh commented 4 months ago

Dear Dingyi: Thank you for your timely and informative response. I have a few more questions I would like to discuss with you:

1. Why did you divide stage 2 into two sub-stages? From my understanding, the results of the first 25,000 iterations are for bicubically downsampled images (without quantization noise), while the latter 25,000 iterations consider quantization noise. Is this a progressive learning strategy, going from easy to difficult?

2. Why did you use different downsampling methods for the low-resolution supervision in stage 1 and stage 2? (Stage 1 used BD, and the first 25,000 iterations of stage 2 used BI.) Does using different downsampling methods to supervise the downsampling network's output have a significant impact on the SR performance? We know that for video super-resolution tasks, the results of BD downsampling are often better than those of BI downsampling (perhaps because the Gaussian-kernel downsampling has wider coverage and implicitly embeds more information helpful for restoration). Does this conclusion still hold for video rescaling?

3. Neither the $L_{codec}$ in SelfC nor your current approach explicitly considers the bitrate, while end-to-end compression has a Rate term as a bitrate constraint, such as the entropy model in cheng2020 [1]. Additionally, some image rescaling works for compression [2,3] have also introduced similar explicit bitrate loss terms. Do you think the bitrate constraint is necessary for the video rescaling task specifically? Based on your experimental results, if encoding is merely a sub-component of the video rescaling task, i.e., it exists to better adapt to the compression artifacts in real-world applications, is the explicit bitrate constraint unnecessary?
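For reference, the BD/BI distinction in question 2 can be sketched as follows (a minimal PyTorch sketch; the kernel size and sigma are assumed values for illustration, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=7, sigma=1.6):
    # Separable 2D Gaussian kernel, normalized to sum to 1.
    coords = torch.arange(size, dtype=torch.float32) - size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    return g[:, None] * g[None, :]

def bd_downsample(x, scale=2, sigma=1.6):
    """BD: Gaussian blur, then direct subsampling."""
    c = x.size(1)
    k = gaussian_kernel(sigma=sigma)[None, None].repeat(c, 1, 1, 1).to(x)
    blurred = F.conv2d(F.pad(x, [3, 3, 3, 3], mode='reflect'), k, groups=c)
    return blurred[..., ::scale, ::scale]

def bi_downsample(x, scale=2):
    """BI: bicubic interpolation."""
    return F.interpolate(x, scale_factor=1 / scale, mode='bicubic',
                         align_corners=False)
```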

My questions may be a bit abrupt, and some require further experimentation to verify. I am also working on video SR, video frame interpolation, and end-to-end compression-related projects, and I will continue to explore the above issues. I would be very pleased to have more discussions with you, and thank you again for your help!

References:
[1] Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules
[2] Video Compression Based on Jointly Learned Down-Sampling and Super-Resolution Networks
[3] An Efficient Content-aware Downsampling-based Video Compression Framework

Best regards.

5ofwind commented 4 months ago

Hello. We have checked and modified the descriptions in README.md. Please see the revised contents. We will answer your questions in detail later. Thank you for sharing the valuable references with us.

RDVR-H265: Combining our basic RDVR with H.265 video compression. Training has two stages. In Stage 1 we train a basic RDVR for scale factor 2 with BD downsampling for 250000 iterations. Stage 2 runs 50000 iterations with H.265 video compression, also using BD downsampling. We feed the BD-downsampled frames to the H.265 encoder in the first 25000 iterations (Step 1) and the outputs of the downsampling network in the next 25000 iterations (Step 2). We find that this two-step approach at Stage 2 yields a slight improvement in MS-SSIM (about 0.0002 on average), compared with removing Step 1 and running Step 2 for 50000 iterations. Note that our final model is "RDVR-H265-Stage2-Step2".
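The Stage-2 schedule described here reduces to a switch on the iteration counter; a trivial sketch (the function name and the `switch_at` default are assumptions):

```python
def h265_encoder_input(iteration, bd_frames, learned_lr_frames, switch_at=25000):
    """Step 1 (< switch_at): feed BD-downsampled frames to the H.265 encoder.
    Step 2 (>= switch_at): feed the downsampling network's own outputs."""
    return bd_frames if iteration < switch_at else learned_lr_frames
```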

5ofwind commented 4 months ago

Hi. As you can see in README.md, I have revised the descriptions of the details.


Here are my answers to your questions.

  1. We find that the two-step approach at Stage 2 leads to a small improvement in terms of MS-SSIM (0.0002 on average), compared with removing Step 1 and running Step 2 for 50000 iterations. Perhaps the reason is that Step 1 serves as good pre-training that reduces the learning difficulty of Step 2. In our final code we use the two-step strategy at Stage 2.

  2. In our experiments, we adopt bicubic (BI) downsampling for RDVR, RDVR+ and RDVR++ for a fair comparison with MIMO-VRN and CLSA, and consistently employ BD downsampling for RDVR + H.265 for a fair comparison with SelfC + H.265. In video super-resolution, the PSNR and SSIM values for BD downsampling are usually higher than those for BI downsampling. We find that in our experiments with RDVR and RDVR + H.265, this phenomenon also occurs, at least for the scale factor of 2.

  3. We didn't introduce any bitrate loss in our previous experiments, so we don't know how a bitrate loss would affect the performance of video rescaling + H.265. I guess it could improve performance, but the increase may be small. Since the low-resolution frames are forced to be similar to interpolation-based downsampled images, the bpp values of the H.265-encoded video may fall within a small range for different models under the same scale factor. Note that the H.265 encoding procedure is fixed, whereas the parameters of learning-based video compression models can be adjusted by a bitrate loss. We are also curious about the real effect of a bitrate loss on video rescaling + video compression. We previously read the HyperThumbnail paper, which uses a bitrate loss, but we did not find an ablation study on it. In the future, once you find or obtain some experimental results, please share them with us if possible.
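For context, the explicit rate term that end-to-end codecs add (question 3 above) is typically a differentiable bits-per-pixel estimate from an entropy model; a hypothetical helper, not part of RDVR:

```python
import torch

def bpp_loss(likelihoods, num_pixels):
    """Expected bits per pixel from an entropy model's per-symbol likelihoods.
    End-to-end compression minimizes a weighted sum of this rate term and a
    distortion term; with a fixed H.265 encoder no such gradient path exists."""
    return torch.sum(-torch.log2(likelihoods)) / num_pixels
```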

The experiments on video compression were difficult and took about 40 days. I hope that the performance of video rescaling + video compression can be further improved in the future by tuning parameters or designing new methods.

I'm glad to talk with you. You can contact me here or by e-mail: lidingyi@njust.edu.cn. I hope that we can share information, learn from each other and improve ourselves.

Best regards.

hahazh commented 4 months ago

Thank you for your detailed response. I am also happy to keep in touch with you and share relevant experimental information at any time.