Open · RaymondWang987 opened this issue 2 years ago
- We have not tried our method on depth/semantic prediction tasks, but it is easy to implement; you can try it.
- Mode 2 requires manually adjusting the weights, which is inconvenient, and inappropriate weights will lead to worse results.
- This is a regularization; it is not affected by the network structure.
- For a position that is already consistent, there is no need to penalize it with a large loss value caused by nearby flickering pixels/positions.
- They are hyperparameters; you can change them.
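As I understand the reply above, the loss is a per-position regularization that encourages the change between two output frames to follow the change between the corresponding reference frames, so positions that are already consistent contribute almost nothing to the penalty. A minimal sketch of that idea (my reading only, not the repository's actual `temporal_loss_mode` implementation; all names are hypothetical):

```python
import torch
import torch.nn.functional as F

def temporal_change_loss(out_t, out_tp, ref_t, ref_tp):
    """Hypothetical sketch: match the change between two output frames to the
    change between the corresponding reference frames (e.g. RGB inputs).
    The penalty is computed per position, so already-consistent positions
    contribute roughly zero regardless of flicker elsewhere in the image."""
    out_change = out_t - out_tp    # how the outputs change between frames
    ref_change = ref_t - ref_tp    # how the reference frames change
    return F.l1_loss(out_change, ref_change)  # per-pixel L1, averaged
```

A weighted ("mode 2"-style) variant would presumably scale this term by a manually chosen weight, which is where the tuning burden mentioned above comes from.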
Thank you very much for your advice, I will try it.
I have tried your temporal loss on video depth estimation. I apply the loss between frames i and i+3, using the model prediction and the depth ground truth, and I do not change any other parameters. In my case the loss does not work well: it has no obvious effect on training, i.e., the model's temporal consistency does not clearly get better or worse after adding the loss. I can think of two possible reasons:

1) The hyperparameters from your task are not suitable for depth estimation.
2) Your temporal loss relies heavily on temporally consistent ground truth (the naturally consistent RGB video frames in your case). In other tasks such as video depth estimation and video semantic segmentation, however, the ground truth itself flickers and is inconsistent to some extent. For example, you can observe obvious flickering in the ground truth of the NYU Depth V2 dataset (such as bathroom0030 or basement0001a/b/c). In that case the loss cannot work well with inconsistent ground truth: you cannot use a flickering (gt_i - gt_{i+1}) to obtain a consistent (pred_i - pred_{i+1}).
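For concreteness, this is roughly how I picture the experiment described above: the change-matching loss applied to depth predictions and depth ground truth at a stride of 3 frames. A sketch only, with hypothetical names, assuming invalid depth pixels are masked out:

```python
import torch.nn.functional as F

def depth_temporal_loss(pred_i, pred_i3, gt_i, gt_i3, weight=1.0):
    """Sketch of the described setup: penalize the difference between the
    prediction change and the GT change across frames i and i+3.
    Note: if (gt_i - gt_i3) itself flickers, this term drives
    (pred_i - pred_i3) to reproduce that flicker rather than to be smooth."""
    valid = (gt_i > 0) & (gt_i3 > 0)   # assumption: 0 marks invalid depth
    pred_change = pred_i - pred_i3
    gt_change = gt_i - gt_i3
    return weight * F.l1_loss(pred_change[valid], gt_change[valid])
```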
- Maybe the GT should be pre-processed to remove the obvious flickers.

How can the flicker in the GT be reduced? What pre-processing would you use to deal with it?
- You can increase the weight and do some debugging: check whether the loss decreases, and compare the change between two predicted frames with the change between the corresponding GT frames (for both training and testing frames), with and without the consistency loss. Maybe you can begin with the basic version and then add the multi-scale designs.
- Yes, the consistency of the GT has an impact, since we learn to follow the changes in the GT. Maybe the GT should be pre-processed to remove the obvious flickers. In NYUv2 the GT depth is captured with a depth sensor, so perhaps the flickering is caused by the limited precision of the sensor?
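A small sketch of the debugging check suggested above, with hypothetical names: measure how closely the predictions' frame-to-frame change tracks the GT's change, and compare the value for models trained with and without the consistency loss (besides confirming that the loss itself decreases during training).

```python
import torch

@torch.no_grad()
def change_alignment(preds, gts):
    """Mean absolute difference between the prediction change and the GT change
    over consecutive frame pairs (sequences of per-frame tensors).
    A lower value for the model trained with the consistency loss suggests
    the loss is actually shaping the temporal behaviour."""
    errs = []
    for t in range(len(preds) - 1):
        pred_change = preds[t + 1] - preds[t]
        gt_change = gts[t + 1] - gts[t]
        errs.append((pred_change - gt_change).abs().mean())
    return torch.stack(errs).mean()
```

It may also be worth running the same kind of statistic on the GT alone (mean |gt_{t+1} - gt_t|) before and after any pre-processing, to see how much flicker the pre-processing actually removes.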
Hi! Thanks for the great work and for releasing the code.

My question is: is it possible to adopt your temporal loss for other video tasks such as video semantic segmentation and video depth estimation? In those areas, most temporal losses are based on optical-flow warping, which is quite time-consuming during training. Your temporal loss is applied to RGB outputs; could it be extended to semantic results or depth maps?

By the way, is temporal_loss_mode == 2 worse than temporal_loss_mode == 1 in your case? What is the reason for that?
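For context on the comparison drawn above: the optical-flow warping losses common in video segmentation/depth work look roughly like the sketch below, which is what makes them slower to train with (flow must be computed or loaded and predictions warped at every step). This is a generic sketch with hypothetical names, assuming the flow and an occlusion/validity mask are precomputed; it is not code from this repository.

```python
import torch
import torch.nn.functional as F

def flow_warp(x, flow_t_to_prev):
    """Backward-warp x from the previous frame into frame t.
    x: (B, C, H, W); flow_t_to_prev: (B, 2, H, W), pixel offsets (dx, dy)."""
    _, _, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=x.device),
        torch.arange(w, device=x.device),
        indexing="ij",
    )
    gx = 2.0 * (xs.float() + flow_t_to_prev[:, 0]) / (w - 1) - 1.0  # normalize to [-1, 1]
    gy = 2.0 * (ys.float() + flow_t_to_prev[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                            # (B, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)

def flow_warping_loss(pred_t, pred_prev, flow_t_to_prev, valid_mask):
    """Generic temporal loss: the previous prediction warped to frame t should
    match the current prediction on non-occluded (valid) pixels."""
    warped_prev = flow_warp(pred_prev, flow_t_to_prev)
    diff = (pred_t - warped_prev).abs() * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1)
```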