Open · RaymondWang987 opened this issue 2 years ago
- We have not tried our method on depth/semantic prediction tasks, but it is easy to implement; you can try it.
- Mode 2 requires manually adjusting the weights, which is inconvenient, and inappropriate weights will lead to worse results.
- This is a regularization; it is not affected by the network structure.
- For a position that is already consistent, there is no need to penalize it with a large loss value caused by nearby flickering pixels/positions.
- They are hyperparameters; you can change them.
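As I understand the reply above, the loss is a per-position regularization that encourages the change between two output frames to follow the change between the corresponding reference frames, so positions that are already consistent contribute almost nothing to the penalty. A minimal sketch of that idea (my reading only, not the repository's actual `temporal_loss_mode` implementation; all names are hypothetical):

```python
import torch
import torch.nn.functional as F

def temporal_change_loss(out_t, out_tp, ref_t, ref_tp):
    """Hypothetical sketch: match the change between two output frames to the
    change between the corresponding reference frames (e.g. RGB inputs).
    The penalty is computed per position, so already-consistent positions
    contribute roughly zero regardless of flicker elsewhere in the image."""
    out_change = out_t - out_tp    # how the outputs change between frames
    ref_change = ref_t - ref_tp    # how the reference frames change
    return F.l1_loss(out_change, ref_change)  # per-pixel L1, averaged
```

A weighted ("mode 2"-style) variant would presumably scale this term by a manually chosen weight, which is where the tuning burden mentioned above comes from.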
Thank you very much for your advice, I will try it.
I have tried your temporal loss on video depth estimation. I apply the loss between frames i and i+3, using the model prediction and the depth ground truth, and I do not change any other parameters. In my case the loss does not work well: it has no obvious effect on training, i.e., the model's temporal consistency does not clearly get better or worse after adding the loss. I can think of two possible reasons:

1) The hyperparameters from your task are not suitable for depth estimation.
2) Your temporal loss relies heavily on temporally consistent ground truth (the naturally consistent RGB video frames in your case). In other tasks such as video depth estimation and video semantic segmentation, however, the ground truth itself flickers and is inconsistent to some extent. For example, you can observe obvious flickering in the ground truth of the NYU Depth V2 dataset (such as bathroom0030 or basement0001a/b/c). In that case the loss cannot work well with inconsistent ground truth: you cannot use a flickering (gt_i - gt_{i+1}) to obtain a consistent (pred_i - pred_{i+1}).
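For concreteness, this is roughly how I picture the experiment described above: the change-matching loss applied to depth predictions and depth ground truth at a stride of 3 frames. A sketch only, with hypothetical names, assuming invalid depth pixels are masked out:

```python
import torch.nn.functional as F

def depth_temporal_loss(pred_i, pred_i3, gt_i, gt_i3, weight=1.0):
    """Sketch of the described setup: penalize the difference between the
    prediction change and the GT change across frames i and i+3.
    Note: if (gt_i - gt_i3) itself flickers, this term drives
    (pred_i - pred_i3) to reproduce that flicker rather than to be smooth."""
    valid = (gt_i > 0) & (gt_i3 > 0)   # assumption: 0 marks invalid depth
    pred_change = pred_i - pred_i3
    gt_change = gt_i - gt_i3
    return weight * F.l1_loss(pred_change[valid], gt_change[valid])
```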
- Maybe the GT should be pre-processed to remove the obvious flickers.

How can the flicker in the GT be reduced? What pre-processing would you use to deal with it?
- You can increase the weight and do some debugging: check whether the loss decreases, and compare the change between two predicted frames with the change between the corresponding GT frames (for both training and testing frames), with and without the consistency loss. Maybe you can begin with the basic version and then add the multi-scale designs.
- Yes, the consistency of the GT has an impact, since we learn to follow the changes in the GT. Maybe the GT should be pre-processed to remove the obvious flickers. In NYUv2 the GT depth is captured with a depth sensor, so perhaps the flickering is caused by the limited precision of the sensor?
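A small sketch of the debugging check suggested above, with hypothetical names: measure how closely the predictions' frame-to-frame change tracks the GT's change, and compare the value for models trained with and without the consistency loss (besides confirming that the loss itself decreases during training).

```python
import torch

@torch.no_grad()
def change_alignment(preds, gts):
    """Mean absolute difference between the prediction change and the GT change
    over consecutive frame pairs (sequences of per-frame tensors).
    A lower value for the model trained with the consistency loss suggests
    the loss is actually shaping the temporal behaviour."""
    errs = []
    for t in range(len(preds) - 1):
        pred_change = preds[t + 1] - preds[t]
        gt_change = gts[t + 1] - gts[t]
        errs.append((pred_change - gt_change).abs().mean())
    return torch.stack(errs).mean()
```

It may also be worth running the same kind of statistic on the GT alone (mean |gt_{t+1} - gt_t|) before and after any pre-processing, to see how much flicker the pre-processing actually removes.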
Hi! Thanks for the great work and for releasing the code.

My question is: is it possible to adopt your temporal loss for other video tasks such as video semantic segmentation and video depth estimation? In those areas, most temporal losses are based on optical-flow warping, which is quite time-consuming during training. Your temporal loss is applied to RGB outputs; could it be extended to semantic results or depth maps?

By the way, is temporal_loss_mode == 2 worse than temporal_loss_mode == 1 in your case? What is the reason for that?
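For context on the comparison drawn above: the optical-flow warping losses common in video segmentation/depth work look roughly like the sketch below, which is what makes them slower to train with (flow must be computed or loaded and predictions warped at every step). This is a generic sketch with hypothetical names, assuming the flow and an occlusion/validity mask are precomputed; it is not code from this repository.

```python
import torch
import torch.nn.functional as F

def flow_warp(x, flow_t_to_prev):
    """Backward-warp x from the previous frame into frame t.
    x: (B, C, H, W); flow_t_to_prev: (B, 2, H, W), pixel offsets (dx, dy)."""
    _, _, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=x.device),
        torch.arange(w, device=x.device),
        indexing="ij",
    )
    gx = 2.0 * (xs.float() + flow_t_to_prev[:, 0]) / (w - 1) - 1.0  # normalize to [-1, 1]
    gy = 2.0 * (ys.float() + flow_t_to_prev[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                            # (B, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)

def flow_warping_loss(pred_t, pred_prev, flow_t_to_prev, valid_mask):
    """Generic temporal loss: the previous prediction warped to frame t should
    match the current prediction on non-occluded (valid) pixels."""
    warped_prev = flow_warp(pred_prev, flow_t_to_prev)
    diff = (pred_t - warped_prev).abs() * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1)
```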