RaymondWang987 / NVDS

ICCV 2023 "Neural Video Depth Stabilizer" (NVDS) & TPAMI 2024 "NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation" (NVDS+)
MIT License

Questions about the spatial loss when training? #24

Open onlyinheaven opened 10 months ago

onlyinheaven commented 10 months ago

Dear NVDS authors,

Thank you for publishing this outstanding work. I have a few questions after reading the paper. Since the depth prediction network is frozen while the stabilization network is trained, I would like to understand why the training objective includes a spatial loss term L(t-1). As I understand it, at inference time the stabilization network takes four depth inputs and outputs the depth for the target frame only, without explicitly producing a depth for t-1. So where does the prediction for t-1 in L(t-1) come from during training? Does the stabilization network output stabilized depth for all four frames at once? If not, does each training step run the network twice, once on frames t-4 to t-1 to produce the depth for t-1 and once on frames t-3 to t to produce the depth for t, before the loss is computed?

Separately, I would like to understand how the t-1 depth used by the temporal loss during training is obtained.
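
To make the two-pass reading in my question concrete, here is a minimal sketch of the training step I have in mind. It is only my guess and is not taken from the paper or the repository: `toy_stabilizer`, `l1`, and `two_pass_losses` are hypothetical names, a per-pixel average stands in for the real network, depth maps are flattened toy lists, and the temporal term omits any motion compensation (such as flow-based warping) that a real temporal loss would need.

```python
from typing import Callable, List, Sequence, Tuple

Depth = List[float]  # a flattened toy "depth map"


def toy_stabilizer(window: Sequence[Depth]) -> Depth:
    """Hypothetical stand-in for the stabilization network.

    Takes a window of depth maps and returns one depth map for the
    last frame of the window (here: a plain per-pixel average).
    """
    n = len(window)
    return [sum(pixels) / n for pixels in zip(*window)]


def l1(a: Depth, b: Depth) -> float:
    """Mean absolute difference between two depth maps."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def two_pass_losses(
    initial: Sequence[Depth],   # per-frame output of the frozen depth predictor
    gt: Sequence[Depth],        # per-frame ground-truth depth
    t: int,                     # index of the target frame
    stabilizer: Callable[[Sequence[Depth]], Depth] = toy_stabilizer,
    window: int = 4,
    lam: float = 1.0,           # assumed weight of the temporal term
) -> Tuple[float, float, float]:
    """Return (spatial, temporal, total) under the two-pass assumption."""
    assert t - window >= 0, "need one extra frame before the target window"

    # Pass 1: frames t-4 .. t-1 give a stabilized depth for t-1.
    pred_prev = stabilizer(initial[t - window:t])
    # Pass 2: frames t-3 .. t give a stabilized depth for t.
    pred_curr = stabilizer(initial[t - window + 1:t + 1])

    # Spatial loss on both predictions, i.e. L(t-1) + L(t).
    spatial = l1(pred_prev, gt[t - 1]) + l1(pred_curr, gt[t])
    # Temporal loss between the two predictions (no warping in this toy).
    temporal = l1(pred_curr, pred_prev)
    return spatial, temporal, spatial + lam * temporal


# Toy data: frame i has constant depth i at both "pixels".
frames = [[float(i)] * 2 for i in range(6)]
spatial, temporal, total = two_pass_losses(frames, frames, t=5)
```

If this matches the actual training procedure, both L(t-1) and the t-1 depth in the temporal loss would come from the first pass; if the network instead outputs all four frames at once, only a single pass would be needed.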

Thank you for your clarification.