Questions about the spatial loss during training?

Dear NVDS authors,

Thank you for publishing this outstanding work. I have some questions after reading your paper.

Since the depth prediction network is fixed while the stabilization network is trained, I would like to understand why the training objective includes a spatial loss term L(t-1). As I understand it, at inference time the stabilization network takes four depth inputs and outputs the depth for the target frame only; it does not explicitly output a depth for t-1. So where does the prediction used in L(t-1) come from during training?

Does the stabilization network output stabilized depth for all four frames at once? If not, is the network run twice for each backward pass: once on inputs t-4 to t-1 to produce the depth for t-1, and once on inputs t-3 to t to produce the depth for t, with the loss computed over both outputs?

Related to this, the temporal loss also uses the depth at t-1. How is that t-1 depth obtained during training?
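To make the "two forward passes" reading concrete, here is a minimal Python sketch of what I imagine one training step might look like. It is only my guess and is not taken from your code: `stabilize`, `l1`, and `training_losses` are hypothetical names, the averaging "network" is a stand-in for the real learned model, and the temporal term skips any optical-flow warping by assuming a static scene.

```python
# Hypothetical sketch of the "two forward passes" reading of one training step.
# Nothing here comes from the NVDS code base; every name is a placeholder.

def stabilize(window):
    """Stand-in for the stabilization network: 4 flattened depth maps -> 1.

    It simply averages the window per pixel; the real network is learned.
    """
    n = len(window)
    return [sum(pixels) / n for pixels in zip(*window)]


def l1(a, b):
    """Mean absolute difference between two flattened depth maps."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def training_losses(initial_depths, gt_prev, gt_curr):
    """Compute both spatial terms and a simplified temporal term.

    initial_depths: five flattened depth maps for frames t-4 .. t,
    as produced by the fixed depth prediction network.
    """
    assert len(initial_depths) == 5
    pred_prev = stabilize(initial_depths[0:4])  # window t-4 .. t-1 -> depth at t-1
    pred_curr = stabilize(initial_depths[1:5])  # window t-3 .. t   -> depth at t

    spatial_prev = l1(pred_prev, gt_prev)  # what I read as L(t-1)
    spatial_curr = l1(pred_curr, gt_curr)  # spatial loss on the target frame t
    # Assumption: static scene, so the two predictions are compared directly
    # instead of warping the t-1 prediction to frame t first.
    temporal = l1(pred_curr, pred_prev)
    return spatial_prev, spatial_curr, temporal


# Toy example: two-pixel depth maps, only the last frame differs.
frames = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [2.0, 3.0]]
losses = training_losses(frames, gt_prev=[1.0, 2.0], gt_curr=[1.0, 2.0])
print(losses)
```

If this matches what you do, both forward passes would contribute gradients to the same network weights in a single step. If the network instead outputs all four frames at once, the two `stabilize` calls above would collapse into one.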

Thank you for your clarification.
