Open fishfuck opened 5 days ago
Hi,
Thank you for your answer! I have another question: are the depth scales of the panoramic views supervised only by the RGB images rendered from adjacent views, or is there any additional design for this? It feels like such supervision would be very weak.
Yes, the small overlap region between adjacent views is sufficient to provide the scale information during training. We do not have any other design for this.
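For intuition, here is a minimal sketch (not the authors' code; the intrinsics and baseline below are hypothetical) of why even a small overlap constrains depth scale: a pixel back-projected at the wrong depth reprojects to a different pixel in the neighbouring camera, so a photometric loss on the overlap region penalises an incorrect scale.

```python
import numpy as np

def reproject(u, v, depth, K_src, K_dst, T_dst_from_src):
    """Back-project pixel (u, v) at `depth` in the source camera and
    project it into the neighbouring (destination) camera."""
    ray = np.linalg.inv(K_src) @ np.array([u, v, 1.0])
    p_src = depth * ray                      # 3D point in the source frame
    p_dst = T_dst_from_src[:3, :3] @ p_src + T_dst_from_src[:3, 3]
    uvw = K_dst @ p_dst                      # perspective projection
    return uvw[:2] / uvw[2]

# Hypothetical intrinsics and a 1 m baseline between adjacent cameras.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 1.0

# The same pixel lands at different places depending on the depth scale,
# so only the correct scale makes the rendered overlap match the RGB.
print(reproject(320, 240, 10.0, K, K, T))  # [370. 240.]
print(reproject(320, 240, 20.0, K, K, T))  # [345. 240.]
```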
Hi, thanks for your great work!
I didn't fully understand the design of your cross-view loss. As far as I understand, you extract features from the six panoramic images, decode them into Gaussian parameters, unproject them through the masked images, and then splat. You then compute the loss between the rendered images and the original images, right? My questions are: how is the scale ensured in this process? When computing the loss, is it based on the original images or the masked images? And are images from adjacent time frames used?
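To make the masked-vs-original part of the question concrete, here is how I picture the loss step (a hypothetical sketch, not the paper's implementation): a mean L1 photometric loss between the rendered and original image, optionally restricted to a validity mask, which can give quite different values.

```python
import numpy as np

def photometric_loss(rendered, target, mask=None):
    """Mean L1 loss between the rendered and original image; if `mask`
    is given, only the masked pixels contribute (hypothetical design)."""
    diff = np.abs(rendered - target)
    if mask is not None:
        diff = diff[mask]
    return float(diff.mean())

# Toy example: the rendering is wrong at exactly one pixel.
rendered = np.zeros((2, 2))
rendered[0, 0] = 1.0
target = np.zeros((2, 2))
mask = np.zeros((2, 2), dtype=bool)
mask[0, 0] = True

print(photometric_loss(rendered, target))        # 0.25 (over all pixels)
print(photometric_loss(rendered, target, mask))  # 1.0  (masked pixels only)
```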
Are Stage 1 and Stage 2 related? Does Stage 1 train only the 6D pose network and the 2D encoder? If so, my understanding is that the entire process could also be completed through Stage 2 alone, since the task of Stage 1 is unrelated to occupancy estimation.
Looking forward to your reply.