Open fishfuck opened 5 days ago
Hi,
Thank you for your answer! I have another question: are the depth scales of the panoramic views supervised only by the RGB images rendered from adjacent views, or is there any additional design for this? It feels like such supervision would be very weak.
Yes, the small overlap region between adjacent views is sufficient to provide the scale information during training. We do not have any other design for this.
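For intuition, here is a minimal sketch (not the authors' code; the intrinsics and baseline below are hypothetical) of why even a small overlap constrains depth scale: a pixel back-projected at the wrong depth reprojects to a different pixel in the neighbouring camera, so a photometric loss on the overlap region penalises an incorrect scale.

```python
import numpy as np

def reproject(u, v, depth, K_src, K_dst, T_dst_from_src):
    """Back-project pixel (u, v) at `depth` in the source camera and
    project it into the neighbouring (destination) camera."""
    ray = np.linalg.inv(K_src) @ np.array([u, v, 1.0])
    p_src = depth * ray                      # 3D point in the source frame
    p_dst = T_dst_from_src[:3, :3] @ p_src + T_dst_from_src[:3, 3]
    uvw = K_dst @ p_dst                      # perspective projection
    return uvw[:2] / uvw[2]

# Hypothetical intrinsics and a 1 m baseline between adjacent cameras.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 1.0

# The same pixel lands at different places depending on the depth scale,
# so only the correct scale makes the rendered overlap match the RGB.
print(reproject(320, 240, 10.0, K, K, T))  # [370. 240.]
print(reproject(320, 240, 20.0, K, K, T))  # [345. 240.]
```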
Hi, thanks for your great work!
I didn't fully understand the design of your cross-view loss. As far as I understand, you extract features from the six panoramic images, decode them into Gaussian parameters, unproject them through the masked images, and then splat. You then compute the loss between the rendered images and the original images, right? My questions are: how is the scale ensured in this process? When computing the loss, is it based on the original images or the masked images? And are images from adjacent time frames used?
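To make the masked-vs-original part of the question concrete, here is how I picture the loss step (a hypothetical sketch, not the paper's implementation): a mean L1 photometric loss between the rendered and original image, optionally restricted to a validity mask, which can give quite different values.

```python
import numpy as np

def photometric_loss(rendered, target, mask=None):
    """Mean L1 loss between the rendered and original image; if `mask`
    is given, only the masked pixels contribute (hypothetical design)."""
    diff = np.abs(rendered - target)
    if mask is not None:
        diff = diff[mask]
    return float(diff.mean())

# Toy example: the rendering is wrong at exactly one pixel.
rendered = np.zeros((2, 2))
rendered[0, 0] = 1.0
target = np.zeros((2, 2))
mask = np.zeros((2, 2), dtype=bool)
mask[0, 0] = True

print(photometric_loss(rendered, target))        # 0.25 (over all pixels)
print(photometric_loss(rendered, target, mask))  # 1.0  (masked pixels only)
```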
Are Stage 1 and Stage 2 related? Does Stage 1 train only the 6D pose network and the 2D encoder? If so, my understanding is that the entire process could also be completed through Stage 2 alone, since the task of Stage 1 is unrelated to occupancy estimation.
Looking forward to your reply.