Great work! But I have a few questions:

From the point of view of the $Loss$ function, the model's results in the mask region essentially depend on the output of the 2D inpainting model, since the mask region is optimized against $M'_i$, which is only used in the $\lambda_1 \cdot M_0$ term. Is that right?
Since the $Loss$ function only optimizes the mask region against the inpainted image (the full image is used to optimize the rendering quality of this view), how do you ensure the quality of the mask region in other views?
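To make my question concrete, here is a minimal sketch of how I read the per-view objective (variable names, shapes, and weights are my own assumptions, not from the released code):

```python
import torch

def view_loss(render, inpainted, mask, lam1=1.0, lam2=1.0):
    """How I read the per-view loss (hypothetical names, not the paper's code).

    render:    rendered RGB of this view, shape (3, H, W)
    inpainted: 2D-inpainted target M'_i for this view, shape (3, H, W)
    mask:      removal mask M_0, shape (1, H, W), 1 inside the hole
    """
    # Inside the mask, the render is pulled toward the 2D inpainting result,
    # so the masked region's content is determined by the inpainting model.
    masked_term = (mask * (render - inpainted).abs()).mean()
    # Outside the mask, the render is supervised by the unchanged observation.
    unmasked_term = ((1 - mask) * (render - inpainted).abs()).mean()
    return lam1 * masked_term + lam2 * unmasked_term
```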
Hi, @chenj02. Thank you for your interest in our GScream!
Essentially, that's true. For the masked area, the bidirectional cross-attention helps achieve smoother boundaries, but the reference view's RGBD provides the most important guidance.
Considering 3D constraints, our bidirectional cross-attention module applies regularization to the masked area. From the perspective of 2D supervision, we experimented with a perceptual loss and an additional learned discriminator to constrain the masked area in other views, which yielded some improvement. However, we opted not to include these 2D constraints, since we found that the existing pipeline already produces favorable results.
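For what it's worth, the masked perceptual constraint we tried looked roughly like the sketch below (illustrative only; it assumes the standard `lpips` package, and the exact formulation in our experiments may have differed):

```python
import torch
import lpips  # pip install lpips; standard LPIPS perceptual metric

lpips_fn = lpips.LPIPS(net='vgg')  # VGG-based perceptual distance

def masked_perceptual_loss(render, inpainted, mask):
    """Perceptual loss restricted to the masked area of a novel view.

    render, inpainted: (1, 3, H, W) tensors scaled to [-1, 1]
    mask:              (1, 1, H, W) binary mask, 1 inside the removed region
    """
    # Replace everything outside the mask with the target, so the
    # perceptual distance is driven only by the masked area.
    blended = mask * render + (1 - mask) * inpainted
    return lpips_fn(blended, inpainted).mean()
```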