Does the term "hidden state of neighbor" refer to the latent representation of the neighboring image or to the scene-level embedding?
Additionally, when addressing view consistency, does the model output results for multiple views simultaneously or for only one view at a time?