Drexubery / ViewCrafter

Official implementation of "ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis"
Apache License 2.0

Confused about the video pairs organization #38

Open everythoughthelps opened 1 week ago

everythoughthelps commented 1 week ago

This is great work! You said, "Then, we randomly select the constructed point cloud of the video frames and render it along the estimated camera trajectory using Pytorch3D." in Sec. 4.1. From my perspective, you want to build frame pairs in this step, right? But one thing confuses me: the whole set of frames is rendered from the point cloud, which is constructed from all the frames, so shouldn't the rendered frames and the real frames have very little gap between them? Or do you split the frames into two groups, say, out of 25 frames, 10 are used to build the point cloud with DUSt3R, and then you render the remaining 15 frames and pair them with the 15 ground-truth frames to construct the video pairs?

everythoughthelps commented 1 week ago

Another question: the Next Best View strategy you designed is a good way to render the next view step by step. I was wondering, would it be possible to obtain the same result by using the adjacent cameras and frames aligned with the timeline, say, the frames right after the reference images? Would this also work?

Drexubery commented 6 days ago

> This is great work! You said, "Then, we randomly select the constructed point cloud of the video frames and render it along the estimated camera trajectory using Pytorch3D." in Sec. 4.1. From my perspective, you want to build frame pairs in this step, right? But one thing confuses me: the whole set of frames is rendered from the point cloud, which is constructed from all the frames, so shouldn't the rendered frames and the real frames have very little gap between them? Or do you split the frames into two groups, say, out of 25 frames, 10 are used to build the point cloud with DUSt3R, and then you render the remaining 15 frames and pair them with the 15 ground-truth frames to construct the video pairs?

Thanks! We use all 25 frames to build the point cloud with DUSt3R, and the rendered frames typically differ from the ground-truth frames used for supervision. This allows the model to learn how to refine and correct the rendered frames.
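In case it helps others, here is a minimal sketch of how such (rendered, ground-truth) training pairs could be assembled with PyTorch3D's point-cloud renderer. The point cloud and camera poses are assumed to come from the DUSt3R reconstruction step; `build_point_cloud` is not shown and stands in for that step. This only illustrates the pairing described above, not the repository's actual training code.

```python
import torch
from pytorch3d.structures import Pointclouds
from pytorch3d.renderer import (
    PerspectiveCameras,
    PointsRasterizationSettings,
    PointsRasterizer,
    PointsRenderer,
    AlphaCompositor,
)

def render_training_pairs(gt_frames, points, colors, Rs, Ts, image_size=576):
    """Render the point cloud from each estimated camera pose and pair
    the (imperfect) renders with the ground-truth frames.

    gt_frames: (N, H, W, 3) ground-truth video frames
    points:    (P, 3) point cloud built from ALL N frames (e.g. via DUSt3R)
    colors:    (P, 3) per-point RGB features
    Rs, Ts:    (N, 3, 3) / (N, 3) estimated camera poses, one per frame
    """
    device = points.device
    raster_settings = PointsRasterizationSettings(
        image_size=image_size,
        radius=0.01,        # point radius; holes appear where coverage is sparse
        points_per_pixel=8,
    )
    pcl = Pointclouds(points=[points], features=[colors])
    pairs = []
    for i in range(len(gt_frames)):
        cameras = PerspectiveCameras(R=Rs[i:i + 1], T=Ts[i:i + 1], device=device)
        renderer = PointsRenderer(
            rasterizer=PointsRasterizer(cameras=cameras, raster_settings=raster_settings),
            compositor=AlphaCompositor(),
        )
        # Rendered view carries occlusion/hole artifacts even though the
        # cloud was built from all frames -- this is the gap the diffusion
        # model learns to correct.
        rendered = renderer(pcl)[0, ..., :3]  # (H, W, 3)
        pairs.append((rendered, gt_frames[i]))
    return pairs
```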

everythoughthelps commented 6 days ago

> Thanks! We use all 25 frames to build the point cloud with DUSt3R, and the rendered frames typically differ from the ground-truth frames used for supervision. This allows the model to learn how to refine and correct the rendered frames.

OK, I had thought the rendered images would have little gap with the GT images. Thank you.