Drexubery / ViewCrafter

Official implementation of "ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis"
Apache License 2.0

Questions about training #30

Open · xiyichen opened this issue 1 day ago

xiyichen commented 1 day ago

Nice work! I'm trying to reproduce your training code by modifying the code from DynamiCrafter.

If I understand correctly, the camera parameters are not passed into the diffusion model as additional conditioning, and the only information the network receives about the viewpoints is the point-cloud renderings. Please correct me if I got that wrong.

I still have a few questions about training:

  1. How do you sample the 25 frames for each data sample during training? Are you randomly sampling 25 consecutive frames for each scene, or do you use some other strategy? Do you also use DUSt3R to construct point clouds from the 25 frames during training, or do you use some other source of point clouds?
  2. When training the model for single-view NVS, did you use any captions?
  3. Do you use FPS control as in DynamiCrafter for this work? If the scene is static (i.e., no motion), isn't the FPS information redundant in that case?

Looking forward to your response!

Drexubery commented 15 hours ago

Hi, thanks for your interest in our work! Yes, it's quite easy to reproduce the training code using that of DynamiCrafter; the only modifications needed are to the training data and some data loader scripts.

  1. As noted in the paper, we randomly sample 25 consecutive frames for each scene and use DUSt3R to reconstruct a globally aligned point cloud from all 25 frames. We then randomly keep the point cloud of one or more frames, drop the others, and render the kept points using the camera poses previously estimated by DUSt3R to obtain the point-cloud render results (see the sketch after this list).
  2. We use a fixed caption: "Rotating view of a scene."
  3. We set the FPS to 10 during both training and inference.
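
For concreteness, here is a minimal sketch of how that sampling-and-rendering step could look (illustrative only, not our actual data scripts; the per-frame clouds `pts[i]`, colors `rgb[i]`, shared intrinsics `K`, world-to-camera poses `w2c[i]`, the image size, and the choice to splat into every target view are all assumptions here):

```python
import numpy as np

def render_points(points, colors, K, w2c, h, w):
    """Z-buffered splat of a world-space point cloud into one camera view."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # N x 4 homogeneous
    cam = (w2c @ pts_h.T).T[:, :3]                                       # world -> camera
    front = cam[:, 2] > 1e-6                                             # keep points in front of the camera
    cam, cols = cam[front], colors[front]
    uvz = (K @ cam.T).T
    uv = uvz[:, :2] / uvz[:, 2:3]                                        # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, cols = u[inside], v[inside], cam[inside, 2], cols[inside]
    img = np.zeros((h, w, 3), dtype=np.float32)
    zbuf = np.full((h, w), np.inf, dtype=np.float32)
    for ui, vi, zi, ci in zip(u, v, z, cols):
        if zi < zbuf[vi, ui]:                                            # nearest point wins
            zbuf[vi, ui] = zi
            img[vi, ui] = ci
    return img

def build_condition_renders(pts, rgb, K, w2c, rng, h=320, w=512, n_keep=1):
    """Keep `n_keep` of the 25 aligned per-frame clouds, then render them into every view."""
    keep = rng.choice(len(pts), size=n_keep, replace=False)
    kept_pts = np.concatenate([pts[i] for i in keep], axis=0)
    kept_rgb = np.concatenate([rgb[i] for i in keep], axis=0)
    return [render_points(kept_pts, kept_rgb, K, w2c[i], h, w) for i in range(len(w2c))]
```

The naive per-pixel z-buffer above just keeps the nearest point per pixel; it is only meant to illustrate the data flow, not the actual renderer.
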
xiyichen commented 13 hours ago

Thanks for your quick reply! There's one part that I don't quite understand:

> We then randomly keep the point cloud of one or more frames, drop the others, and render the kept points using the camera poses previously estimated by DUSt3R to obtain the point-cloud render results.

If I understand correctly, DUSt3R gives 25 point clouds, one per frame, all globally aligned in world space together with the recovered camera poses. For each training clip, do you always pick a single one of the 25 point clouds and render view-dependent point-cloud images for all 25 views, or do you randomly sample a few of the 25 point clouds and render those?

If you only use the point cloud of a single view, the renders could degrade badly when the target viewpoint deviates a lot from it. Have you tried rendering point clouds merged from all frames? (A tiny sketch of the two options I mean is below.)
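
To make the question concrete, here is a small sketch of the two alternatives I have in mind, assuming `frame_clouds[i]` is the globally aligned N_i x 3 DUSt3R cloud of frame i (placeholder names, not ViewCrafter code):

```python
import numpy as np

def single_frame_cloud(frame_clouds, ref_idx):
    """Option A: keep only one frame's globally aligned cloud as the condition geometry."""
    return frame_clouds[ref_idx]

def merged_cloud(frame_clouds):
    """Option B: concatenate all 25 aligned per-frame clouds before rendering,
    so target views far from the reference are still covered by nearby geometry."""
    return np.concatenate(frame_clouds, axis=0)
```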