ZHU-Zhiyu / NVS_Solver

Source code of paper "NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer"

Question about Sampling #8

Closed xizaoqu closed 2 weeks ago

xizaoqu commented 3 weeks ago

Hi, thanks for sharing your great work. I have a small question about the implementation of the sampling part. Does temp_cond_latents in step_in correspond to the $\mu$ in Eq. 13?

xizaoqu commented 3 weeks ago

Besides, I noticed that in the implementation the optimization is split into 4 parts that are processed separately. Is this done to save GPU memory? Will it affect the results (e.g., produce discontinuity artifacts at the patch boundaries)?

mengyou2 commented 3 weeks ago

Hi, thanks for your interest in our project. The uploaded code uses Posterior Sampling, corresponding to Eq. 14. Eq. 13 pertains to Directly Guided Sampling, and we will upload the code for this mode in a few days.

Yes, the patch division is implemented to save memory, as we used a GPU with 40 GB. This should not significantly affect the results. When dividing into patches, we made sure neighbouring patches overlap to avoid discontinuities at the boundaries.
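For readers who want a concrete picture of overlapping patch division, here is a minimal sketch. It is not the repository's code: the function name `overlapping_windows`, the choice of a single axis, and the overlap size are all assumptions made for illustration.

```python
def overlapping_windows(length, num_patches, overlap):
    """Return (start, stop) index pairs that cover range(length).

    Hypothetical helper: neighbouring windows share at least `overlap`
    elements, so every boundary is seen by two patches.
    """
    if num_patches < 1 or overlap < 0:
        raise ValueError("num_patches must be >= 1 and overlap >= 0")
    # Window size such that num_patches windows with the requested
    # overlap span the whole axis (ceil division).
    size = -(-(length + (num_patches - 1) * overlap) // num_patches)
    stride = size - overlap
    if stride <= 0:
        raise ValueError("overlap is too large for this many patches")
    windows = []
    for i in range(num_patches):
        # Clamp the last windows so they never run past the end.
        start = min(i * stride, max(length - size, 0))
        stop = min(start + size, length)
        windows.append((start, stop))
    return windows


windows = overlapping_windows(64, 4, 8)
```

Each patch can then be optimized on its own, and the overlapping strips can be averaged or cross-faded when the patches are written back. How the actual implementation merges the overlap is not stated in this thread.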

xizaoqu commented 3 weeks ago

> Hi, thanks for your interest in our project. The uploaded code uses Posterior Sampling, corresponding to Eq. 14. Eq. 13 pertains to Directly Guided Sampling, and we will upload the code for this mode in a few days. The variable temp_cond_latents is represented as $\widetilde\mu_t$ in both Eq. 13 and Eq. 14.

Thanks for your reply. I made a typo in question 1: Eq. 13 should be Eq. 12. But in the code, temp_cond_latents is encoded directly by the VAE rather than computed via Eq. 12?

mengyou2 commented 3 weeks ago

Oh, I misunderstood your question.

In Equation 12, $\hat{\mathbf{X}}$ corresponds to the warped image, and temp_cond_latents represents the VAE-encoded warped images. The other term in Equation 12, $\mu$, represents the prediction from the diffusion model. $\widetilde\mu$ is a combination of $\hat{\mathbf{X}}$ and $\mu$, which means that in some regions of the image the pixels come from $\hat{\mathbf{X}}$ (the warped image), while in other regions the pixels are generated directly by the diffusion model. In latent space, this means that in some regions the features come from temp_cond_latents, and in the remaining regions they are generated by the diffusion model.
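The region-wise combination described above can be sketched as follows. This is a simplified illustration with a hard binary mask, not the paper's Eq. 12 (which weights the two terms using $\lambda$) and not the repository's code; `blend_latents` and its argument names are hypothetical.

```python
import numpy as np


def blend_latents(cond_latents, model_pred, mask):
    """Take `cond_latents` where mask is 1 and `model_pred` elsewhere.

    Assumption: `mask` is binary and broadcastable to the latent shape.
    """
    mask = mask.astype(cond_latents.dtype)
    return mask * cond_latents + (1.0 - mask) * model_pred


# Toy latents: one channel, 2x2 spatial grid.
cond = np.ones((1, 2, 2))    # stands in for the VAE-encoded warped image
pred = np.zeros((1, 2, 2))   # stands in for the diffusion model prediction
mask = np.array([[True, False], [False, True]])
blended = blend_latents(cond, pred, mask)
```

In this toy case the diagonal entries come from the warped-image latents and the off-diagonal entries from the model prediction.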

In the optimization part, i.e., Equation 14, the loss is computed between $\widetilde\mu$ and the target $\mu$. Since $\widetilde\mu$ is a combination of $\hat{\mathbf{X}}$ and $\mu$, the implementation computes the loss only on the regions where the warped image (temp_cond_latents in latent space) is used; the regions generated directly by the diffusion model are excluded from the loss. In the step_single function, the variable λ from Equation 12 is used to compute the mask (top_masks) that marks the regions where the VAE encoding (temp_cond_latents) is used.

https://github.com/ZHU-Zhiyu/NVS_Solver/blob/a85bec261a17674fa7fa2f90719511d350c0420d/src/diffusers/schedulers/scheduling_euler_discrete.py#L596-L603
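As a rough illustration of a loss restricted to the masked regions, here is a minimal sketch. It assumes a binary mask is already available and uses a plain mean squared error; the linked scheduler code may derive the mask from λ and define the loss differently, and `masked_mse` is a hypothetical name.

```python
import numpy as np


def masked_mse(pred, target, mask):
    """Mean squared error over the positions where mask is 1.

    Positions with mask 0 (regions left to the diffusion model)
    contribute nothing to the loss.
    """
    mask = np.broadcast_to(mask, pred.shape).astype(pred.dtype)
    squared_error = (pred - target) ** 2
    count = mask.sum()
    if count == 0:
        return 0.0  # nothing is constrained by the warped image
    return float((squared_error * mask).sum() / count)


pred = np.zeros((1, 2, 2))
target = np.array([[[1.0, 2.0], [3.0, 4.0]]])  # stands in for temp_cond_latents
mask = np.array([[1, 0], [0, 1]])
loss = masked_mse(pred, target, mask)
```

Only the two masked positions (errors 1 and 4) enter the average here, so the unmasked positions have no effect on the gradient.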

Hope this makes it clearer.

xizaoqu commented 2 weeks ago

Understood, thanks.