ZHU-Zhiyu / NVS_Solver

Source code of paper "NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer"

Some questions about the details of your sampling code. #11

Closed bopan3 closed 1 week ago

bopan3 commented 1 week ago

Hello, I have some questions about the details of your sampling code.

  1. Why did you split the latent noise into 4 parts when calculating the gradient instead of using the whole thing? Is there any additional consideration for this?

  2. factor_s = 5.6 seems like a magic number to me. Could you explain its meaning and the reason for this particular setting?

mengyou2 commented 1 week ago

Hi, thanks for your interest in our project.

  1. The patch division is only there to reduce GPU memory usage, as we used a GPU with 40GB. If you have an 80GB GPU, you will not need to do this.

  2. In 'decode_latents', the latents are scaled by multiplying by 1/scaling_factor: https://github.com/ZHU-Zhiyu/NVS_Solver/blob/88e9662f2ab0c75880bbdb3fcb98e2fd9b97167c/svd_interpolate_single_img.py#L222-L226

The scaling_factor in the SVD config is 0.18215 (https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/blob/main/vae/config.json), so 1/scaling_factor is about 5.49. Our method requires the denoised masked latent output to stay close to temp_cond_latents. Thus, we scale temp_cond_latents by dividing by 1/scaling_factor, so that decoding the optimized latents reproduces the original image. To verify this experimentally, we ran a sweep: first encoding temp_cond_latents, then scaling them by a factor_s near 5.49 (5.0 to 6.0, step = 0.01), then decoding the scaled temp_cond_latents to an image and computing the PSNR against the original image. The factor_s with the best PSNR was 5.6.
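The sweep described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `psnr`, `sweep_factor`, and `toy_decode` are hypothetical names, and `toy_decode` is a stand-in that only multiplies by the factor, whereas the real experiment would go through the SVD VAE. The toy gain of 5.25 is arbitrary and chosen only so the search has a known answer.

```python
import math

# Value quoted in this thread from the SVD VAE config.
SCALING_FACTOR = 0.18215
inv_scale = 1.0 / SCALING_FACTOR  # roughly 5.49, the starting point for the sweep


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def sweep_factor(image, latents, decode, lo=5.0, hi=6.0, step=0.01):
    """Grid-search the factor in [lo, hi]; return (best_factor, best_psnr).

    `decode(latents, factor)` is a caller-supplied function (an assumption of
    this sketch) that applies the factor and decodes the latents to an image.
    """
    n_steps = round((hi - lo) / step)
    best_factor, best_score = None, -math.inf
    for i in range(n_steps + 1):
        factor = round(lo + i * step, 2)
        score = psnr(image, decode(latents, factor))
        if score > best_score:
            best_factor, best_score = factor, score
    return best_factor, best_score


# Toy stand-in for the VAE: decoding is just multiplication by the factor.
def toy_decode(latents, factor):
    return [z * factor for z in latents]


TOY_GAIN = 5.25  # arbitrary; NOT the value found in the real experiment
image = [0.1, 0.5, 0.9]
latents = [p / TOY_GAIN for p in image]

best_factor, best_score = sweep_factor(image, latents, toy_decode)
```

With the toy decoder the search recovers the toy gain. With a real VAE the best factor can differ slightly from 1/scaling_factor, which is consistent with the 5.6 reported above.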

bopan3 commented 1 week ago

Thanks for your detailed explanation!