Closed — bopan3 closed this issue 1 week ago
Hi, thanks for your interest in our project.
The patch division is implemented to save GPU memory, since we used a 40GB GPU. If you have an 80GB GPU, you will not need it.
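The memory-saving idea can be sketched as follows: split the latent along one axis, process each patch independently so only one patch is resident at peak, then reassemble. This is a minimal NumPy sketch, not the repository's implementation; `process_in_patches` and the per-patch identity op are illustrative stand-ins.

```python
import numpy as np

def process_in_patches(latent, n_patches=4):
    """Process a latent tensor in patches along the frame axis to
    reduce peak memory, then reassemble the result.

    `latent` has shape (frames, channels, height, width). Each patch
    is processed independently; the real code would run the gradient
    computation here, but we use a placeholder identity op.
    """
    chunks = np.array_split(latent, n_patches, axis=0)
    out = [chunk * 1.0 for chunk in chunks]  # placeholder per-patch op
    return np.concatenate(out, axis=0)

# Toy latent: 25 frames, 4 channels, 72x128 spatial.
latent = np.random.rand(25, 4, 72, 128).astype(np.float32)
result = process_in_patches(latent)
assert result.shape == latent.shape
```

Whether the split is along frames or spatial dimensions, the memory argument is the same: only one patch's intermediate activations need to be live at a time.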
In `decode_latents`, the latents are scaled by multiplying by 1/scaling_factor: https://github.com/ZHU-Zhiyu/NVS_Solver/blob/88e9662f2ab0c75880bbdb3fcb98e2fd9b97167c/svd_interpolate_single_img.py#L222-L226
The scaling_factor in the SVD config is 0.18215 (https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt/blob/main/vae/config.json), so 1/scaling_factor ≈ 5.49. Our method enforces that the denoised masked latent output stays close to temp_cond_latents, so we scale temp_cond_latents by dividing it by 1/scaling_factor; when the optimized latents are decoded, the output then reproduces the original image. To further verify this empirically, we ran an experiment: encode temp_cond_latents, scale it by a candidate factor_s near 5.49 (sweeping 5.0–6.0 with step 0.01), decode the scaled latents back to an image, and compute the PSNR against the original image. The factor_s giving the highest PSNR was 5.6.
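The sweep described above can be sketched like this. Note this is a toy stand-in, not the authors' experiment: "encode" is just multiplication by scaling_factor and "decode" is the candidate rescale, so the optimum lands at the analytic value 1/0.18215 ≈ 5.49; with the real SVD VAE the authors found the optimum shifted to 5.6.

```python
import numpy as np

def psnr(a, b, data_range=1.0):
    """Peak signal-to-noise ratio between two images in [0, data_range]."""
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

scaling_factor = 0.18215
rng = np.random.default_rng(0)
image = rng.random((64, 64))          # toy "original image"
latent = image * scaling_factor        # toy "encode"

# Sweep factor_s over 5.0–6.0 with step 0.01, keep the best-PSNR factor.
best_factor, best_psnr = None, -np.inf
for factor_s in np.arange(5.0, 6.0, 0.01):
    decoded = latent * factor_s        # toy "decode" with candidate rescale
    p = psnr(image, decoded)
    if p > best_psnr:
        best_factor, best_psnr = factor_s, p

print(round(best_factor, 2))  # closest grid point to 1/0.18215 ≈ 5.49
```

Swapping the toy encode/decode for the actual VAE calls reproduces the authors' selection procedure; the nonlinearity of the real decoder is presumably why the empirical optimum (5.6) differs from the analytic 5.49.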
Thanks for your detailed explanation!
Hello, I have some questions about the details of your sampling code.
Why did you split the latent noise into 4 parts when calculating the gradient instead of using the whole tensor? Is there any additional consideration behind this?
factor_s = 5.6 seems like a magic number to me. Could you explain its meaning and the reason for this particular setting?