Reference images used for conditioning

Sisso16 commented 1 week ago

Hi there, first of all, nice work! Secondly, I wanted to use this model in a slightly different way. From the paper, it seemed to me that one or more reference images can be used during inference to condition the diffusion model. However, going through the code, it seems that only one image is ever used for conditioning, since we have condition_index = [0] in run diffusion. I understand this would always be the case when generating a video from a single image, but for nvs_sparse_view it means that only one of the available images is used for conditioning. Thanks in advance for your help!

Drexubery commented 1 week ago

Thank you! Here, condition_index = [0] means that one of the reference images is fed into the CLIP Image encoder (depicted in the pipeline figure). The CLIP Image encoder extracts high-level semantic information from the input image. We tested feeding all of the reference images to the CLIP Image encoder as well as feeding only one, and found no difference in model performance. Therefore, we use only one input in this case.
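To illustrate the idea, here is a minimal sketch of how a condition_index list could select which reference frames reach an image encoder. It is not the repository's actual code: select_condition_frames, encode_condition, and toy_encoder are hypothetical names, the toy encoder merely stands in for the CLIP Image encoder, and averaging the embeddings is an assumption about how several references might be combined, since the thread does not say how that was done.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def select_condition_frames(frames: Sequence, condition_index: Sequence[int]) -> list:
    """Pick the frames that will be passed to the image encoder (hypothetical helper)."""
    return [frames[i] for i in condition_index]


def encode_condition(
    frames: Sequence,
    condition_index: Sequence[int],
    encoder: Callable[[object], Vector],
) -> Vector:
    """Encode the selected frames and average their embeddings.

    Averaging is an assumption made for this sketch, not the project's method.
    """
    selected = select_condition_frames(frames, condition_index)
    if not selected:
        raise ValueError("condition_index must select at least one frame")
    embeddings = [encoder(frame) for frame in selected]
    dim = len(embeddings[0])
    # Element-wise mean over all selected embeddings.
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]


def toy_encoder(frame: Sequence[float]) -> Vector:
    """Stand-in for a real image encoder: returns a tiny 2-D 'embedding'."""
    return [float(sum(frame)), float(len(frame))]


# Three fake "reference images", each just a short list of numbers.
frames = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

single = encode_condition(frames, [0], toy_encoder)    # one reference image
multi = encode_condition(frames, [0, 2], toy_encoder)  # two reference images
```

With condition_index = [0] only the first frame is encoded; passing more indices changes only which frames are selected, so the rest of the pipeline would still receive a single conditioning vector under this assumed averaging scheme.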

Sisso16 commented 1 week ago

I see, but is there a way to easily tweak the code to use more reference images for conditioning?
