Open · Sisso16 opened 1 week ago
Thank you! Here, condition_index = [0] means that one of the reference images is fed into the CLIP image encoder (depicted in the pipeline figure). The CLIP image encoder extracts high-level semantic information from the input image. We tested both feeding all of the reference images and feeding only one reference image to the CLIP image encoder, and found no difference in model performance. Therefore, we use only one input in this case.
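To make the role of an index list like condition_index concrete, here is a minimal sketch, not the repository's actual code. The helper names (select_condition_images, pooled_condition_embedding, toy_encoder) are hypothetical, the toy encoder is only a stand-in for the CLIP image encoder, and averaging the embeddings is an assumption: the thread does not say how several embeddings would be combined.

```python
from typing import Callable, List, Sequence

Embedding = List[float]


def select_condition_images(reference_images: Sequence, condition_index: Sequence[int]) -> list:
    """Pick the reference images named by condition_index (e.g. [0] keeps only the first)."""
    return [reference_images[i] for i in condition_index]


def pooled_condition_embedding(images: Sequence, encode_image: Callable[[object], Embedding]) -> Embedding:
    """Encode each selected image and average the embeddings element-wise.

    Mean pooling is an assumption made for this sketch, not the project's method.
    """
    embeddings = [encode_image(img) for img in images]
    if not embeddings:
        raise ValueError("condition_index selected no images")
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]


def toy_encoder(image: Sequence[float]) -> Embedding:
    """Hypothetical stand-in for the CLIP image encoder: returns [sum, length]."""
    return [float(sum(image)), float(len(image))]


# Three toy "reference images", each just a short list of numbers.
references = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# condition_index = [0]: only the first reference image is encoded.
single = pooled_condition_embedding(select_condition_images(references, [0]), toy_encoder)

# condition_index = [0, 1, 2]: all reference images are encoded and averaged.
pooled = pooled_condition_embedding(select_condition_images(references, [0, 1, 2]), toy_encoder)
```

In this sketch, widening the index list is the only change needed to condition on more images; whether the real pipeline accepts a longer list without further changes would have to be checked against the code.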
I see, but is there a way to easily tweak the code to use more reference images for conditioning?
Hi there, first of all, nice work! Secondly, I wanted to use this model in a slightly different way, and from the paper it seemed to me that one or more reference images can be used during inference, which the diffusion model then uses for conditioning. However, going through the code, it seems that only one image is ever used for conditioning, since we have condition_index = [0] in run diffusion. I understand this would always be the case for the task of generating a video from a single image, but already for nvs_sparse_view this means that only one of the available images is used for conditioning. Thanks in advance for your help!