huge123 opened 3 months ago
I was also confused here but figured it out.
In the config file, `video_length` is set to 16 and `num_queries` is set to 16 in `image_proj_stage_config`, which means each condition image is projected to 16*16 = 256 tokens. The first 77 tokens of `l_context` are text tokens, and the 256 tokens after them are image tokens. Unlike the text tokens, which are repeated for each of the 16 frames, the image tokens already carry a temporal dimension (`video_length`) and are rearranged to (16, 16), i.e., 16 tokens per frame.
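The token layout described above can be sketched as follows. This is a minimal NumPy illustration with assumed shapes (the actual repo uses torch/einops, and the embedding dim of 1024 here is a placeholder, not taken from the config):

```python
import numpy as np

# Assumed values from the discussion (not copied from the repo):
t = 16              # video_length
n_txt = 77          # CLIP text tokens
n_img = t * 16      # image tokens after the projector: video_length * num_queries = 256
dim = 1024          # placeholder embedding dimension

context = np.random.randn(1, n_txt + n_img, dim)  # (batch, 77 + 256, dim)

text_ctx = context[:, :n_txt]   # (1, 77, dim)  - text tokens
img_ctx = context[:, n_txt:]    # (1, 256, dim) - image tokens

# Text tokens: the same 77 tokens are repeated for every frame -> (t, 77, dim)
text_per_frame = np.repeat(text_ctx, t, axis=0)

# Image tokens already carry the temporal axis: (1, t*16, dim) -> (t, 16, dim),
# i.e. each frame gets its own 16 image tokens instead of a repeated copy.
img_per_frame = img_ctx.reshape(t, n_img // t, dim)

# Per-frame cross-attention context: 77 text + 16 image tokens per frame
per_frame_context = np.concatenate([text_per_frame, img_per_frame], axis=1)
print(per_frame_context.shape)  # (16, 93, 1024)
```

The key contrast is that `np.repeat` duplicates the text tokens identically across frames, while `reshape` distributes distinct image tokens to each frame.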
So essentially, the image projector projects the conditioning image's CLIP features into video features: it was trained to predict what kinds of spatio-temporal patterns can arise from a single image. There is no need to repeat them t times again.
This can be inferred from here: Point 2 of this issue.
Having said that, I am also wondering why the author did not project into image-only features (say, 16 queries) and then repeat them along the temporal dimension. Does it make a difference? @Doubiiu
Hi @huge123 and @lzhangbj, yes, what @lzhangbj said is correct. We intended to give the model room to learn some temporal variations in a video. However, due to limited training compute and temporal-coherence constraints (e.g. the maximum video length we can hold), it made only a slight difference and improvement (which is why we didn't emphasize this architecture in the main paper and only showed it in the supplementary document). We hope this insight can inspire further research to some extent.
Thank you for the answer! It helped a lot.
https://github.com/Doubiiu/DynamiCrafter/blob/c453369367122d7fbb0aa38f124e76dc8fe2a91c/lvdm/modules/networks/openaimodel3d.py#L556
I think this code rearranges the per-frame condition embeddings from (t l) to t l, but why is the per-frame image-condition length 16? I think the embedding length after `img_proj_model` should be 256. https://github.com/Doubiiu/DynamiCrafter/blob/c453369367122d7fbb0aa38f124e76dc8fe2a91c/scripts/evaluation/inference.py#L177
Shouldn't the check be `l_context == 77 + t*256`? Or am I missing something?
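For reference, the token bookkeeping implied by the config values discussed in this thread (the specific numbers are assumptions taken from that discussion, not from the repo's code):

```python
# Assumed config values from the thread:
video_length = 16   # t
num_queries = 16    # per-frame queries in image_proj_stage_config
text_tokens = 77    # CLIP text tokens

# The projector emits 256 tokens in TOTAL (covering all t frames),
# not 256 tokens per frame:
image_tokens = video_length * num_queries    # 16 * 16 = 256
l_context = text_tokens + image_tokens       # 77 + 256 = 333

print(l_context)                             # 333
# After the (t l) -> t l rearrange, each frame receives:
print(image_tokens // video_length)          # 16 image tokens per frame
```

So the per-frame image-condition length of 16 follows from the 256 tokens being split across the 16 frames rather than repeated for each frame.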