ariesssxu / vta-ldm

Apache License 2.0
38 stars 2 forks source link

CLIP4CLIP video encoder #3

Closed IFICL closed 1 month ago

IFICL commented 1 month ago

Hi, I have some questions regarding the CLIP4CLIP encoder you are using. From what I saw in summary.jsonl, the fea_encoder_name is "openai/clip-vit-large-patch14", which is a CLIP embedding model. Could you clarify how the video encoder works? How do you encode the video features? Do you actually use CLIP to get an embedding per frame and then concatenate them?

ariesssxu commented 1 month ago

For this basic model, yes. This is actually an interesting problem, as how the vision features are embedded significantly influences the generation quality, independent of the features' own capability. Our design references the CLIP4Clip framework (hence the CLIP4CLIP label); removing the position embedding and setting the patch number to 1 per frame yields the structure used in our basic model.
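To make the scheme concrete, here is a minimal sketch of per-frame encoding followed by stacking along the time axis. The "encoder" below is a random projection standing in for CLIP's image encoder (not the real openai/clip-vit-large-patch14 weights), and the frame/embedding sizes are toy dimensions chosen only to keep the example small:

```python
import numpy as np

# Stand-in for CLIP's image encoder: a fixed random projection.
# Toy dimensions (3x16x16 frames, 32-dim embeddings) for illustration only.
rng = np.random.default_rng(0)
FRAME_DIM, EMB_DIM = 3 * 16 * 16, 32
W = rng.standard_normal((FRAME_DIM, EMB_DIM)).astype(np.float32)

def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Map one frame (3, 16, 16) to a single embedding (32,)."""
    return frame.reshape(-1) @ W

def encode_video(frames: np.ndarray) -> np.ndarray:
    """frames (T, 3, 16, 16) -> (T, 32).

    Each frame is encoded independently and the embeddings are stacked
    along the time axis: one token per frame (patch-num = 1), with no
    positional embedding added.
    """
    return np.stack([encode_frame(f) for f in frames], axis=0)

video = rng.standard_normal((8, 3, 16, 16)).astype(np.float32)  # 8 sampled frames
feats = encode_video(video)
print(feats.shape)  # (8, 32)
```

In the actual model, `encode_frame` would be the CLIP vision tower, so each frame contributes one 768-dim token to the conditioning sequence.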

IFICL commented 1 month ago

Thanks for the clarification.