Soldelli / MAD

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
MIT License

clip encoder details #13

FromA2Z closed this issue 10 months ago

FromA2Z commented 11 months ago

Thank you for your work; it is very helpful to the open source community. Regarding CLIP, do you apply the visual_projection layer when extracting image features, as in the CLIP source code?

Soldelli commented 10 months ago

Dear @FromA2Z, the function we used to encode our frames is the following: link. It follows the official OpenAI CLIP implementation.
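For context on what the question refers to: in the official OpenAI CLIP code, `encode_image` runs the visual encoder, whose last step multiplies the pooled class-token feature by a learned, bias-free projection matrix. The Hugging Face port names that layer `visual_projection`. The snippet below is only a toy sketch of that final step in plain Python, not the MAD extraction code. The helper `project`, the random stand-in weights, and the 768 and 512 sizes (the ViT-B/32 width and embedding size) are assumptions chosen for illustration.

```python
import random


def project(pooled, proj):
    """Bias-free linear projection: pooled (width,) times proj (width x embed_dim)."""
    width, embed_dim = len(proj), len(proj[0])
    if len(pooled) != width:
        raise ValueError("pooled feature width does not match projection matrix")
    return [
        sum(pooled[i] * proj[i][j] for i in range(width))
        for j in range(embed_dim)
    ]


# Tiny hand-checkable case: a 2-d feature projected to 3 dimensions.
tiny = project([1.0, 2.0], [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])  # [1.0, 2.0, 3.0]

# Same operation at illustrative ViT-B/32 sizes (assumed here, not taken from MAD).
random.seed(0)
WIDTH, EMBED_DIM = 768, 512
# Stand-in for the class-token feature after the final layer norm.
pooled = [random.gauss(0.0, 1.0) for _ in range(WIDTH)]
# Stand-in for the learned projection weights (random here, learned in CLIP).
proj = [[random.gauss(0.0, WIDTH ** -0.5) for _ in range(EMBED_DIM)]
        for _ in range(WIDTH)]

image_features = project(pooled, proj)  # lives in the joint image-text space
```

Skipping the projection would leave features at the encoder width (768 in this sketch) rather than the embedding size shared with the text encoder (512), so they would not be directly comparable with CLIP text features.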