Soldelli / MAD

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
MIT License

CLIP backbone #3

Closed (fmu2 closed this 1 year ago)

fmu2 commented 1 year ago

Hi, thanks for the great work! Which CLIP backbone did you use for video/text feature extraction?

Soldelli commented 1 year ago

Dear @fmu2, we used CLIP ViT-B/32. We plan to release B/16 and L/14 early next year.
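
For anyone who wants to extract comparable features, here is a minimal sketch. It assumes OpenAI's `clip` Python package and `torch` are installed; it is not the MAD authors' extraction code, and frame sampling and any post-processing are left out because this thread does not describe them. `CLIP_BACKBONES`, `embed_dim`, and `extract_features` are hypothetical names introduced only for illustration. The embedding sizes are those of the public CLIP checkpoints.

```python
# Sketch only: not the MAD authors' extraction pipeline.
# Keys follow the model identifiers accepted by OpenAI's `clip` package.
CLIP_BACKBONES = {
    "ViT-B/32": {"patch_size": 32, "embed_dim": 512},  # backbone named in this thread
    "ViT-B/16": {"patch_size": 16, "embed_dim": 512},  # mentioned as a planned release
    "ViT-L/14": {"patch_size": 14, "embed_dim": 768},  # mentioned as a planned release
}


def embed_dim(backbone: str) -> int:
    """Return the size of the joint image/text embedding for a backbone."""
    return CLIP_BACKBONES[backbone]["embed_dim"]


def extract_features(frames, sentences, backbone="ViT-B/32", device="cpu"):
    """Encode PIL images and sentences with CLIP (hypothetical helper).

    `torch` and `clip` are imported lazily so the table above can be used
    without either package installed.
    """
    import torch
    import clip

    model, preprocess = clip.load(backbone, device=device)
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    tokens = clip.tokenize(sentences).to(device)
    with torch.no_grad():
        visual = model.encode_image(images)   # shape: (len(frames), embed_dim)
        textual = model.encode_text(tokens)   # shape: (len(sentences), embed_dim)
    return visual, textual
```

Note that B/32 and B/16 share a 512-dimensional embedding while L/14 uses 768, so a model trained on B/32 features would likely need its input layer resized before it could consume L/14 features.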