Soldelli / MAD

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
MIT License

Language feature extraction process #12

Closed bofang98 closed 9 months ago

bofang98 commented 10 months ago

Thanks for your great work! I am curious how you extract the language features for each audio description. After loading the data, I see that each sentence feature has shape [wordlen, 512]. This is surprising, because CLIP generates a single 512-dimensional feature for the whole sentence, or a 77x512 output if we store the word-level (per-token) features.
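To make the shapes concrete, here is a minimal sketch of one way a [wordlen, 512] feature could arise: taking the padded 77x512 token-level output and dropping the padding rows. This is only a guess, not the MAD authors' confirmed pipeline. `trim_token_features` is a hypothetical helper, the toy ids and features are placeholders, and whether the start/end tokens are kept is an assumption. The sketch relies on CLIP's tokenizer padding with 0 and giving the end-of-text token the largest id.

```python
CONTEXT_LENGTH = 77             # CLIP's fixed text context length
EMBED_DIM = 512                 # feature width reported in this issue
SOT_ID, EOT_ID = 49406, 49407   # start/end-of-text ids in CLIP's BPE vocabulary


def trim_token_features(token_ids, token_features, keep_special=True):
    """Drop padding rows from a [77, 512] token-level feature matrix.

    Hypothetical helper (not part of CLIP or MAD). token_ids is a list of
    77 ints, token_features a list of 77 rows with 512 floats each.
    """
    if len(token_ids) != len(token_features):
        raise ValueError("token_ids and token_features must have equal length")
    # The end-of-text token has the largest id, so its position marks
    # the end of the real sentence; everything after it is padding.
    eot_pos = token_ids.index(max(token_ids))
    if keep_special:
        return token_features[: eot_pos + 1]   # rows for SOT, words, EOT
    return token_features[1:eot_pos]           # rows for the word tokens only


# Toy input: SOT, three placeholder word-token ids, EOT, then zero padding.
ids = [SOT_ID, 320, 1929, 2368, EOT_ID] + [0] * (CONTEXT_LENGTH - 5)
# Placeholder features: row i is filled with the value i.
feats = [[float(i)] * EMBED_DIM for i in range(CONTEXT_LENGTH)]

trimmed = trim_token_features(ids, feats)
print(len(trimmed), len(trimmed[0]))   # rows kept, feature width

# The single sentence-level feature is read at the EOT position
# (CLIP additionally applies a learned projection, omitted here).
sentence_feature = feats[ids.index(max(ids))]
```

With real data, `ids` would come from the CLIP tokenizer and `feats` from the text transformer's per-token output; the trimming logic stays the same.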

Since MAD-V2 has different audio descriptions from V1, I think the language features need to be extracted again for the new version.

Looking forward to your reply. Thank you very much.