Thanks for your great work!
I am curious how you extracted the language features for each audio description.
After loading the data, I observe that each sentence feature has shape [wordlen, 512]. This is strange because CLIP generates a single 512-dim feature for the whole sentence, and we could alternatively get a 77x512 output by storing the token-level features.
Since MAD-V2 has different audio descriptions from V1, I think it is necessary to extract the language features again for the new version.
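To make the shape mismatch concrete, here is a small sketch of what I mean. The `wordlen = 7` caption and the random features are purely hypothetical stand-ins for the released MAD features; the point is only the contrast between the `[wordlen, 512]` layout I observe and the single 512-dim sentence vector (or fixed 77x512 token output) I would expect from CLIP's text encoder:

```python
import numpy as np

# Hypothetical caption features in the layout I observe in the
# released data: [wordlen, 512] for a 7-word audio description.
wordlen, dim = 7, 512
word_feats = np.random.randn(wordlen, dim).astype(np.float32)

# What I would expect from CLIP instead:
#   - sentence level: one 512-dim vector (from the EOT token, projected)
#   - token level:    [77, 512], i.e. the full padded context length
# Mean-pooling here is just a crude way to show the sentence-level shape.
sentence_like = word_feats.mean(axis=0)

print(word_feats.shape)     # (7, 512) -- per-word layout in the released files
print(sentence_like.shape)  # (512,)   -- shape of a CLIP sentence embedding
```

So my question is which of these the extraction pipeline actually produced, and how the variable `wordlen` dimension was obtained from CLIP's fixed-length token output.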
Looking forward to your reply. Thank you very much.