Thanks for your great work!
I am curious how you extracted the language features for each audio description.
After loading the data, I observe that each sentence feature has shape [wordlen, 512]. This is strange because CLIP generates a single 512-dim feature for the whole sentence, and we could alternatively get a 77x512 output by storing the token-level features.
Since MAD-V2 has different audio descriptions from V1, I think it is necessary to extract the language features again for the new version.
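To make the shape mismatch concrete, here is a small sketch of what I mean. The `wordlen = 7` caption and the random features are purely hypothetical stand-ins for the released MAD features; the point is only the contrast between the `[wordlen, 512]` layout I observe and the single 512-dim sentence vector (or fixed 77x512 token output) I would expect from CLIP's text encoder:

```python
import numpy as np

# Hypothetical caption features in the layout I observe in the
# released data: [wordlen, 512] for a 7-word audio description.
wordlen, dim = 7, 512
word_feats = np.random.randn(wordlen, dim).astype(np.float32)

# What I would expect from CLIP instead:
#   - sentence level: one 512-dim vector (from the EOT token, projected)
#   - token level:    [77, 512], i.e. the full padded context length
# Mean-pooling here is just a crude way to show the sentence-level shape.
sentence_like = word_feats.mean(axis=0)

print(word_feats.shape)     # (7, 512) -- per-word layout in the released files
print(sentence_like.shape)  # (512,)   -- shape of a CLIP sentence embedding
```

So my question is which of these the extraction pipeline actually produced, and how the variable `wordlen` dimension was obtained from CLIP's fixed-length token output.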
Looking forward to your reply. Thank you very much.