Soldelli / MAD

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
MIT License

CLIP text encoder #16

Closed: shiqubianbuhui closed this issue 3 months ago

shiqubianbuhui commented 3 months ago

Great work! I would like to know how the text features in your h5 file were extracted. You state that you use CLIP ViT-B/32, but when I run the pretrained CLIP text encoder, my results differ from the features stored in the h5 file. Could you explain how the text features were extracted? This is how I extract them with ViT-B/32:

    import clip
    model, process = clip.load("ViT-B/32", device='cuda')
    token = clip.tokenize(text, context_length=77).cuda()
    cls = model.encode_text(token)
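To make "different" more concrete, it may help to quantify the gap between a freshly extracted feature and the one stored in the h5 file. The following is only a minimal sketch in plain Python: `compare_features`, `l2_normalize`, and the toy vectors are hypothetical names and values introduced for illustration, and nothing here reflects how the MAD authors actually produced their features.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def compare_features(mine, stored):
    """Return (cosine similarity, max abs diff on raw values,
    max abs diff after L2-normalizing both vectors)."""
    if len(mine) != len(stored):
        raise ValueError(f"dimension mismatch: {len(mine)} vs {len(stored)}")
    a, b = l2_normalize(mine), l2_normalize(stored)
    cosine = sum(x * y for x, y in zip(a, b))
    max_abs_raw = max(abs(x - y) for x, y in zip(mine, stored))
    max_abs_normed = max(abs(x - y) for x, y in zip(a, b))
    return cosine, max_abs_raw, max_abs_normed


# Hypothetical toy vectors: `stored` is `mine` scaled by 2, so only the norm differs.
mine = [1.0, 2.0, 2.0]
stored = [2.0, 4.0, 4.0]
cosine, max_abs_raw, max_abs_normed = compare_features(mine, stored)
print(f"cosine={cosine:.6f} raw_diff={max_abs_raw:.6f} normed_diff={max_abs_normed:.6f}")
```

As a rough reading guide, and only as an assumption about likely causes: a cosine close to 1 with a large raw difference would point to a scaling or normalization difference, very small differences everywhere would be consistent with a precision difference (for example half precision on GPU versus full precision), and a low cosine would suggest a different model, tokenization, or pooling step.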