chunhuizhang closed this issue 1 year ago.
Thank you for your interest.
For the C3D video feature extraction, please refer to this repo. For the lang_feature, we apply CLIP to the center frame of each snippet and extract the top-N most correlated words from the training vocabulary. For the sent_feature, we also use CLIP, but only its text encoder: for each ground-truth sentence, we apply the CLIP text encoder to obtain its embedding.
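To illustrate the lang_feature step, the top-N word selection can be viewed as a cosine-similarity ranking between the CLIP embedding of the center frame and the CLIP text embeddings of the vocabulary words. The sketch below is a minimal, hypothetical illustration: it uses toy 4-d vectors in place of real 512-d CLIP embeddings, and the function names (`top_n_words`, `cosine`) are illustrative, not taken from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_n_words(frame_emb, vocab_embs, n):
    """Rank vocabulary words by cosine similarity to the
    center-frame embedding and keep the n best matches."""
    scored = sorted(vocab_embs.items(),
                    key=lambda kv: cosine(frame_emb, kv[1]),
                    reverse=True)
    return [word for word, _ in scored[:n]]

# Toy embeddings standing in for real 512-d CLIP vectors.
frame = [0.9, 0.1, 0.0, 0.2]
vocab = {
    "chop": [0.8, 0.2, 0.1, 0.1],
    "stir": [0.1, 0.9, 0.0, 0.0],
    "pour": [0.0, 0.1, 0.9, 0.1],
}
print(top_n_words(frame, vocab, 2))  # → ['chop', 'stir']
```

In practice you would replace the toy vectors with the image embedding of the snippet's center frame and the text embeddings of the training vocabulary, both produced by the same CLIP model so that the similarities are comparable.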
Hope this helps!
Many thanks, I'll give it a try later.
For C3D, there are several different architectures: https://github.com/vhvkhoa/SlowFast/blob/master/MODEL_ZOO.md Could you please share which architecture you used to extract the YouCook2 video features? Or did you train a standard C3D model from scratch? Would it be possible to share the C3D video feature, lang_feature, and sent_feature files? Thank you so much!
@yueyue0401 @chunhuizhang We have published the code for our new work: https://github.com/UARK-AICV/VLTinT The repo contains more details about how to extract the features. I hope this is helpful to you.
Thanks for this fantastic work as well. I am new to this research topic; if convenient, could you please share some common methods to generate the C3D video feature file (100 x 2048), the lang_feature file (100 x 100 words), and the sent_feature file (nsent x 512)?