UARK-AICV / VLCAP

[ICIP 2022] VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning
https://ieeexplore.ieee.org/document/9897766

data Preparation #2

Closed chunhuizhang closed 1 year ago

chunhuizhang commented 2 years ago

Thanks for this fantastic work. I am new to this research topic. Could you please share some common methods to generate the C3D video feature file (100*2048), the _lang_feature file (100*100 words), and the _sent_feature file (nsent*512), if convenient?

Kashu7100 commented 2 years ago

Thank you for your interest.

For the C3D video feature extraction, please refer to this repo. For the lang_feature, we apply CLIP to the center frame of each snippet and extract the top-N correlated words from the training vocabulary. For the sent_feature, we also use CLIP, but only its text encoder: for each GT sentence, we apply the CLIP text encoder to obtain the embedding features.
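A minimal sketch of this idea, assuming the openai/CLIP package with the ViT-B/32 backbone (512-d outputs, matching the nsent*512 shape mentioned above); the vocabulary, frame path, and top-N value below are illustrative placeholders, not the exact setup used in the paper:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# --- lang_feature: top-N vocabulary words most correlated with a snippet's center frame ---
vocab = ["cut", "onion", "pan", "oil", "stir"]            # placeholder training vocabulary
image = preprocess(Image.open("center_frame.jpg")).unsqueeze(0).to(device)  # placeholder frame

with torch.no_grad():
    image_feat = model.encode_image(image)                # (1, 512)
    vocab_feat = model.encode_text(clip.tokenize(vocab).to(device))  # (|V|, 512)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    vocab_feat /= vocab_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ vocab_feat.T).squeeze(0)         # cosine similarity per vocab word
    top_n = sims.topk(min(3, len(vocab))).indices         # keep the N most correlated words
    lang_words = [vocab[i] for i in top_n.tolist()]

# --- sent_feature: CLIP text-encoder embedding for each GT sentence ---
sentences = ["a person chops an onion on a cutting board"]  # placeholder GT sentences
with torch.no_grad():
    sent_feature = model.encode_text(
        clip.tokenize(sentences, truncate=True).to(device)
    )                                                      # (n_sent, 512)
```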

Hope this helps!

chunhuizhang commented 2 years ago

Many thanks, I'll give it a try later.

yueyue0401 commented 1 year ago

For C3D, there are several different architectures: https://github.com/vhvkhoa/SlowFast/blob/master/MODEL_ZOO.md Could you please share which architecture you used to extract the YouCook2 video features? Or did you train a standard C3D model from scratch? Would it be possible to share the C3D video feature, lang_feature, and sent_feature files? Thank you so much!

Kashu7100 commented 1 year ago

@yueyue0401 @chunhuizhang We have published the code for our new work: https://github.com/UARK-AICV/VLTinT. The repo provides more details about how to extract the features. I hope this is helpful to you.