YehLi / xmodaler

X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).

How to extract a global video feature based on butd? #34

Closed HanielF closed 2 years ago

HanielF commented 2 years ago

I notice that BUTD outputs an .npz file for each single image. When I want to do video captioning with xmodaler, it requires a global video feature.

How can I derive the final video feature from the BUTD outputs of multiple frames?

On the MSRVTT dataset, I tried using the top-N objects voted by multiple frames sampled uniformly from a video, but the resulting video captions are of poor quality: BLEU on the evaluation and test sets only reaches 0.6, and the captions contain many <UNK> tokens.
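For reference, the frame-voting scheme described above can be sketched as follows. This is a minimal illustration, assuming you already have the detected object class IDs per frame from the BUTD detector; the function name is hypothetical:

```python
from collections import Counter

def top_n_objects(frame_object_ids, n):
    """Vote across frames: count how many frames each detected object
    class appears in, and keep the n most frequently seen classes."""
    votes = Counter()
    for ids in frame_object_ids:
        votes.update(set(ids))  # count each class at most once per frame
    return [cls for cls, _ in votes.most_common(n)]
```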

winnechan commented 2 years ago

Hi, if you only utilize the region-based features of objects from multiple frames, you may lose the information about relationships among different objects, which would lead to poor performance for models that expect global video features.

If you want to derive global video features from BUTD, you can first extract a global image feature for each frame from BUTD as this paper does (image-wise pooling of activations) https://www.cv-foundation.org/openaccess/content_cvpr_2016_workshops/w12/papers/Salvador_Faster_R-CNN_Features_CVPR_2016_paper.pdf, and then average-pool them into a global video feature, which is, in turn, used to train your model.

HanielF commented 2 years ago

Thank you very much~