HanielF closed this issue 2 years ago.
Hi, if you only use the region-based object features from multiple frames, you may lose the relationship information among the different objects, which would lead to poor performance in models that rely on global video features.
If you want to derive global video features from BUTD, you can first extract a global image feature for each frame from BUTD, as this paper did (image-wise pooling of activations): https://www.cv-foundation.org/openaccess/content_cvpr_2016_workshops/w12/papers/Salvador_Faster_R-CNN_Features_CVPR_2016_paper.pdf, and then average-pool them into a global video feature, which is in turn used to train your model.
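Here is a minimal sketch of that two-step pooling, assuming each frame's BUTD .npz stores a `features` array of shape (num_regions, dim); the key name and the helper below are illustrative and depend on your extraction script:

```python
import glob

import numpy as np


def video_feature_from_butd(frame_npz_dir):
    """Average-pool per-frame BUTD features into one global video feature.

    Assumes each .npz holds a `features` array of shape (num_regions, dim);
    the key name depends on the extraction script used.
    """
    frame_feats = []
    for path in sorted(glob.glob(frame_npz_dir + "/*.npz")):
        regions = np.load(path)["features"]       # (num_regions, dim)
        frame_feats.append(regions.mean(axis=0))  # image-wise pooling -> (dim,)
    return np.stack(frame_feats).mean(axis=0)     # pool over frames  -> (dim,)


# e.g. video_feat = video_feature_from_butd("msrvtt/video0_frames")
```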
Thank you very much~
I notice that BUTD outputs an .npz file for each single image. When I want to do video captioning based on xmodaler, a global video feature is required.
How can I extract the final video feature from the BUTD outputs of multiple frames?
On the MSR-VTT dataset, I attempted to use the top-N objects voted by multiple frames sampled uniformly from each video, but the resulting captions are of poor quality: the BLEU score on the validation and test sets only reaches 0.6, and many <UNK> tokens appear in the captions.
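Roughly, the voting I tried looks like the sketch below, assuming each frame's .npz also stores a `cls_prob` array of per-region class probabilities (both key names are illustrative and may differ from your extraction script):

```python
from collections import Counter
import glob

import numpy as np


def top_n_region_features(frame_npz_dir, n=10):
    """Keep region features whose predicted class is among the top-N classes
    voted across frames sampled uniformly from one video.

    Assumes each .npz has `features` (num_regions, dim) and `cls_prob`
    (num_regions, num_classes); both key names depend on the extractor.
    """
    votes, frames = Counter(), []
    for path in sorted(glob.glob(frame_npz_dir + "/*.npz")):
        data = np.load(path)
        classes = data["cls_prob"].argmax(axis=1)  # per-region class id
        votes.update(classes.tolist())             # one vote per region
        frames.append((data["features"], classes))

    top_classes = [c for c, _ in votes.most_common(n)]
    kept = [feats[np.isin(classes, top_classes)]   # drop regions outside top-N
            for feats, classes in frames]
    return np.concatenate(kept, axis=0)            # (num_kept_regions, dim)
```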