ecoxial2007 / LGVA_VideoQA

Language-Guided Visual Aggregation for Video Question Answering

code reproduction problem #3

Open bxwldljh opened 9 months ago

bxwldljh commented 9 months ago

Hi, when I reproduce your code on the NExT-QA dataset, I find that the shape of cFeature (batch_size, num_frames, dim) is (64, 16, 768), but num_frames should be 64 for this dataset. So I want to know whether the file text_features_blip_caption.h5 is incorrect?

ecoxial2007 commented 9 months ago

Using BLIP to predict captions for all video frames is too computationally intensive, so we uniformly sample 16 frames and replicate their captions to make 64 frames (you can do the same, or use the 16-frame setting). Since the captions of adjacent frames in very short video clips are almost identical, this has little impact on accuracy.
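For anyone reproducing this, the replication step can be sketched roughly as follows. This is a minimal NumPy example, not the repo's actual preprocessing code; the array name and the use of random data are placeholders, and it assumes caption features are stored as (num_frames, dim):

```python
import numpy as np

# Hypothetical BLIP caption features for 16 uniformly sampled frames, 768-dim each.
caption_feats = np.random.randn(16, 768).astype(np.float32)

# Replicate each sampled frame's feature 4x along the frame axis
# to expand 16 frames into the 64-frame setting.
replicated = np.repeat(caption_feats, 64 // 16, axis=0)
print(replicated.shape)  # (64, 768)
```

`np.repeat` duplicates each row consecutively, so every group of 4 frames in the 64-frame tensor shares the caption of its nearest sampled frame.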

bxwldljh commented 9 months ago

I understand now, many thanks!