Open bxwldljh opened 9 months ago
Using BLIP to predict captions for all video frames is too computationally intensive, so we uniformly sample 16 frames and replicate them to make 64 frames (you can do the same, or use the 16-frame setting). Since the captions of very short video clips are almost identical, this has little impact on accuracy.
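The sampling-and-replication step described above could be sketched as follows. This is a minimal illustration, not the repository's actual code; the helper name `sample_and_replicate` and the default counts (16 sampled frames replicated to 64) are assumptions based on this thread.

```python
import numpy as np

def sample_and_replicate(num_total_frames, num_sampled=16, num_target=64):
    """Hypothetical helper: uniformly sample `num_sampled` frame indices
    from a video, then replicate each index so the result has
    `num_target` entries (e.g. 16 sampled frames -> 64 frames)."""
    # Evenly spaced indices covering the whole video
    idx = np.linspace(0, num_total_frames - 1, num_sampled).astype(int)
    # Repeat each sampled index to reach the target frame count
    idx = np.repeat(idx, num_target // num_sampled)
    return idx

# For a 300-frame video: 16 distinct frames, each repeated 4 times
indices = sample_and_replicate(300)
print(indices.shape)  # (64,)
```

Under this scheme the 64-frame input contains only 16 distinct captions, which matches the observation that caption features may effectively have a frame dimension of 16.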
I understand now, many thanks to you!
Hi, when I reproduce your code on the NExT-QA dataset, I find that the shape of cFeature (batch_size, num_frames, dim) is (64, 16, 768), but num_frames should be 64 for this dataset. So I want to know whether the file
text_features_blip_caption.h5
is incorrect?