ecoxial2007 / LGVA_VideoQA

Language-Guided Visual Aggregation for Video Question Answering
About bbox_features in NeXt-clip-bbox-features.zip #12

Open xizewu96 opened 5 months ago

xizewu96 commented 5 months ago

Thank you very much for the publicly available source code and dataset.

I have two questions that I hope to receive your response to:

  1. In NeXt-clip-bbox-features.zip, each h5 file has shape (64, 2, 10, 768). What do the dimensions of size 2 and 10 represent? In model.py I see `rFeature = item_dict['bbox_features'][:, :, 0, :, :]`. Could you explain what (64, 0, 10, 768) and (64, 1, 10, 768) each refer to?

  2. The NExT-QA dataset seems to have 5,440 videos in total, but both NeXt-clip-features and NeXt-clip-bbox-features contain 9,454 h5 files. Why do the counts differ?
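To make the first question concrete, here is a minimal sketch of what that slice does to the shape. It uses a dummy NumPy array rather than the real h5 data, and the leading batch axis of size 1 is an assumption inferred from the five-axis index in model.py. It only shows which axis is being indexed; it does not tell me what the axes of size 2 and 10 mean.

```python
import numpy as np

# Dummy stand-in for a batched item_dict['bbox_features'].
# Assumption: a leading batch axis (size 1 here) precedes the per-file
# shape (64, 2, 10, 768), since the index in model.py spans five axes.
bbox_features = np.zeros((1, 64, 2, 10, 768), dtype=np.float32)

# Mark the two entries along the size-2 axis so they can be told apart.
bbox_features[:, :, 1] = 1.0

# The slice from model.py keeps entry 0 of the size-2 axis and drops that axis.
r_feature = bbox_features[:, :, 0, :, :]
# The entry that the slice leaves unused.
other = bbox_features[:, :, 1, :, :]

print(r_feature.shape)  # (1, 64, 10, 768)
print(other.shape)      # (1, 64, 10, 768)
```

So only one of the two (64, 10, 768) blocks per file reaches the model through this line, which is why I would like to know what each block contains.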

I look forward to your reply. Thank you very much!