Closed — adeljalalyousif closed this issue 10 months ago
As you've noted, the inference time in Table 3 corresponds to generating captions for individual videos, given that the batch size is set to 1 during testing.
I/O and frame extraction are not factored into the table. Frame extraction is excluded because the reader is not optimized for extracting motion vectors and residuals.
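For illustration, a minimal sketch of how the forward-pass time for a single video can be isolated, assuming a PyTorch model; `model`, `inputs`, and the `generate` call are placeholders, and the pre-decoded inputs keep I/O and frame extraction outside the timed interval (this is not the authors' exact timing script):

```python
import time
import torch

@torch.no_grad()
def time_single_video(model, inputs, device="cuda"):
    """Time only the caption-generation forward pass for one video (batch size 1)."""
    model.eval().to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}  # already decoded/sampled
    if device == "cuda":
        torch.cuda.synchronize()  # make sure prior GPU work doesn't leak into the timing
    start = time.perf_counter()
    caption = model.generate(**inputs)  # placeholder generation call
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return caption, elapsed_ms
```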
For handling varying frame counts, we employ a uniform sampling of GOPs from the compressed video. In our experiments with the MSRVTT dataset, we sampled 8 GOPs, which is close to the average number for MSRVTT videos with keyint=60.
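A small sketch of what uniform GOP sampling could look like, assuming the sampled count is fixed at 8; the function name and the repeat-on-short-video behavior are illustrative, not the repository's exact implementation:

```python
import numpy as np

def sample_gop_indices(num_gops: int, num_samples: int = 8) -> np.ndarray:
    """Uniformly pick `num_samples` GOP indices from a video with `num_gops` GOPs.

    If the video has fewer GOPs than requested, indices repeat so the model
    always receives a fixed-length input (an assumption for this sketch).
    """
    # Evenly spaced positions over [0, num_gops - 1], rounded to integer indices.
    positions = np.linspace(0, num_gops - 1, num=num_samples)
    return positions.round().astype(int)

# Example: a video with 13 GOPs sampled down to 8
print(sample_gop_indices(13))  # -> [ 0  2  3  5  7  9 10 12]
```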
thanks
Thank you for sharing your code. Could you please provide additional details regarding the inference speed calculation in Fig. 2 and Table 3? I am a bit confused.
Regarding Table 3, where the inference time for your model is listed as 178 ms, could you specify whether this time corresponds to generating a caption for one video file?
Additionally, I would appreciate clarification on whether the time costs of IO operations and frame extraction are excluded from these calculations.
Lastly, the videos in MSRVTT have different numbers of frames, so how was this handled in Table 3? How many frames per video does your model consider?