gabeur / mmt

Multi-Modal Transformer for Video Retrieval
http://thoth.inrialpes.fr/research/MMT/
Apache License 2.0
256 stars 41 forks

S3D code for extracting the motion feature #11

Closed lininglouis closed 3 years ago

lininglouis commented 3 years ago

Hi, thanks for sharing the code. Could you please share the S3D code you use for extracting the motion features? I could only find a non-official S3D implementation at https://github.com/kylemin/S3D. I would appreciate your reply.

gabeur commented 3 years ago

Sorry, we cannot share the feature extraction code. The checkpoint to extract the S3D features is available here.

lininglouis commented 3 years ago

> Sorry, we cannot share the feature extraction code. The checkpoint to extract the S3D features is available here.

Cool! That's enough. Thanks for your quick reply.

lininglouis commented 3 years ago

Hi, Gabeur, May I know the way you precompute the S3D features? According to the pentathlon challenge. "Frames are extracted at 10fps and processed in clips of 32 frames with a stride of 25 frames." pentathlon But i dont think you use this way, because the number of S3D features(1024 features) you calculate for each video is similar to the video duration(for example, a video of 11 seconds will have S3D features in the dimension of (11, 1024) in MMT.
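
For reference, a back-of-the-envelope count of how many clips the pentathlon recipe would yield (a hypothetical helper, not code from this repo or from the challenge):

```python
# Hypothetical sanity check: how many clips does the pentathlon recipe
# (10 fps, 32-frame clips, stride of 25 frames) produce for an 11-second
# video, versus the roughly one-feature-per-second counts seen in MMT?
def num_clips(duration_s, fps=10, clip_len=32, stride=25):
    total_frames = int(duration_s * fps)
    if total_frames < clip_len:
        return 0
    return (total_frames - clip_len) // stride + 1

print(num_clips(11))  # -> 4, far from the ~11 feature vectors seen in MMT
```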

I'm wondering how you sample and extract the S3D features. I tried two ways of extracting them; here are the results.

[Screenshot: S3D retrieval results for the two extraction variants]

The S3D I used is from model (the S3D checkpoint you provided earlier seems corrupted somehow; I cannot load the pretrained weights, so I switched to this S3D version).

As you can see, there still remains a gap. It could be a problem with the S3D model I used, or the way I extract the S3D features could differ from yours. Could you give some advice? Thanks!

gabeur commented 3 years ago

This is how we precompute the S3D features: each segment is 1 second long with no overlap, and the FPS is kept at 30. So each segment has 30 frames, the input size is 30x224x224x3, and the output of S3D is averaged to 1x1x1x1024.

Your results look pretty close. I think it is important to report average results over several experiments before drawing conclusions, because there is significant variation with respect to the random seed.
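
For concreteness, a minimal sketch of that segment scheme (assuming a pretrained PyTorch S3D backbone `s3d` that maps a (1, 3, 30, 224, 224) clip to a (1, 1024, T', H', W') feature map, e.g. the kylemin/S3D implementation with its classifier head removed; this is an illustration, not the authors' extraction code):

```python
import torch

@torch.no_grad()
def extract_s3d_features(frames, s3d):
    """frames: float tensor of shape (num_frames, 224, 224, 3), RGB, sampled at 30 fps.
    Returns one 1024-d vector per non-overlapping 1-second (30-frame) segment."""
    seg_len = 30  # 1 second at 30 fps
    num_segments = frames.shape[0] // seg_len
    feats = []
    for i in range(num_segments):
        seg = frames[i * seg_len:(i + 1) * seg_len]    # (30, 224, 224, 3)
        clip = seg.permute(3, 0, 1, 2).unsqueeze(0)    # (1, 3, 30, 224, 224)
        fmap = s3d(clip)                               # assumed (1, 1024, T', H', W')
        feats.append(fmap.mean(dim=(2, 3, 4)))         # spatio-temporal average -> (1, 1024)
    if not feats:
        return torch.zeros(0, 1024)
    return torch.cat(feats, dim=0)                     # (num_segments, 1024)
```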

lininglouis commented 3 years ago

> This is how we precompute the S3D features: each segment is 1 second long with no overlap, and the FPS is kept at 30. So each segment has 30 frames, the input size is 30x224x224x3, and the output of S3D is averaged to 1x1x1x1024.
>
> Your results look pretty close. I think it is important to report average results over several experiments before drawing conclusions, because there is significant variation with respect to the random seed.

Hi Gabeur, we used the approach you suggested, and the performance of the S3D features is similar now. Thanks a lot!

But we ran into some problems with the audio features (VGGish). There are two questions I hope you could help with.

  1. In the h5 files you provided under the vid_feat_files/mult_h5 folder, the data has the keys features.vggish and features.audio. Is there any difference between these two features? Are they both used by the model?

  2. Did you use the default way to extract the VGGish features, as described in the CE paper?


I noticed that, according to the CE paper and the VGGish TensorFlow repo, the audio features should be parsed into non-overlapping 0.96 s collections of frames. But in MMT's expert_timings.py, the expert timing for vggish has a feat_width of 1.0. It looks like you parse the audio features into 1.0 s collections of frames.

Since there is a 0.04 s difference, did you resample the data or align the VGGish features somehow? If so, may I know how the VGGish features were calculated? Please correct me if my understanding is not right.

Many thanks for your help!

gabeur commented 3 years ago

> In the h5 files you provided under the vid_feat_files/mult_h5 folder, the data has the keys features.vggish and features.audio. Is there any difference between these two features? Are they both used by the model?

features.audio are the audio features extracted by the authors of CE. features.vggish are the audio features extracted by us. We only use the features.vggish audio features for the results reported in the paper.
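
For anyone double-checking which features they are loading, a quick way to list the keys stored in one of those h5 files (a hypothetical snippet using h5py; the file path is just an example):

```python
import h5py

# Hypothetical inspection snippet: print every group/dataset name in the file;
# the listing should include both features.audio and features.vggish entries.
with h5py.File("vid_feat_files/mult_h5/some_split.h5", "r") as f:  # example path
    f.visit(print)
```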

> Did you use the default way to extract the VGGish features, as described in the CE paper?

For obtaining the features.vggish, we used the same approach as the authors of CE, except that our window size is 1.0 s.
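
As an illustration of that framing (a sketch only: `vggish_embed` below is a stand-in for a pretrained VGGish forward pass returning a 128-d embedding, not a function from this repo), splitting the waveform into non-overlapping 1.0 s windows instead of VGGish's default 0.96 s could look like:

```python
import numpy as np

def audio_to_vggish_feats(waveform, sample_rate, vggish_embed, window_s=1.0):
    """Frame the audio into non-overlapping 1.0 s windows (instead of the default
    0.96 s) and embed each window; `vggish_embed` is a hypothetical stand-in for
    a pretrained VGGish forward pass returning a 128-d vector per window."""
    win = int(round(window_s * sample_rate))
    num_windows = len(waveform) // win
    feats = [vggish_embed(waveform[i * win:(i + 1) * win], sample_rate)
             for i in range(num_windows)]
    return np.stack(feats) if feats else np.empty((0, 128))
```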