It seems that in your training/eval data there is only one 2048-d 2D feature and one 2048-d 3D feature per sentence. However, the feature extractor at https://github.com/antoine77340/video_feature_extractor produces n×2048 features per sentence (where n is the sentence duration in seconds for 2D, and roughly n/1.5 for 3D). How do I aggregate these n×2048 features into a single 2048-d feature via temporal max-pooling, as described in your paper? Do I just take the max value along each dimension?
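For reference, here is a minimal sketch of what I mean, assuming the extractor output is an `(n, 2048)` tensor (the variable names are just placeholders, not from your code):

```python
import torch

# Hypothetical (n, 2048) output from video_feature_extractor:
# n time steps, each a 2048-d feature vector.
features = torch.randn(10, 2048)

# Temporal max-pooling: take the maximum over the time axis
# independently for each of the 2048 dimensions, producing
# a single clip-level (2048,) vector.
pooled, _ = features.max(dim=0)
print(pooled.shape)  # torch.Size([2048])
```

Is this the aggregation you used?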