YapengTian / AVVP-ECCV20

Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing, ECCV, 2020. (Spotlight)

The output shape of your Vggish script seems wrong. #9

Closed: MaFuyan closed this issue 2 years ago

MaFuyan commented 3 years ago

The embedding_batch produced by your script audio_feature_extractor.py has shape [10, 6, 4, 512], which differs from the predefined shape [len_data, 10, 128] of the audio_features.

MaFuyan commented 3 years ago

Problem solved. The define_vgg_slim function in vggish_slim.py returns the unflattened feature net1 instead of the expected net.
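
For anyone hitting the same mismatch, a small shape check can catch it before the features are saved. This is only a sketch built on the shapes quoted in this thread: the helper names is_embedding_shape and flattened_shape are hypothetical and not part of the repository, and it works on plain shape tuples rather than real tensors.

```python
from math import prod

# Per-segment feature size taken from the [len_data, 10, 128] shape in this issue.
EMBEDDING_DIM = 128


def is_embedding_shape(shape, embedding_dim=EMBEDDING_DIM):
    """Return True if shape looks like [num_segments, embedding_dim]."""
    return len(shape) == 2 and shape[1] == embedding_dim


def flattened_shape(shape):
    """Shape after collapsing every axis except the first (batch) axis."""
    return (shape[0], prod(shape[1:]))


reported = (10, 6, 4, 512)  # shape of embedding_batch reported above
expected = (10, 128)        # one video's slice of [len_data, 10, 128]

print(is_embedding_shape(reported))  # False
print(is_embedding_shape(expected))  # True
print(flattened_shape(reported))     # (10, 12288)
```

Note that flattening the reported tensor gives 12288 values per segment, not 128, so reshaping the output of net1 is not enough; the function needs to return net, as described above.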