facebookresearch / AudioMAE

This repo hosts the code and models of "Masked Autoencoders that Listen".

Reproducing the downstream task performance #8

Open Sara-Ahmed opened 1 year ago

Sara-Ahmed commented 1 year ago

Thanks for the nice work. I have several questions that I would really appreciate your input on.

1) I tried to reproduce the results on the ESC-50 dataset using your shared pre-trained weights and the provided fine-tuning file, but I could only reach 89.5%, while the reported results are above 94%. Any idea why the performance is lower? I also loaded the pre-trained weights into the AST framework for fine-tuning and got 89% accuracy.

2) Why is there such a large performance jump on SID despite not using an external speech dataset?

3) Lastly, the proposed paper is quite similar to MAE-AST, which essentially applies MAE to audio; nevertheless, there is a large performance gap between MAE-AST and your reported results. What are your thoughts on that?

Thanks a lot,

yangyangshuyang commented 1 year ago

Excuse me, I am a beginner in the audio field, and I would like to ask how to apply the pre-trained model to the ESC-50 dataset. The audio clips in ESC-50 are 5 seconds long, but in the visualization demo provided by the author, the pre-trained model's input seems to be 10 seconds. When I use this model to reconstruct audio from ESC-50, the resulting spectrogram is full of noise. Could you tell me how to use the pre-trained model to reconstruct and visualize the audio data in ESC-50?
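One common way to bridge the length mismatch described above is to pad (or crop) the log-mel spectrogram to the fixed frame count the pre-trained model expects. The sketch below assumes AudioMAE's 1024-frame × 128-mel input (10 s of audio at a 10 ms frame hop, as in the paper); the helper name `pad_or_crop` and the zero-padding choice are illustrative assumptions, not the repo's actual API.

```python
import numpy as np

# Assumed input geometry for the pre-trained AudioMAE model:
# 1024 frames (10 s at ~100 frames/s) x 128 mel bins.
TARGET_FRAMES = 1024
N_MELS = 128

def pad_or_crop(fbank: np.ndarray, target_frames: int = TARGET_FRAMES) -> np.ndarray:
    """Pad a (frames, n_mels) log-mel spectrogram with zeros up to
    target_frames, or crop it if it is longer. Hypothetical helper."""
    n_frames = fbank.shape[0]
    if n_frames < target_frames:
        pad = np.zeros((target_frames - n_frames, fbank.shape[1]), dtype=fbank.dtype)
        return np.concatenate([fbank, pad], axis=0)
    return fbank[:target_frames]

# A 5 s ESC-50 clip yields roughly 512 frames at a 10 ms hop;
# pad it up to the 1024-frame model input.
five_sec_fbank = np.random.randn(512, N_MELS).astype(np.float32)
model_input = pad_or_crop(five_sec_fbank)
print(model_input.shape)  # (1024, 128)
```

With this shape matched, the padded region can simply be masked or ignored when visualizing the reconstruction, so only the first ~512 frames correspond to real ESC-50 audio.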