Reproducibility of VGGSound

LAION-AI / CLAP

Contrastive Language-Audio Pretraining

https://arxiv.org/abs/2211.06687

Creative Commons Zero v1.0 Universal


Reproducibility of VGGSound #86

Open GenjiB opened 1 year ago

GenjiB commented 1 year ago

Thanks for sharing the amazing codebase. I am wondering if you will provide the script or other useful resources for reproducing results on VGGSound.

I tried using get_audio_embedding_from_filelist to extract audio features, but I only get ~45% accuracy, far below the reported 75% (which is really impressive, since A+V only reaches 64.1%).

Looking forward to your reply.

lukewys commented 1 year ago

Hi, the 45% you got is the zero-shot classification performance, close to the 46.2% reported in our paper. The 75% comes from supervised fine-tuning of the audio encoder, so to reach it you need to fine-tune on VGGSound in a supervised manner.

Best, Yusong
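For readers comparing the two settings, zero-shot classification can be sketched as follows: each clip is assigned the class whose text embedding has the highest cosine similarity to the clip's audio embedding, with no training on VGGSound. This is a minimal illustration, not the evaluation script used for the paper. The toy 2-D vectors stand in for real model outputs, the helper names are hypothetical, and the prompt template is an assumption that the thread does not confirm.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def zero_shot_predict(audio_embs, text_embs):
    """For each audio embedding, return the index of the most similar class text embedding."""
    text_n = [l2_normalize(t) for t in text_embs]
    preds = []
    for a in audio_embs:
        a_n = l2_normalize(a)
        # Cosine similarity is the dot product of unit-length vectors.
        sims = [sum(x * y for x, y in zip(a_n, t)) for t in text_n]
        preds.append(max(range(len(sims)), key=sims.__getitem__))
    return preds


def top1_accuracy(preds, labels):
    """Fraction of clips whose predicted class matches the ground-truth label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# One text prompt per class; the template below is an assumption.
class_names = ["dog barking", "playing violin"]
prompts = [f"This is a sound of {name}." for name in class_names]

# Toy embeddings (hypothetical values). In practice these would come from
# the model's text encoder (for `prompts`) and audio encoder (for the clips).
text_embs = [[1.0, 0.0], [0.0, 1.0]]
audio_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

preds = zero_shot_predict(audio_embs, text_embs)
accuracy = top1_accuracy(preds, labels)
print(preds, accuracy)
```

The supervised setting differs in that the audio encoder (plus a classification head) is trained on VGGSound labels, so the two numbers are not directly comparable.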

AkonLau commented 1 year ago

Excuse me! I have the same trouble reproducing the VGGSound results in the zero-shot setting. I only get a top-1 accuracy of about 28.4% when running the pre-trained model 630k-audioset-best.pt on the vanilla VGGSound list test.csv.

Zeroshot Classification Results: mean_rank: 25.6106 median_rank: 4.0000 R@1: 0.2841 R@5: 0.5516 R@10: 0.6609 mAP@10: 0.3979

I am wondering if you will provide the script or other useful resources for reproducing results on VGGSound.

Looking forward to your reply.
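To make the log above easier to interpret, here is a small sketch of how metrics with these names are commonly derived from the rank of the true class in each clip's similarity ordering. It assumes one ground-truth label per clip, so that R@1 equals top-1 accuracy and mAP@10 reduces to the mean reciprocal rank truncated at 10. Whether the repository computes them exactly this way is an assumption; the function names and toy similarities are hypothetical.

```python
from statistics import mean, median


def rank_of_true_class(similarities, true_idx):
    """1-based rank of the true class when classes are sorted by descending similarity."""
    true_score = similarities[true_idx]
    return 1 + sum(1 for s in similarities if s > true_score)


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Summarize a list of 1-based ranks into rank statistics, R@k, and mAP@10."""
    metrics = {"mean_rank": mean(ranks), "median_rank": median(ranks)}
    for k in ks:
        # R@k: fraction of clips whose true class is within the top k.
        metrics[f"R@{k}"] = mean([1.0 if r <= k else 0.0 for r in ranks])
    # With a single relevant class, average precision at 10 is 1/rank
    # when the true class is in the top 10, and 0 otherwise.
    metrics["mAP@10"] = mean([1.0 / r if r <= 10 else 0.0 for r in ranks])
    return metrics


# Toy similarity rows (clip x class) with hypothetical values.
sims = [
    [0.9, 0.1, 0.3],  # true class 0 is ranked 1st
    [0.5, 0.2, 0.7],  # true class 0 is ranked 2nd
    [0.4, 0.6, 0.1],  # true class 2 is ranked 3rd
    [0.2, 0.8, 0.3],  # true class 1 is ranked 1st
]
labels = [0, 0, 2, 1]

ranks = [rank_of_true_class(s, y) for s, y in zip(sims, labels)]
metrics = retrieval_metrics(ranks)
print(ranks, metrics)
```

Under this reading, the reported R@1 of 0.2841 is the 28.4% top-1 accuracy, and the median rank of 4 means that for half of the clips the correct class is among the four most similar classes.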

yljblues commented 10 months ago

Hello! I have the same problem reproducing the zero-shot classification results on the VGGSound dataset. With the checkpoint 630k-audioset-best.pt, I get 29.83% top-1 accuracy on the VGGSound test set.

Could you give some instructions for reproducing the results on VGGSound?

Many thanks in advance for your reply!