Open cyrusvahidi opened 12 months ago
Ok I managed to reproduce:
Zeroshot Classification Results: mean_rank: 2.7344 median_rank: 1.0000 R@1: 0.5925 R@5: 0.8981 R@10: 0.9525 mAP@10: 0.7200
Accuracy: 0.5925 over 1600 samples
It seems top-5 accuracy was reported in the paper. I was confused, as Section 4.3 of the paper states "We use top-1 accuracy as the metric."
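For anyone comparing the two numbers, here is a minimal sketch of how top-1 and top-5 accuracy can be computed from an audio-to-text similarity matrix. It is not the evaluation code from the CLAP repo: `topk_accuracy` and all array names are hypothetical, and the embeddings are random toy data (1600 clips and 50 classes, matching the counts mentioned in this issue) rather than model outputs.

```python
import numpy as np

def topk_accuracy(similarity, labels, k):
    """Fraction of rows whose true class is among the k highest-scoring columns."""
    # Indices of the k largest similarities per row (order inside the top k does not matter).
    topk = np.argpartition(-similarity, k - 1, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data standing in for real embeddings (assumption: not CLAP outputs).
rng = np.random.default_rng(0)
n_samples, n_classes, dim = 1600, 50, 64
labels = rng.integers(0, n_classes, size=n_samples)
text_emb = rng.normal(size=(n_classes, dim))
# Each "audio" embedding is a noisy copy of its class's text embedding.
audio_emb = text_emb[labels] + 2.0 * rng.normal(size=(n_samples, dim))

# Cosine similarity: L2-normalise both sides, then take dot products.
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
audio_emb /= np.linalg.norm(audio_emb, axis=1, keepdims=True)
similarity = audio_emb @ text_emb.T  # shape (n_samples, n_classes)

top1 = topk_accuracy(similarity, labels, k=1)
top5 = topk_accuracy(similarity, labels, k=5)
print(f"top-1: {top1:.4f}  top-5: {top5:.4f}")
```

With one correct class per clip, R@1 and R@5 in the log above should correspond to `k=1` and `k=5` here, and top-5 can never be lower than top-1.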
Overview
I have attempted to reproduce the zeroshot classification results for ESC-50 reported in the paper "Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation".
In the paper, zeroshot classification accuracy (top-1) for the best model (K2C aug) is reported at 91.0%. I assume that this is the `630k-audioset-best.pt` checkpoint. However, I get only 60.2% top-1 accuracy for the ESC-50 dataset.
Reproduce
I use the set of 50 unique captions in the test dataset, which are found in the `text` attr of each example's json file, e.g. "The sound of the crow".
Here's the loader for ESC-50:
And the zeroshot retrieval script:
Hopefully I am missing something significant?