Closed rehana-mahfuz closed 1 year ago
hi there,
First, if you are only interested in audio, rather than multi-modal, you could maybe start with the AST colab script, which is self-contained, and print the logits out: https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb
when I get the result from CAVMAEFT using mode='audioonly', the entire tensor of length 527 has negative values, and it doesn't seem to sum to 1.
It won't sum to 1 because it is a multi-label classification (each audio clip may contain more than 1 label). The model is fine-tuned with BCE loss (each class independent).
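Since the head is BCE-trained, each logit is independent; to turn raw (possibly all-negative) logits into per-class probabilities you apply a sigmoid, not a softmax. A minimal sketch (the logit values here are made up for illustration; the real output is a length-527 tensor):

```python
import numpy as np

def sigmoid(x):
    # Independent per-class probability for a multi-label (BCE-trained) head
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logits for 5 of the 527 AudioSet classes; all negative
logits = np.array([-1.2, -3.5, -0.4, -5.0, -2.1])
probs = sigmoid(logits)
# Each probability lies in (0, 1), but the vector need not sum to 1,
# because the classes are not mutually exclusive.
```

With a softmax the values would sum to 1, which is only appropriate for single-label classification.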
When I argsort and look at the top few tags (in spite of the negative values), they seem somewhat correct (on unseen audio data).

This is OK, but a better method is to set a separate threshold for each class.
Is getting an entire array of negative values (which doesn't sum to 1) expected?
This depends on the audio; I don't know whether your setup is correct. Can you input your audio to the colab script above and check whether the logits are all negative?
To check if you have everything set correctly, you should run a test on the AudioSet evaluation set and see if mAP matches with what we reported.
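The mAP check can be done without any extra dependencies. A small sketch of the metric (class-wise average precision, then the mean over classes, as is standard for AudioSet evaluation; the data here is synthetic):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: mean of the precision values at the ranks
    where positives occur, with predictions sorted by score (labels are 0/1)."""
    order = np.argsort(scores)[::-1]
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    return np.sum(precision * labels) / max(labels.sum(), 1)

def mean_average_precision(score_matrix, label_matrix):
    # mAP over classes (columns): average the per-class APs
    return np.mean([average_precision(score_matrix[:, j], label_matrix[:, j])
                    for j in range(score_matrix.shape[1])])
```

If the number you get on the AudioSet evaluation set is far from the reported 46.6 mAP, something in the pipeline (feature extraction, normalization, checkpoint loading) is likely off.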
-Yuan
Thanks for pointing me to the demo script. Do you have thresholds for each class? Thanks!
No, I don't. This depends on the data. But it would be fine to start with something easy, e.g., sort the logits and output the classes with the largest values (this is what the AST colab script does), or use the same threshold for every class, etc.
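The "sort and take the top few" option is a one-liner. A sketch with randomly generated stand-in logits (in practice you would map the indices to class names via the AudioSet label CSV):

```python
import numpy as np

# Stand-in for the length-527 logit vector from the model
rng = np.random.default_rng(0)
logits = rng.normal(loc=-3.0, scale=1.5, size=527)

k = 5
top_k = np.argsort(logits)[::-1][:k]  # indices of the k largest logits
```

Because the logits are independent, their absolute values don't matter for ranking; only the relative order does, which is why the all-negative tensor can still give sensible top tags.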
If you wish to do this more carefully, you can run inference with the model on your validation set, observe the logits of each class, and decide a threshold for each class.
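One common way to pick per-class thresholds (this is a generic recipe, not something specific to this repo) is to sweep candidate thresholds on validation-set probabilities and keep, for each class, the one with the best F1:

```python
import numpy as np

def per_class_thresholds(val_probs, val_labels,
                         candidates=np.linspace(0.05, 0.95, 19)):
    """Pick, per class, the candidate threshold with the best F1 on a
    validation set. val_probs: (N, C) sigmoid outputs; val_labels: (N, C) 0/1."""
    n, c = val_probs.shape
    thresholds = np.zeros(c)
    for j in range(c):
        best_f1, best_t = -1.0, 0.5
        for t in candidates:
            pred = val_probs[:, j] >= t
            tp = np.sum(pred & (val_labels[:, j] == 1))
            fp = np.sum(pred & (val_labels[:, j] == 0))
            fn = np.sum(~pred & (val_labels[:, j] == 1))
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom > 0 else 0.0
            if f1 > best_f1:
                best_f1, best_t = f1, t
        thresholds[j] = best_t
    return thresholds
```

F1 is just one reasonable criterion; if your application penalizes false positives and false negatives differently, optimize precision or recall at a fixed operating point instead.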
Okay thanks.
Again, if you are only interested in audio tagging, you could start with something simpler, e.g., https://huggingface.co/spaces/yuangongfdu/whisper-at.
The CAV-MAE is mainly for multi-modal applications, though it does have strong audio tagging performance.
Using the following pretrained model for audio tagging (based on the AudioSet ontology):
| Pretrained Model | Pretrain Data | Finetune Data | Performance |
|---|---|---|---|
| CAV-MAE-Scale+ | AudioSet-2M (multi-modal) | AudioSet-2M (audio) | 46.6 mAP |