Closed rehana-mahfuz closed 1 year ago
hi there,
First, if you are only interested in audio, rather than multi-modal, you could maybe start with the AST colab script, which is self-contained, and print the logits out: https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb
when I get the result from CAVMAEFT using mode='audioonly', the entire tensor of length 527 has negative values, and it doesn't seem to sum to 1.
It won't sum to 1 because it is a multi-label classification (each audio clip may contain more than 1 label). The model is fine-tuned with BCE loss (each class independent).
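Since the head is BCE-trained, each logit is independent; to turn raw (possibly all-negative) logits into per-class probabilities you apply a sigmoid, not a softmax. A minimal sketch (the logit values here are made up for illustration; the real output is a length-527 tensor):

```python
import numpy as np

def sigmoid(x):
    # Independent per-class probability for a multi-label (BCE-trained) head
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logits for 5 of the 527 AudioSet classes; all negative
logits = np.array([-1.2, -3.5, -0.4, -5.0, -2.1])
probs = sigmoid(logits)
# Each probability lies in (0, 1), but the vector need not sum to 1,
# because the classes are not mutually exclusive.
```

With a softmax the values would sum to 1, which is only appropriate for single-label classification.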
When I argsort and look at the top few tags (in spite of the negative values), they seem somewhat correct (on unseen audio data).

This is OK, but a better method is to set a separate threshold for each class.
Is getting an entire array of negative values (which doesn't sum to 1) expected?
This depends on the audio; I don't know whether your setup is correct. Can you input your audio to the colab script above and check whether the logits are all negative?
To check if you have everything set correctly, you should run a test on the AudioSet evaluation set and see if mAP matches with what we reported.
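The mAP check can be done without any extra dependencies. A small sketch of the metric (class-wise average precision, then the mean over classes, as is standard for AudioSet evaluation; the data here is synthetic):

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: mean of the precision values at the ranks
    where positives occur, with predictions sorted by score (labels are 0/1)."""
    order = np.argsort(scores)[::-1]
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    return np.sum(precision * labels) / max(labels.sum(), 1)

def mean_average_precision(score_matrix, label_matrix):
    # mAP over classes (columns): average the per-class APs
    return np.mean([average_precision(score_matrix[:, j], label_matrix[:, j])
                    for j in range(score_matrix.shape[1])])
```

If the number you get on the AudioSet evaluation set is far from the reported 46.6 mAP, something in the pipeline (feature extraction, normalization, checkpoint loading) is likely off.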
-Yuan
Thanks for pointing me to the demo script. Do you have thresholds for each class? Thanks!
No, I don't. This depends on the data. But it would be fine to start with something easy, e.g., sort the logits and output the classes with the largest values (this is what the AST colab script does), or use the same threshold for every class, etc.
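The "sort and take the top few" option is a one-liner. A sketch with randomly generated stand-in logits (in practice you would map the indices to class names via the AudioSet label CSV):

```python
import numpy as np

# Stand-in for the length-527 logit vector from the model
rng = np.random.default_rng(0)
logits = rng.normal(loc=-3.0, scale=1.5, size=527)

k = 5
top_k = np.argsort(logits)[::-1][:k]  # indices of the k largest logits
```

Because the logits are independent, their absolute values don't matter for ranking; only the relative order does, which is why the all-negative tensor can still give sensible top tags.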
If you wish to do this more carefully, you can run inference with the model on your validation set, observe the logits of each class, and decide a threshold for each class.
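One common way to pick per-class thresholds (this is a generic recipe, not something specific to this repo) is to sweep candidate thresholds on validation-set probabilities and keep, for each class, the one with the best F1:

```python
import numpy as np

def per_class_thresholds(val_probs, val_labels,
                         candidates=np.linspace(0.05, 0.95, 19)):
    """Pick, per class, the candidate threshold with the best F1 on a
    validation set. val_probs: (N, C) sigmoid outputs; val_labels: (N, C) 0/1."""
    n, c = val_probs.shape
    thresholds = np.zeros(c)
    for j in range(c):
        best_f1, best_t = -1.0, 0.5
        for t in candidates:
            pred = val_probs[:, j] >= t
            tp = np.sum(pred & (val_labels[:, j] == 1))
            fp = np.sum(pred & (val_labels[:, j] == 0))
            fn = np.sum(~pred & (val_labels[:, j] == 1))
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom > 0 else 0.0
            if f1 > best_f1:
                best_f1, best_t = f1, t
        thresholds[j] = best_t
    return thresholds
```

F1 is just one reasonable criterion; if your application penalizes false positives and false negatives differently, optimize precision or recall at a fixed operating point instead.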
Okay thanks.
Again, if you are only interested in audio tagging, you could start with something simpler, e.g., https://huggingface.co/spaces/yuangongfdu/whisper-at.
The CAV-MAE is mainly for multi-modal applications, though it does have strong audio tagging performance.
Using the following pretrained model for audio tagging (based on the AudioSet ontology):
| Pretrained Model | Pretrain Data | Finetune Data | Performance |
|---|---|---|---|
| CAV-MAE-Scale+ | AudioSet-2M (multi-modal) | AudioSet-2M (audio) | 46.6 mAP |