YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Binarizing output for each audio label in AudioSet(527 classes) #2

Closed anarsultani97 closed 2 years ago

anarsultani97 commented 3 years ago

Hi Yuan,

First of all, I would like to say a huge thanks for your great work!

It would be great if you could share more details about the output values in the Readme.md.

I ran demo.py and got the linear output values (positive and negative). What is the best way to binarize those output values (0: audio label is absent, 1: audio label is present)?

Anar Sultani

YuanGongND commented 3 years ago

Hi Anar,

Thanks for your interest.

Regarding your question: the output of AST is logits, and you can use the sigmoid function to convert them to values in (0, 1). For a single-label classification problem, you can use argmax to get the binary label (i.e., the class with the largest logit is 1, all others are 0). For a multi-label classification problem, you can first apply a sigmoid to the logits and then pick a threshold (e.g., 0.5, or the mean output of the class's audio samples in the validation set) to binarize the output.
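The two cases can be sketched roughly like this (a minimal NumPy sketch with made-up logits; the real demo.py returns a torch tensor, so you would first detach it to NumPy or use the torch equivalents):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical AST logits for a batch of 2 clips over the 527 AudioSet classes.
logits = rng.normal(size=(2, 527))

# Single-label case: the argmax class gets 1, everything else stays 0.
single_label = np.zeros_like(logits)
single_label[np.arange(len(logits)), logits.argmax(axis=1)] = 1

# Multi-label case: squash logits into (0, 1) with a sigmoid, then threshold.
probs = 1.0 / (1.0 + np.exp(-logits))
multi_label = (probs > 0.5).astype(int)  # 1 = label present, 0 = absent
```

The 0.5 threshold is just the naive starting point; as discussed below, a per-class threshold tuned on validation data usually works better.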

-Yuan

YuanGongND commented 3 years ago

Btw, I have updated the readme file to clarify the input and output of the AST model.

anarsultani97 commented 2 years ago

Hi Yuan,

Thanks for the quick and detailed reply and for updating the readme file.

I looked at the AudioSet validation set to calculate the mean value of the class's audio samples, but there are no probability values in the validation set, only positive labels defined segment-wise. I think a static threshold will not work for my case, because each audio label needs a different threshold value for binarization.

I would be happy to hear your opinions about this issue.

Anar

YuanGongND commented 2 years ago

Hi Anar,

Sorry, I didn't make myself clear - I didn't mean using a pre-defined threshold provided by the dataset; the threshold should be model-dependent. What I meant was passing all validation samples through the model, getting the outputs, and using the mean of those outputs to decide the threshold.

Let's say you have a validation set of 100 samples, you are interested in the 'speech' class, and 20 of those samples are labeled as 'speech'. You can feed the 100 validation samples to the AST model, and the output would be of shape [100, 527]. Since you are interested in 'speech', you can take the score at index 0 (because the index of the speech class is 0). You then have 100 scores: compute the average score of the 20 samples labeled as speech, and the average score of the other 80 samples that are not. If the first is 0.8 and the second is 0.1, then you can probably set 0.45 as the threshold.
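The recipe above can be sketched as follows (a toy sketch: the scores are fabricated stand-ins for the sigmoid outputs of AST at the 'speech' index, with the 20/80 split and roughly-0.8/0.1 means mirroring the example; in practice you would repeat this per class):

```python
import numpy as np

rng = np.random.default_rng(0)
# Fabricated validation scores for one class: 20 positives, 80 negatives.
scores = np.concatenate([rng.uniform(0.6, 1.0, 20),   # clips labeled 'speech'
                         rng.uniform(0.0, 0.2, 80)])  # clips without 'speech'
labels = np.concatenate([np.ones(20), np.zeros(80)])

pos_mean = scores[labels == 1].mean()   # ~0.8 in the example above
neg_mean = scores[labels == 0].mean()   # ~0.1 in the example above
threshold = (pos_mean + neg_mean) / 2   # midpoint, ~0.45 in the example

predictions = (scores > threshold).astype(int)
```

The midpoint is only one choice; moving the threshold toward `neg_mean` trades false negatives for false positives, and vice versa.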

The above is just an example; the right threshold really depends on your false positive / false negative tradeoff. Also, calculating standard mAP / AUC metrics doesn't require a binarization step. Setting a threshold is not a problem specific to audio tagging / AST; it appears in many machine learning problems.

-Yuan

anarsultani97 commented 2 years ago

Hi Yuan,

Thanks for the very detailed answer. I really appreciate it.

Best Regards,
Anar