YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Clarification on the Parameters #25

Closed Jozdien closed 2 years ago

Jozdien commented 2 years ago

Hey,

I'm pretty new to audio classification, so could you give some insight into the parameters and stats mentioned in steps 2 - 4 of the "Use Pretrained Model For Downstream Tasks" section? Specifically, I'd like a bit more detail on how to get the normalization stats, and on how the parameters in step 2 (SpecAug and mixup rate) and step 4 should be changed for different kinds of input, or how they affect the model.

YuanGongND commented 2 years ago

Hi there,

The normalization stats are just the mean and standard deviation of the spectrogram values over all samples in the dataset. You can check this issue. Correct normalization is crucial. If your task is AudioSet, you can just use our norm stats, i.e., mean -4.25 and std 4.57.
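A minimal sketch of how such dataset-wide stats could be computed, assuming each sample is already available as a 2-D array of spectrogram values. The function names `dataset_norm_stats` and `normalize` are hypothetical and not part of the repository, and `normalize` is plain standardization; the repository's own data loader may scale differently.

```python
import numpy as np


def dataset_norm_stats(spectrograms):
    """Return (mean, std) over every time-frequency bin of every spectrogram.

    `spectrograms` is any iterable of 2-D arrays (hypothetical input format).
    Sums are accumulated one sample at a time, so the whole dataset never has
    to be held in memory at once.
    """
    count = 0
    total = 0.0
    total_sq = 0.0
    for spec in spectrograms:
        spec = np.asarray(spec, dtype=np.float64)
        count += spec.size
        total += spec.sum()
        total_sq += np.square(spec).sum()
    if count == 0:
        raise ValueError("no spectrogram values were provided")
    mean = total / count
    # Population variance via E[x^2] - E[x]^2, clamped against rounding error.
    var = max(total_sq / count - mean ** 2, 0.0)
    return float(mean), float(np.sqrt(var))


def normalize(spec, mean, std):
    """Standardize one spectrogram with the dataset-level stats."""
    return (np.asarray(spec, dtype=np.float64) - mean) / std
```

The stats should come from the training split only and then be reused unchanged for validation and test data.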

SpecAug and mixup rate are hyperparameters that you need to tune for your task. You can check Sections IV.B and IV.C of our PSLA paper for details. You can also set both to 0 for your first model. They won't dramatically change the model performance.
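To make the two augmentations concrete, here is a rough sketch and not the repository's implementation. It assumes the mixup rate is the probability that a sample gets mixed with another one, and that the SpecAug parameters are maximum mask widths along frequency and time. The names `maybe_mixup` and `mask_spectrogram`, the `alpha` default, and the `(frames, bins)` layout are all assumptions made for illustration.

```python
import numpy as np


def maybe_mixup(spec_a, label_a, spec_b, label_b, mixup_rate, rng, alpha=10.0):
    """With probability `mixup_rate`, blend two samples and their labels.

    The blend weight is drawn from Beta(alpha, alpha); `alpha` is an assumed
    default, not a value taken from the thread.
    """
    if rng.random() >= mixup_rate:
        return spec_a, label_a  # a rate of 0 always takes this branch
    lam = rng.beta(alpha, alpha)
    mixed_spec = lam * spec_a + (1.0 - lam) * spec_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_spec, mixed_label


def mask_spectrogram(spec, max_freq_mask, max_time_mask, rng, fill=0.0):
    """SpecAugment-style masking of one (frames, bins) spectrogram.

    One frequency band and one time span of random width (up to the given
    maxima) are overwritten with `fill`. After normalization, 0.0 is the
    dataset mean. Maxima of 0 leave the input unchanged.
    """
    out = np.array(spec, dtype=np.float64, copy=True)
    n_frames, n_bins = out.shape

    width_f = min(int(rng.integers(0, max_freq_mask + 1)), n_bins)
    start_f = int(rng.integers(0, n_bins - width_f + 1))
    out[:, start_f:start_f + width_f] = fill

    width_t = min(int(rng.integers(0, max_time_mask + 1)), n_frames)
    start_t = int(rng.integers(0, n_frames - width_t + 1))
    out[start_t:start_t + width_t, :] = fill
    return out
```

With `mixup_rate=0` and both mask maxima set to 0, the two functions return their input unchanged, which corresponds to the "set both to 0 for your first model" baseline.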

-Yuan

Jozdien commented 2 years ago

I see, thank you for the explanation!