YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Different train vs. (val/test) spectrogram shape (recording durations) #65

Open danihinjos opened 2 years ago

danihinjos commented 2 years ago

Hello! First of all, congratulations on your amazing work. I'm doing my MSc Thesis on audio classification (respiratory disease diagnosis from lung sounds). My main objective is to improve the generalization ability of the current model (based on PANNs), so your model might just be ideal for me.

However, I'm struggling with some things when adapting it to my case. I have 30s recordings, but the original approach, which I am trying to replicate, was to train on 5s clips while validating/evaluating on whole recordings. AST is supposed to accept variable-shape spectrograms, but I cannot seem to find a way to perform this experiment. Could you help me out here? :)

Thank you in advance!

YuanGongND commented 2 years ago

Hi there,

Thanks for your interest.

The transformer itself accepts variable-length input, but that requires some engineering (e.g., bucketing sequences of similar length). We didn't implement it in the code. What we meant in the paper is that AST can be applied to tasks with different lengths, but within each task the input length is fixed.

Besides that, there's another thing to note: training with 5s data and testing with 30s data would cause a mismatch that could hurt performance. So it might be better to trim your audio to the same length for both the training and test sets (e.g., 10s, input_tdim=1024), and do majority voting over the test set.
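In case it helps, here is a minimal sketch of what "the same length" means in practice, assuming fbank features of shape (time, n_mels); the helper name is hypothetical, and the dataloader in the repo does similar padding/trimming internally:

```python
import torch

def pad_or_trim(fbank: torch.Tensor, target_len: int = 1024) -> torch.Tensor:
    """Force a (time, n_mels) fbank to exactly target_len frames (hypothetical helper)."""
    n_frames = fbank.shape[0]
    if n_frames < target_len:
        # zero-pad the time dimension at the end
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, target_len - n_frames))
    else:
        # trim longer clips to the first target_len frames
        fbank = fbank[:target_len, :]
    return fbank
```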

-Yuan

danihinjos commented 2 years ago

Thank you for your quick response!

I see, I was misunderstanding what the paper said then. I don't understand what you mean by 'majority voting' in my test set, but I'll just decide on an audio length for both sets and I'll go with it.

I just ran my first test with 5s and the performance seems to be worse. I adjusted the LR and epochs for it to work properly, but I am still using my own pipeline (which contains only some PSLA techniques, no spectrogram normalization, audio sampled at 4 kHz instead of 16 kHz…). I did that because I wanted a fair comparison and I thought that, even if the pipeline wasn't perfect for AST, it would still be able to at least reach current performances. Do you believe that this context might not be beneficial for AST, and that's why it's not outperforming the current PANNs-based model? I would love to know your thoughts on this :)

danihinjos commented 2 years ago

I do have an additional question regarding spectrogram normalization… In the code it seems that you are doing this normalization based on statistics (mean, std) extracted only from the training set. Is that correct or should the stats be from the whole dataset?

Also, are you normalizing over filter banks or over spectrograms? I generate spectrograms with torchaudio.transforms.MelSpectrogram; would it be equivalent to just normalize these spectrograms?

Thanks in advance!

YuanGongND commented 2 years ago

Hi Daniel,

There are a few things.

I don't understand what you mean by 'majority voting' in my test set, but I'll just decide on an audio length for both sets and I'll go with it.

Majority voting means you crop your 30s audio into six 5s segments, get a prediction for each of them, and then average the scores to get a single score for the full 30s recording.
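A minimal sketch of that averaging, assuming a model that takes a (batch, time, n_mels) fbank like the AST model; the names here are hypothetical:

```python
import torch

def predict_long_clip(model, fbank_full, seg_frames=512):
    """Score one long recording by averaging over fixed-length crops (hypothetical helper).

    fbank_full: (total_frames, n_mels) features for the whole recording.
    seg_frames: frames per crop; must match the input_tdim the model was trained with.
    """
    model.eval()
    scores = []
    with torch.no_grad():
        for start in range(0, fbank_full.shape[0] - seg_frames + 1, seg_frames):
            seg = fbank_full[start:start + seg_frames].unsqueeze(0)  # (1, time, n_mels)
            # use softmax for single-label tasks, sigmoid for multi-label
            scores.append(torch.softmax(model(seg), dim=-1))
    return torch.stack(scores).mean(dim=0)  # one averaged score per class
```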

I just ran my first test with 5s and the performance seems to be worse. I adjusted the LR and epochs for it to work properly, but I am still using my own pipeline (which contains only some PSLA techniques, no spectrogram normalization, audio sampled at 4 kHz instead of 16 kHz…). I did that because I wanted a fair comparison and I thought that, even if the pipeline wasn't perfect for AST, it would still be able to at least reach current performances. Do you believe that this context might not be beneficial for AST, and that's why it's not outperforming the current PANNs-based model? I would love to know your thoughts on this :)

Two things greatly hurt the AST performance. (1) AST is pretrained on 16kHz AudioSet while PANNs is pretrained on 41kHz audio, and your data is 41kHz; the mismatch in sampling rate greatly impacts the performance of the pretrained model. (2) Spectrogram normalization: you should always normalize the input for the AST model. Specifically, if you use the AudioSet pretrained model, please use the same norm stats as us (i.e., [-4.2677393, 4.5689974] for mean and std, respectively).
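For concreteness, the normalization with those stats is just the following (a sketch; the dataloader in the repo is the reference, and note it divides by twice the std):

```python
# z-normalize the fbank with the AudioSet stats quoted above
norm_mean, norm_std = -4.2677393, 4.5689974
fbank = (fbank - norm_mean) / (norm_std * 2)  # the *2 follows the repo's dataloader convention
```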

In the code it seems that you are doing this normalization based on statistics (mean, std) extracted only from the training set. Is that correct or should the stats be from the whole dataset?

For a fair evaluation, we assume we have not seen any test data, including its statistics, so using the training-set statistics is the correct way to do it. But in practice, there won't be a difference.

If you use our AudioSet pretrained model for speech tasks, you can use the same norm stats as us.

Also, are you normalizing over filter banks or over spectrograms? I generate spectrograms with torchaudio.transforms.MelSpectrogram; would it be equivalent to just normalize these spectrograms?

Here is another mismatch: it appears that you are using spectrogram rather than fbank features. While the two are not equivalent, I think you can just normalize your spectrograms, but you need to calculate the norm stats yourself.
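A minimal sketch of computing those stats over the training set only (assuming Kaldi-style fbank features via torchaudio; the function name and parameter choices here are illustrative, not the repo's exact utility):

```python
import torch
import torchaudio

def training_set_norm_stats(wav_paths, num_mel_bins=128):
    """Estimate mean/std of the features over the training files only (hypothetical helper)."""
    feats = []
    for path in wav_paths:
        waveform, sr = torchaudio.load(path)
        fbank = torchaudio.compliance.kaldi.fbank(
            waveform, sample_frequency=sr, num_mel_bins=num_mel_bins,
            htk_compat=True, window_type='hanning', frame_shift=10.0)
        feats.append(fbank.reshape(-1))
    feats = torch.cat(feats)
    return feats.mean().item(), feats.std().item()
```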

Overall, to get the full power of AST, I'd suggest downsampling your audio to 16kHz, using input_tdim=1024, and adapting one of our recipes (I suggest ESC-50) to your task.
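Adapting the recipe might look roughly like this (a sketch based on the constructor arguments shown in the repo README; label_dim=2 is a placeholder for your task, and the exact import path depends on the directory you run from):

```python
import torch
from models import ASTModel  # lives under src/models in the repo

# ~10s of audio at a 10ms frame shift -> roughly 1024 frames
ast_mdl = ASTModel(label_dim=2,           # placeholder: number of classes in your task
                   input_fdim=128,        # mel bins
                   input_tdim=1024,       # time frames
                   imagenet_pretrain=True,
                   audioset_pretrain=True)

dummy = torch.zeros(1, 1024, 128)         # (batch, time, freq)
logits = ast_mdl(dummy)                   # (1, label_dim)
```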

-Yuan

danihinjos commented 2 years ago

Hey there!

Thank you so much for all of your useful advice and insights. I have been trying different things, and indeed in some cases AST outperforms my current model (not really when resampling at 16kHz and/or using the AudioSet pretrained model, though, but rather when adjusting hyperparameters, switching to fbanks, normalizing them...).

I wanted to go a step further and try out your new knowledge distillation (CMKD) method, but I noticed the code is not published yet. Would there be any way that I could get access to the code (or maybe some hints on how to properly implement it)? I know you'll release it at some point but this month is really my last month to try stuff for my master thesis, so I would really appreciate it.

Thanks in advance!

YuanGongND commented 2 years ago

Hi Daniel,

I have been trying different things, and indeed in some cases AST outperforms my current model (not really when resampling at 16kHz and/or using the AudioSet pretrained model, though, but rather when adjusting hyperparameters, switching to fbanks, normalizing them...).

Input normalization is crucial if you use the ImageNet pretrained model. I am a bit surprised that AudioSet pretraining and 16kHz resampling don't help, though; I guess the reason could be some mismatch in normalization or something else between our pipelines. But if you don't use (our) AudioSet pretrained model, there's no need to resample to 16kHz.
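If you do want to try the AudioSet pretrained weights again, the resampling itself is a one-liner in torchaudio (the file name below is just a placeholder):

```python
import torchaudio

waveform, sr = torchaudio.load('lung_sound.wav')  # placeholder path
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
    sr = 16000
```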

I wanted to go a step further and try out your new knowledge distillation (CMKD) method, but I noticed the code is not published yet. Would there be any way that I could get access to the code (or maybe some hints on how to properly implement it)? I know you'll release it at some point but this month is really my last month to try stuff for my master thesis, so I would really appreciate it.

This is a Journal submission and needs a long time to get a decision, so I wouldn't recommend waiting, especially since you have a deadline for your thesis. We need institute approval for code release, which usually happens after the paper gets accepted.

I basically used a revised version of the PSLA CNN model (https://github.com/YuanGongND/psla) (with the attention part removed, to make it a pure CNN model) as the teacher for the AST model and applied standard knowledge distillation. I think I provided all the implementation details in the paper. More practically: you said "in some cases AST outperforms my current model" - what is your current model? If it is a CNN model, you could try using it as the teacher for the AST model (basically, use the soft predictions of your current model as the target for AST; note the temperature setting). Does that sound reasonable?
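For what it's worth, a minimal sketch of that "standard KD" objective (the temperature T and weight lam below are placeholders; the paper reports the settings actually used):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.5, lam=0.5):
    """Hard-label cross-entropy mixed with a temperature-softened teacher target (sketch)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction='batchmean') * (T * T)  # T^2 keeps the gradient scale comparable
    return (1 - lam) * hard + lam * soft
```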

Good luck with your thesis.

-Yuan

danihinjos commented 2 years ago

Hi there!

Input normalization is crucial if you use ImageNet pretrained model. I am a bit surprised that AudioSet pretraining and 16kHz resampling don't work though, I guess the reason could be some mismatch in normalization or something else between our pipeline. But if you don't use (our) AudioSet pretrained model, there's no need to resample to 16kHz.

Indeed, I was a bit surprised that AudioSet pretraining with data resampled at 16kHz didn't work. I'm now trying to use your recipe/pipeline directly in case there are some mismatches, but I'm having some issues with memory, data allocation, etc. I hope to crack it soon. For the moment, it seems that using only ImageNet pretraining is best, but the improvements are neither that large nor robust (some metrics increase whereas others decrease... and the medical domain is a sensitive one for classification). I will try to further tweak the hyperparameters and to explore other methods from your implementation, such as balanced sampling, mixup augmentation, and ensembles, which I didn't include before.
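Something like this is what I have in mind for the mixup part (just a sketch; alpha is a guess on my side, I'll check the value your recipes actually use):

```python
import numpy as np

def spec_mixup(fbank_a, fbank_b, label_a, label_b, alpha=10.0):
    """Mix two spectrograms and their (one-hot or soft) labels with a Beta-sampled weight."""
    lam = np.random.beta(alpha, alpha)
    fbank = lam * fbank_a + (1 - lam) * fbank_b
    label = lam * label_a + (1 - lam) * label_b  # soft, mixed target
    return fbank, label
```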

This is a Journal submission and needs a long time to get a decision, so I wouldn't recommend waiting, especially since you have a deadline for your thesis. We need institute approval for code release, which usually happens after the paper gets accepted.

I basically used a revised version of the PSLA CNN model (https://github.com/YuanGongND/psla) (with the attention part removed, to make it a pure CNN model) as the teacher for the AST model and applied standard knowledge distillation. I think I provided all the implementation details in the paper. More practically: you said "in some cases AST outperforms my current model" - what is your current model? If it is a CNN model, you could try using it as the teacher for the AST model (basically, use the soft predictions of your current model as the target for AST; note the temperature setting). Does that sound reasonable?

I completely understand, it was a long shot. My current model is a CNN-Attention model based on PANNs, but it's much simpler than the architecture used in PSLA (it only has one attention head, it's not based on any standard CNN, it doesn't use ImageNet pretraining...). My idea for my thesis was to: 1) test AST, expecting improvements; 2) optionally, test the PSLA CNN-Attention model (replacing my current CNN-Attention model); 3) test knowledge distillation with AST and either my CNN-Attention model or PSLA's model. Does that sound right?

I will try to implement KD on my own; I hope I don't mess up the loss function or something like that. I just have some questions, though: does the teacher model need to already be trained when using it in the KD training process? Also, do the teacher and the student need to have the same input and the same hyperparameters throughout the KD training process?

Again, thank you so much for your time and dedication, and for your kind words!

YuanGongND commented 2 years ago

I just have some questions, though: does the teacher model need to already be trained when using it in the KD training process?

We always use a pretrained teacher because we want the teacher to be strong.

Also, do the teacher and the student need to have the same input and the same hyperparameters throughout the KD training process?

We use the same input. They can have different hyperparameters, but we keep them as consistent as possible in the paper; we just want to show that it is the model architecture, not the hyperparameters, that leads to the performance improvement.
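So a single training loop would look roughly like this (a sketch using the hypothetical kd_loss helper from my earlier comment; the teacher stays frozen and only the student is updated):

```python
import torch

def kd_train_epoch(student, teacher, train_loader, optimizer):
    """One epoch of student training against a frozen, pretrained teacher (sketch)."""
    teacher.eval()
    student.train()
    for fbank, labels in train_loader:
        with torch.no_grad():
            teacher_logits = teacher(fbank)      # soft targets from the same input batch
        student_logits = student(fbank)
        loss = kd_loss(student_logits, teacher_logits, labels)  # helper sketched above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```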