YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Inquiry Regarding Audio Spectrogram Transformer #128

Open Ingram-lin opened 7 months ago

Ingram-lin commented 7 months ago

I am a graduate student from China, and our team recently had the privilege of studying your article on the 'Audio Spectrogram Transformer'. We were truly impressed by the content and scope of your work, and it has sparked a great deal of interest within our team. Following our admiration for your research, we endeavored to replicate your work on the ESC-50 dataset. However, as we proceeded to fine-tune the model using our own dataset, we encountered several challenges. We would greatly appreciate your guidance and assistance in navigating these challenges.

1. Our dataset consists of 2400 samples, and each audio clip is 4 seconds long. We set the audio_length parameter to 400 and timem to 80. We replaced the labels while keeping the rest of the recipe consistent with ESC-50. We downloaded a model pre-trained on AudioSet and followed the same process as when replicating ESC-50. We are pleased with the final result: the accuracy reaches 0.9. However, what surprises us is that the average precision is only between 0.3 and 0.5. Why could this be?

2. We understand that your work involves projecting spectrograms to embeddings. (Please forgive us if our understanding is incorrect.) After fine-tuning the model, we want to process new speech data and obtain its embeddings. Could you please guide us on how to do this?

3. For example, if we want to fine-tune a pre-trained model with an English dataset and then validate the fine-tuned model with a Chinese dataset, can we set the training set as the English dataset and the validation set as the Chinese dataset during the fine-tuning process?

YuanGongND commented 7 months ago

> 1. Our dataset consists of 2400 samples, and each audio clip is 4 seconds long. We set the audio_length parameter to 400 and timem to 80. We replaced the labels while keeping the rest of the recipe consistent with ESC-50. We downloaded a model pre-trained on AudioSet and followed the same process as when replicating ESC-50. We are pleased with the final result: the accuracy reaches 0.9. However, what surprises us is that the average precision is only between 0.3 and 0.5. Why could this be?

There could be many reasons, and I have neither the time nor enough information to debug this. One possibility: ESC-50 is balanced, so accuracy is a good metric there, but your dataset might be imbalanced. In that case the model can become biased toward the majority class, and you would need to turn on class balancing, etc. Accuracy is not a good measure when the dataset is not balanced.
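
As a rough illustration of the imbalance point, here is a toy sketch in plain Python. The labels and scores are made up for this example and are not taken from the AST code or from your data. It assumes the reported average precision is a one-vs-rest AP computed per class and then averaged over classes. The helper `average_precision` is a hypothetical, simplified implementation written only for this sketch.

```python
def average_precision(targets, scores):
    """AP for one class: mean precision at each rank where a positive appears."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, idx in enumerate(order, start=1):
        if targets[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Made-up imbalanced data: 90 clips of class 0, 10 clips of class 1.
labels = [0] * 90 + [1] * 10

# Made-up scores from a model biased toward class 0: the class-1 score is
# always below 0.5, and the true class-1 clips happen to get the lowest ones.
score_class1 = [0.4 - 0.003 * i for i in range(100)]
score_class0 = [1.0 - s for s in score_class1]

# Accuracy: take the argmax over the two class scores.
predictions = [0 if s0 >= s1 else 1 for s0, s1 in zip(score_class0, score_class1)]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Mean average precision: one-vs-rest AP per class, then the unweighted mean.
ap_class0 = average_precision([1 if y == 0 else 0 for y in labels], score_class0)
ap_class1 = average_precision([1 if y == 1 else 0 for y in labels], score_class1)
mean_ap = (ap_class0 + ap_class1) / 2

print(f"accuracy = {accuracy:.2f}, AP = {ap_class0:.2f} / {ap_class1:.2f}, mAP = {mean_ap:.2f}")
```

In this toy case every clip is assigned to the majority class, so accuracy is 0.90, while the minority class is ranked poorly and pulls the mean average precision down to roughly 0.4. Checking per-class AP and the class counts of your own dataset would show whether something similar is happening.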

> 2. We understand that your work involves projecting spectrograms to embeddings. (Please forgive us if our understanding is incorrect.) After fine-tuning the model, we want to process new speech data and obtain its embeddings. Could you please guide us on how to do this?

If you wish to get the last layer embedding, you should let the model return x right after this line (the model needs to be a trained model).

https://github.com/YuanGongND/ast/blob/31088be8a3f6ef96416145c4b8d43c81f99eba7a/src/models/ast_models.py#L184
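
A minimal sketch of the "return x before the head" idea, using PyTorch. `ToyASTLike` is a hypothetical stand-in written for this example, not the repository's `ASTModel`. It only mimics the tail of the forward pass, on the assumption that the linked line is the pooling step just before the classification head. The `return_embedding` flag, the layer sizes, and the input shape are assumptions for illustration.

```python
import torch
import torch.nn as nn


class ToyASTLike(nn.Module):
    """Hypothetical stand-in that mimics only the end of an AST-style forward pass."""

    def __init__(self, embed_dim=768, num_classes=50):
        super().__init__()
        # Stand-in for the patch embedding and transformer blocks.
        self.proj = nn.Linear(128, embed_dim)
        self.norm = nn.LayerNorm(embed_dim)
        # Classification head applied on top of the pooled embedding.
        self.mlp_head = nn.Sequential(nn.LayerNorm(embed_dim), nn.Linear(embed_dim, num_classes))

    def forward(self, x, return_embedding=False):
        # x: (batch, time_frames, freq_bins)
        x = self.norm(self.proj(x))        # (batch, tokens, embed_dim)
        x = (x[:, 0] + x[:, 1]) / 2        # pool the first two tokens into one vector
        if return_embedding:
            return x                       # stop here: this is the embedding
        return self.mlp_head(x)            # otherwise continue to class logits


model = ToyASTLike()
model.eval()                               # disable dropout etc. for inference
with torch.no_grad():
    fbank = torch.randn(2, 400, 128)       # made-up batch: 2 clips, 400 frames, 128 mel bins
    emb = model(fbank, return_embedding=True)
    logits = model(fbank)

print(emb.shape, logits.shape)
```

With the real model, the same kind of early return would go into the forward method in the file linked above, after your fine-tuned weights have been loaded. How the checkpoint is loaded depends on how it was saved, so that step is left out here.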

> 3. For example, if we want to fine-tune a pre-trained model with an English dataset and then validate the fine-tuned model with a Chinese dataset, can we set the training set as the English dataset and the validation set as the Chinese dataset during the fine-tuning process?

That is totally possible. You can simply prepare two datasets and use them in place of the original dataset in our recipe. But you might lose some performance due to the training/test mismatch. This will be true for all models, not just AST.

-Yuan

Ingram-lin commented 6 months ago

Thank you very much for your help. I have a new question. I want to extract 768-dimensional features from new speech data. I found this feature extraction method (shown in the attached screenshot). How can I perform feature extraction using my own fine-tuned model? Due to my limited technical skills, I don't know how to proceed. Can you help me?