YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Performance issues with recorded voices #96

Open milad-s5 opened 1 year ago

milad-s5 commented 1 year ago

Hello,

I have tested your model on some recorded voices, but unfortunately the results have not been satisfactory. Specifically, when I tested the model on samples from unseen datasets, such as a gunshot dataset from Kaggle, I got good results. However, when I recorded sounds with my phone's microphone and passed them to the model, the predictions were wrong. The recordings sound clear to the human ear, and I even tried several different microphones. I also tried techniques such as amplifying the audio and removing noise, but the results stayed the same. Finally, I even used sox to resample the sounds to 16 kHz in Python.

Could you please provide me with some assistance on how to address this issue?

[image]

YuanGongND commented 1 year ago

Hi there,

First,

Finally, I even used sox to resample the sounds to 16KHz in Python.

This is actually a very basic and mandatory step: our model is pretrained on 16 kHz audio, so any other sampling rate will cause the model to fail completely.
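
To make the resampling requirement concrete, here is a minimal sketch of what resampling to 16 kHz means, using naive linear interpolation over a numpy array. This is for illustration only; in practice you should use sox (as above) or `torchaudio.transforms.Resample`, which apply proper anti-aliasing filters.

```python
import numpy as np

def resample_to_16k(x, sr_in, sr_out=16000):
    """Naive linear-interpolation resampler (illustration only).

    For real use, prefer sox or torchaudio.transforms.Resample,
    which apply proper anti-aliasing filters before decimation.
    """
    n_out = int(round(len(x) * sr_out / sr_in))
    t_in = np.arange(len(x)) / sr_in      # original sample times (s)
    t_out = np.arange(n_out) / sr_out     # target sample times (s)
    return np.interp(t_out, t_in, x).astype(np.float32)

# one second of a 440 Hz tone recorded at 44.1 kHz (typical phone rate)
sr = 44100
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
resampled = resample_to_16k(tone, sr)
print(len(resampled))  # 16000
```

Whatever tool you use, the point is the same: the waveform handed to AST must have exactly 16,000 samples per second.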

Second, which script did you use for inference? I recommend using this notebook: https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb#scrollTo=sapXfOwbhrzG and changing the line:

sample_audio_path = 'https://www.dropbox.com/s/vddohcnb9ane9ag/LDoXsip0BEQ_000177.flac?dl=1'

to your audio path.

Third, if it still fails, please run sox --i <filename> on your audio and paste the output here. The model works best when your audio is ~10 seconds long, since that matches the training setting. By the way, what is the ground-truth label of your audio?
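
If sox is not handy, the same sanity check can be done in Python for WAV files with the standard-library `wave` module. This is a rough equivalent of `sox --i`, sketched here with a synthetic demo file; for AST you want to see 1 channel and a 16000 Hz sample rate.

```python
import wave

def audio_info(path):
    """Rough Python equivalent of `sox --i` for WAV files."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate": rate,
            "bit_depth": 8 * w.getsampwidth(),
            "duration_s": frames / rate,
        }

# demo: write a 2-second, 16 kHz, mono, 16-bit WAV of silence and inspect it
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 2 bytes = 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 32000)

print(audio_info("demo.wav"))
# {'channels': 1, 'sample_rate': 16000, 'bit_depth': 16, 'duration_s': 2.0}
```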

-Yuan

milad-s5 commented 1 year ago

I have done the first and second steps and obtained correct results for an actual 2-second sound. Here is some information about the recorded sound before and after resampling. The ground truth for this recording is a machine gun.

[image]

YuanGongND commented 1 year ago

So it is two-channel audio; try this:

sox sound-16k.wav leftchannel.wav remix 1 and sox sound-16k.wav rightchannel.wav remix 2. Listen to both, then try the AST model again on each single-channel file.
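
For reference, what those two sox commands do can be sketched in a few lines of numpy: a stereo WAV stores samples interleaved as L, R, L, R, ..., and `remix 1` / `remix 2` simply pick out every other sample. The tiny input list below is made up for illustration.

```python
import numpy as np

def split_stereo(interleaved):
    """Split an interleaved stereo signal (L, R, L, R, ...) into two
    mono channels -- the numpy analogue of `sox ... remix 1` / `remix 2`."""
    x = np.asarray(interleaved).reshape(-1, 2)  # one row per (L, R) frame
    return x[:, 0], x[:, 1]

left, right = split_stereo([1, 10, 2, 20, 3, 30])
print(left.tolist(), right.tolist())  # [1, 2, 3] [10, 20, 30]
```

Either single-channel result (or their average, for a mono downmix) is the shape of input AST expects.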

-Yuan

milad-s5 commented 1 year ago

There has been no change, and both channels sound fine.

leftchannel: [image]

rightchannel: [image]

I have attached all of the sound files and will send you an email through Gmail for your review. Thank you.

YuanGongND commented 1 year ago

It seems reasonable to me that sound effect is the predicted class for your audio. The limitation of this model is its training data: it may never have heard an AK-47 during training. The "sound effect" class contains some continuous explosion sounds, which could be close to your audio. I suspect all models trained on AudioSet exhibit this behavior to some degree, though some other models might generalize better.

Practical suggestions:

  1. If you aim to get a high number on a closed set (e.g., for a Kaggle challenge), try finetuning the model (see the ESC50 recipe) on your data. This will dramatically improve the number, but the model's performance on real-world data is not guaranteed.
  2. If you aim to build an app or other software for real-world usage, try merging the scores of classes that are close to gunshot (or whatever classes you are interested in), e.g., merging the sound effect score into the gunshot class. See https://github.com/YuanGongND/ast/blob/master/egs/audioset/data/class_labels_indices.csv for the full label set. The implementation would be relatively easy.
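
The second suggestion could be sketched as follows. The label names and scores here are illustrative placeholders; the real label set comes from egs/audioset/data/class_labels_indices.csv, and which classes to fold together is an application-level choice.

```python
def merge_scores(scores, merge_map):
    """Fold the scores of related classes into a single target class.

    scores: dict mapping label name -> model score for one clip
    merge_map: dict mapping source label -> target label it merges into
    (Label names below are illustrative; take the real set from
    egs/audioset/data/class_labels_indices.csv.)
    """
    merged = {}
    for label, s in scores.items():
        target = merge_map.get(label, label)  # unmapped labels keep their own name
        merged[target] = merged.get(target, 0.0) + s
    return merged

# made-up scores for one clip
scores = {"Gunshot, gunfire": 0.25, "Sound effect": 0.50, "Speech": 0.25}
merged = merge_scores(scores, {"Sound effect": "Gunshot, gunfire"})
print(merged)  # {'Gunshot, gunfire': 0.75, 'Speech': 0.25}
```

Summing scores is one simple policy; taking the max of the merged classes is another, and which works better depends on your data.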

-Yuan