milad-s5 opened 1 year ago
Hi there,
First, regarding this line from your post:
"Finally, I even used sox to resample the sounds to 16KHz in Python."
This is a very basic and mandatory step, because our model is pretrained on 16 kHz data; any other sampling rate will cause the model to fail completely.
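For readers who want to see what resampling does conceptually, here is a minimal stdlib-only Python sketch using plain linear interpolation. This is only an illustration of the idea; sox (and librosa) apply proper anti-aliasing filters and should be used in practice. The sample values and rates here are illustrative.

```python
# Minimal sketch: linearly interpolate a list of int16 samples to 16 kHz.
# No anti-aliasing; for real work use sox or librosa instead.
TARGET_RATE = 16000  # AST is pretrained on 16 kHz audio

def resample_to_16k(samples, src_rate, dst_rate=TARGET_RATE):
    """Return samples linearly interpolated from src_rate to dst_rate."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index in the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(int(round(samples[lo] * (1 - frac) + samples[hi] * frac)))
    return out
```

A 44.1 kHz clip of 441 samples, for example, maps to 160 samples at 16 kHz.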
Second, which script did you use for inference? I recommend using this one: https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb#scrollTo=sapXfOwbhrzG and changing the line:
sample_audio_path = 'https://www.dropbox.com/s/vddohcnb9ane9ag/LDoXsip0BEQ_000177.flac?dl=1'
to your audio path.
Third, if it still fails, please run sox --i <filename>
to check your audio and paste the output here. The model works best when your audio is ~10 seconds long, since that matches the training setting. By the way, what is the ground-truth label of your audio?
-Yuan
I have done the first and second steps and obtained correct results for an actual 2-second sound. Here is some information about the original and resampled recorded sound. The ground truth for this recording is a machine gun.
So it is two-channel audio; try this:
sox sound-16k.wav leftchannel.wav remix 1
and sox sound-16k.wav rightchannel.wav remix 2
. Listen to both channels, then run the AST model again on the single-channel audio.
-Yuan
The results are unchanged, and both channels sound fine. leftchannel:
rightchannel:
I have attached all of the sound files and will send you an email through Gmail for your review. Thank you.
It seems reasonable to me that sound effect is the predicted class for your audio. The limitation of this model is its training data; it might never have heard an AK during training. The "sound effect" class contains some continuous explosion sounds, which could be close to your audio. I suspect all models trained on AudioSet show this behavior to some extent, but some other models might generalize better.
Practical suggestions:
-Yuan
Hello,
I have attempted to test your model on some recorded sounds, but unfortunately, the results have not been satisfactory. Specifically, when I tested the model using samples from unseen datasets, such as the gunshot dataset from Kaggle, I achieved good results. However, when I recorded sounds with my phone's microphone and passed them to the model, the results were incorrect. The recordings themselves were clear to the human ear, and I even tried different microphones. Additionally, I attempted various techniques such as amplifying the sounds and removing noise, but the results remained the same. Finally, I even used sox to resample the sounds to 16KHz in Python.
Could you please provide me with some assistance on how to address this issue?