YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".

Dealing with different audio lengths #92

Open OhadCohen97 opened 1 year ago

OhadCohen97 commented 1 year ago

Hi Yuan and everyone,

Thank you for your excellent work!

I have some questions and would be happy for some help.

I have audio datasets with clips of different lengths (i.e., ~1.5s to ~15s). I am trying to classify 4 classes.

I decided to take audio up to 5 sec (input_tdim=512, mel_bins=128) so that all inputs to the AST have the same shape. I am using the AudioSet-pretrained model with the same preprocessing (fbank), normalization, and loss ('BCEWithLogitsLoss') as in your code.

When I plotted the filterbank features, I saw that some of the samples contained zeros in half of the plot. I am wondering whether the AST can actually learn from that zero padding?

After training for 10-20 epochs, the highest accuracy I reached was 60%, and I think the AST can achieve much more. I tried different lengths and played with the 'input_tdim' and 'mel_bins' parameters to try to get better results (unfortunately, I didn't see any improvement).

Is there a different approach to representing the varying lengths so that the AST can learn better?

Thank you, Ohad

YuanGongND commented 1 year ago

Hi there,

> I have audio datasets with clips of different lengths (i.e., ~1.5s to ~15s). I am trying to classify 4 classes. I decided to take audio up to 5 sec (input_tdim=512, mel_bins=128) so that all inputs to the AST have the same shape.

What's the mean length of the audio? If the majority is around 5s, then your setting is reasonable. Also, please check the sampling rate of the audio with `wav, sr = torchaudio.load(audio_path)` and make sure sr is 16kHz.
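For example (a minimal sketch; `audio_path` stands in for one of your files):

```python
import torchaudio

wav, sr = torchaudio.load(audio_path)  # wav: (channels, samples)
print(sr)                              # the pretrained AST assumes 16kHz

# If a clip is not 16kHz, resample it before computing the fbank features
if sr != 16000:
    wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(wav)
```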

> loss 'BCEWithLogitsLoss' as in your code.

If your task is single-label classification, you could consider using cross-entropy loss; BCE loss is mainly for multi-label classification.
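For example (a sketch; `logits` and `targets` are placeholders for your model output and integer labels):

```python
import torch.nn as nn

# Single-label, 4-class setup: CrossEntropyLoss takes raw logits of shape
# (batch, num_classes) and integer class indices of shape (batch,)
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)  # logits: (B, 4), targets: (B,) in {0,...,3}
```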

> When I plotted the filterbank features, I saw that some of the samples contained zeros in half of the plot. I am wondering whether the AST can actually learn from that zero padding?

Zero padding is fine. Hopefully your audio's length is not a strong feature for classification; otherwise, the model could just use the audio-length information for classification and report over-optimistic results.
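For reference, the fixed-shape input can be produced roughly the way our dataloader does it, by zero-padding or truncating the fbank to input_tdim (a sketch assuming a mono 16kHz clip; normalization is omitted):

```python
import torch
import torchaudio

waveform, sr = torchaudio.load(audio_path)  # mono, 16kHz assumed
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)

target_length = 512                 # input_tdim
p = target_length - fbank.shape[0]
if p > 0:
    fbank = torch.nn.ZeroPad2d((0, 0, 0, p))(fbank)  # zero-pad the time axis
else:
    fbank = fbank[:target_length, :]                 # truncate long clips
```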

> After training for 10-20 epochs, the highest accuracy I reached was 60%, and I think the AST can achieve much more. I tried different lengths and played with the 'input_tdim' and 'mel_bins' parameters to try to get better results (unfortunately, I didn't see any improvement).

If your audio's sampling rate is 16kHz, then changing the number of mel bins can only make things worse, as the AST model is pretrained on 16kHz data with 128 mel bins. I feel the most important hyperparameter might be the learning rate. Also, note that timem (how much of the time axis is masked during training as augmentation) should be adjusted along with audio_length; you can set it to 0 for preliminary experiments.
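If you do keep masking on, our dataloader applies torchaudio's masking transforms to the (mel, time) view of the fbank; the mask sizes below are illustrative only:

```python
import torchaudio

# Illustrative values: keep timem well under the number of real (unpadded)
# frames; time_mask_param=0 disables time masking entirely
freqm = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)
timem = torchaudio.transforms.TimeMasking(time_mask_param=96)

spec = fbank.transpose(0, 1).unsqueeze(0)  # (1, mel, time), as the masks expect
spec = timem(freqm(spec))
fbank = spec.squeeze(0).transpose(0, 1)    # back to (time, mel)
```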

Regarding the performance, do you have an expectation for the accuracy (e.g., from a baseline model, or from how humans do on the task)? Our training pipeline is not optimized for your task, so you would need to tune the hyperparameters yourself. Or, if you already have your own training pipeline, you could insert the AST model into it. In general, if you plan to use our pipeline, I suggest starting with the ESC50 recipe and modifying it from there.
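If you take the insert-into-your-own-pipeline route, instantiating the model looks roughly like this (a sketch assuming the repo's src/models is importable; the values mirror your 4-class, input_tdim=512 setup):

```python
import torch
from models import ASTModel  # from this repo's src/ directory

# 4 output classes, 128 mel bins, 512 time frames, ImageNet+AudioSet pretraining
model = ASTModel(label_dim=4, input_fdim=128, input_tdim=512,
                 imagenet_pretrain=True, audioset_pretrain=True)

dummy = torch.zeros(1, 512, 128)  # (batch, time, mel) fbank input
logits = model(dummy)             # shape (1, 4)
```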

-Yuan