YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Parameters for tuning #33

Closed lyghter closed 2 years ago

lyghter commented 2 years ago

Hello @YuanGongND. I am trying to train AST on a dataset that is very similar to Speech Commands, but with some differences:

  1. Could you advise which parameters I should change?
  2. I have enough resources and want to increase the accuracy. How can I do this?
YuanGongND commented 2 years ago

Hi there,

Please see the last section of the readme file for new datasets.

Since I don't know your task, it's hard for me to give specific suggestions. Maybe you can start with the same parameters as the SC recipe, but change the audio length to 640 here. I assume you meant 64000 samples, not frames, so 64000 samples = 640 frames.

You might also need to make the batch size smaller to fit the input into GPU memory. If so, you could also decrease the learning rate and use `scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=args.lr_patience, verbose=True)` in `traintest.py`. In any case, for a new dataset you need to search the hyperparameters, especially the batch size and learning rate.
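To make the scheduler suggestion concrete, here is a minimal pure-Python sketch of what `ReduceLROnPlateau` with `mode='max'` does: halve the learning rate once the monitored metric (e.g. validation accuracy) has failed to improve for more than `patience` consecutive epochs. This is illustrative only; in the actual recipe you would use `torch.optim.lr_scheduler.ReduceLROnPlateau` as quoted above, and the class name here is made up for the sketch.

```python
class PlateauHalver:
    """Toy stand-in for ReduceLROnPlateau(mode='max', factor=0.5)."""

    def __init__(self, lr, factor=0.5, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric > self.best:           # mode='max': higher metric is better
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor   # plateau detected: reduce the LR
                self.bad_epochs = 0
        return self.lr
```

For example, with `lr=1e-3` and `patience=2`, the learning rate stays at `1e-3` while accuracy improves, and drops to `5e-4` only after three epochs in a row without improvement.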

I suggest using the AudioSet pretrained model.

-Yuan

lyghter commented 2 years ago

Thanks. Could you give me a hint on how the hyperparameters (`freqm`, `timem`, `fstride`, `tstride`, and `num_mel_bins`) might affect model accuracy and memory usage? Should `num_mel_bins` be equal to `input_tdim`?

lyghter commented 2 years ago

Your suggestion for my dataset: 64000 samples -> `target_length=640`.
Your settings for Speech Commands: 16000 samples -> `target_length=128` (why not 160?)

YuanGongND commented 2 years ago

Sorry, I was wrong: 64000 samples should be 64000 / 16000 / 0.01 = 400 frames if your sampling rate is 16 kHz. `freqm` and `timem` have a relatively small impact on performance. `fstride` and `tstride` do not depend on the input audio length, so you can reuse our parameters. `num_mel_bins` and `input_tdim` are completely different and should not be the same: the first is the number of frequency bins of the spectrogram, the second is the number of frames. They are equal for the SC dataset only by coincidence.
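The corrected arithmetic above assumes a 10 ms frame shift, the default for Kaldi-style fbank features (check your own feature settings). A small sketch of that conversion, with a hypothetical helper name:

```python
def num_frames(num_samples, sample_rate=16000, frame_shift_s=0.01):
    """Spectrogram frames for a clip: samples / sample rate / frame shift.
    round() avoids off-by-one errors from floating-point division."""
    return round(num_samples / sample_rate / frame_shift_s)

# 64000 samples at 16 kHz with a 10 ms shift -> 400 frames
# 16000 samples (1 s, as in Speech Commands)  -> 100 frames
```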

`target_length=100` should be sufficient for the SC dataset, but we use 128 to guarantee that all audio clips are shorter than the target length.
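Choosing `target_length` as an upper bound works because clips are conventionally zero-padded in time up to `target_length`, and anything longer is truncated. A minimal sketch of that convention (plain lists here; the actual dataloader operates on spectrogram tensors, and this helper name is made up):

```python
def fit_to_length(frames, target_length, pad_value=0.0):
    """Zero-pad (or truncate) a sequence of frames to exactly target_length."""
    if len(frames) < target_length:
        # pad at the end so every clip has the same time dimension
        return frames + [pad_value] * (target_length - len(frames))
    return frames[:target_length]
```

So a 100-frame SC clip padded to `target_length=128` gains 28 trailing zero frames, and every item in the batch ends up with an identical time dimension.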

Best, Yuan