Hi there,
Please see the last section of the readme file for new datasets.
Since I don't know your task, it's hard for me to give specific suggestions. Maybe you can start with the same parameters as the SC recipe and just change the audio length to 640 here. I assume you meant 64000 samples, not frames, so 64000 samples = 640 frames.
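For context, this is roughly the pad-or-truncate step the dataloader applies to reach `target_length` frames; the function below is an illustrative sketch, not the repo's exact code:

```python
import torch
import torchaudio

def load_fbank(wav_path, target_length, num_mel_bins=128):
    """Return a (target_length, num_mel_bins) log-mel filterbank tensor."""
    waveform, sr = torchaudio.load(wav_path)
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, sample_frequency=sr, num_mel_bins=num_mel_bins,
        frame_shift=10)                      # 10 ms hop -> ~100 frames per second
    p = target_length - fbank.shape[0]
    if p > 0:                                # clip too short: zero-pad the time axis
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, p))
    elif p < 0:                              # clip too long: truncate
        fbank = fbank[:target_length, :]
    return fbank
```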
You might also need to make the batch size smaller to fit the input into GPU memory. If so, you could also decrease the learning rate and use `scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=args.lr_patience, verbose=True)` in `traintest.py`. In any case, for a new dataset you need to search the hyperparameters, especially the batch size and learning rate.
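A minimal sketch of that setup (the batch size, learning rate, and patience values are placeholders you would need to tune, and the toy model stands in for AST):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # toy stand-in for the AST model used in traintest.py

batch_size = 24            # placeholder: smaller than the original recipe to fit GPU memory
lr = 1.25e-4               # placeholder: scaled down alongside the batch size
lr_patience = 2            # stands in for args.lr_patience

optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.5, patience=lr_patience, verbose=True)

# After each validation pass, step on the metric being maximized (mode='max'):
val_accuracy = 0.0         # placeholder for the tracked validation metric
scheduler.step(val_accuracy)
```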
I suggest using the AudioSet pretrained model.
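A rough sketch of how the AudioSet-pretrained model is typically instantiated in this repo (argument names and values here are illustrative; check the README for the exact interface):

```python
import torch
from src.models import ASTModel   # import path as used in this repo

input_tdim = 128                  # illustrative; must match your target_length
model = ASTModel(label_dim=35,    # illustrative: number of classes in your dataset
                 input_fdim=128,  # num_mel_bins
                 input_tdim=input_tdim,
                 imagenet_pretrain=True,
                 audioset_pretrain=True)   # initialize from the AudioSet checkpoint

# The forward pass expects (batch, time_frames, mel_bins):
dummy = torch.zeros(2, input_tdim, 128)
logits = model(dummy)             # shape (2, 35)
```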
-Yuan
Thanks. Could you give me a hint on how the hyperparameters (`freqm`, `timem`, `fstride`, `tstride`, and `num_mel_bins`) might affect model accuracy and memory usage? Should `num_mel_bins` be equal to `input_tdim`?
Your suggestion for my dataset: 64000 samples -> `target_length=640`
Your settings for Speech Commands: 16000 samples -> `target_length=128` (why not 160?)
Sorry, I was wrong: 64000 samples should be 64000 / 16000 / 0.01 = 400 frames if your sampling rate is 16 kHz. `freqm` and `timem` have a relatively small impact on performance. `fstride` and `tstride` do not depend on the input audio length, so you can reuse our parameters. `num_mel_bins` and `input_tdim` are completely different and should not be the same: the first is the number of frequency bins of the spectrogram, the second is the number of time frames. They are the same for the SC dataset only by coincidence. `target_length=100` should be sufficient for the SC dataset, but we use 128 to guarantee all audios are shorter than the `target_length`.
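To make the frame arithmetic and the two axes concrete, a small sketch assuming 16 kHz audio, a 10 ms frame shift, and illustrative mask widths for `freqm`/`timem`:

```python
import torch
import torchaudio

sample_rate, num_samples = 16000, 64000          # a 4-second clip
# frames = duration / frame_shift = 64000 / 16000 / 0.01 = 400
print(num_samples / sample_rate / 0.01)          # 400.0

fbank = torchaudio.compliance.kaldi.fbank(
    torch.randn(1, num_samples), sample_frequency=sample_rate,
    num_mel_bins=128, frame_shift=10)
print(fbank.shape)   # ~(400, 128): time frames (input_tdim) x mel bins (num_mel_bins)

# freqm / timem are SpecAugment mask widths on those two axes
# (the mask values below are illustrative, not the recipe's settings):
spec = fbank.T.unsqueeze(0)                      # (1, freq, time) as the transforms expect
spec = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)(spec)
spec = torchaudio.transforms.TimeMasking(time_mask_param=96)(spec)
fbank = spec.squeeze(0).T                        # back to (time, freq)
```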
Best, Yuan
Hello @YuanGongND. I am trying to train AST on a dataset that is very similar to Speech Commands, but: