Closed MelikaBahmanabadi closed 3 years ago
You generally should not train with such long sequences, since it would require far too much memory, plus enormous amounts of training data and compute, to train a model on long segments. Durations of 15-20 seconds per segment are generally used for training.
It would be best to segment the audio in the second dataset. If that is not possible, try combining both the shorter- and longer-duration datasets and training for a long time (at least a hundred epochs or more) to fit the long-duration audio corpus properly.
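The segmentation step above can be sketched with just the standard-library `wave` module. This is a minimal illustration, not part of NeMo: the file names and the 15-second chunk length are assumptions chosen for the example.

```python
# Sketch: split a long WAV file into ~15 s training segments using only the
# Python standard library. Chunk length and file names are illustrative.
import wave

def split_wav(path, out_prefix, chunk_seconds=15):
    """Split a WAV file into fixed-length chunks; returns the chunk paths."""
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = params.framerate * chunk_seconds
        paths = []
        i = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{out_prefix}_{i:04d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            paths.append(out_path)
            i += 1
    return paths

# Demo: write a synthetic 60 s mono 16 kHz file, then split it into 15 s chunks.
with wave.open("long_clip.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000 * 60)  # 60 s of silence

chunks = split_wav("long_clip.wav", "chunk")
print(len(chunks))  # 60 s / 15 s -> 4 chunks
```

After splitting, each chunk would get its own line in the training manifest (with a matching transcript segment, which is the hard part in practice).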
Also, note that you can take a QuartzNet model trained on audio up to 16.7 seconds long and evaluate it on 2-5 minute clips with no issue. There may be a modest word error rate increase without fine-tuning it.
Thanks
Hi, I used QuartzNet 15x5 and fine-tuned it with the Persian Common Voice dataset; the WER was 30%. Everything worked well and I saved it as a checkpoint. After a while, I restored the checkpoint and fine-tuned it with another Persian dataset whose clips are longer than 16.7 seconds (i.e., 2 minutes, 3 minutes, and so on). When I test the model on the long-duration audio, the output words come out disordered and the model cannot transcribe any audio correctly. What is your advice for fine-tuning QuartzNet 15x5 on long-duration audio? Thanks