Closed Jxu-Thu closed 4 months ago
So typically only the final stage of speaker-adapted training uses the full data set; you can see the default settings for training here: https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/configuration/acoustic_modeling.html#default-training-config-file.
The subsets used at each stage are listed in that config.
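If you want to change those subsets, you can pass a custom config to `mfa train`. A minimal sketch of the shape of such a config (the stage names follow the linked docs, but the subset values below are illustrative placeholders, not the documented defaults, so check that page for the real schema and numbers):

```bash
# Hypothetical training config; subset values are placeholders, not the
# documented defaults -- see the linked config page for the real values.
cat > train_config.yaml <<'EOF'
training:
  - monophone:
      subset: 10000
  - triphone:
      subset: 20000
  - sat:
      subset: 50000
EOF

# Pass the custom config when training (corpus/dictionary paths are placeholders).
mfa train /data/corpus english_us_mfa model.zip --config_path train_config.yaml
```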
The largest training corpus I've used is the one behind the english_mfa model, which has 2 million utterances.
I will say that there are diminishing returns to training on larger datasets here, because almost all of the training is done with fewer than 150,000 utterances. I would probably recommend creating a subset of your data that is highest quality in terms of transcription accuracy, noise, etc., of around 1-2 million utterances, and using that to train a model that then gets used to align the rest. It also depends on your end goal, i.e., do you just need alignments, do you need a GMM-HMM model, or is this just a stepping stone to train a DNN model, etc.
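That two-step workflow needs no special API, just the standard `mfa train` and `mfa align` commands. A minimal sketch (the corpus paths and dictionary name here are assumptions for the example):

```bash
# 1) Train an acoustic model on the curated high-quality subset
#    (assumes the dictionary has already been downloaded, e.g. via
#    `mfa model download dictionary english_us_mfa`).
mfa train /data/corpus_subset english_us_mfa subset_model.zip

# 2) Align the remaining (or full) corpus with the model from step 1.
mfa align /data/corpus_rest english_us_mfa subset_model.zip /data/aligned_textgrids
```

The second command writes TextGrid alignments to the output directory, so no manual glue code is needed between the two steps.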
I find that if I have a dataset of over 10,000,000 wav clips, `mfa train` is very slow (over 7 days on modern CPU machines). So I want to know whether the best approach is to: 1) first split out a portion of the data, for example a few million records, for MFA training; 2) use the model trained by MFA to run alignment (inference) on the remaining data. If this is the right approach, does the official API currently support it, or do users need to perform these steps manually?