Closed Jxu-Thu closed 4 months ago
So typically only the final stage of speaker-adapted training uses the full data set; you can see the default settings for training here: https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/configuration/acoustic_modeling.html#default-training-config-file.
The subsets used at each stage are listed in that config.
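If you want to change those subsets, you can pass a custom config to `mfa train`. A minimal sketch of the shape of such a config (the stage names follow the linked docs, but the subset values below are illustrative placeholders, not the documented defaults, so check that page for the real schema and numbers):

```bash
# Hypothetical training config; subset values are placeholders, not the
# documented defaults -- see the linked config page for the real values.
cat > train_config.yaml <<'EOF'
training:
  - monophone:
      subset: 10000
  - triphone:
      subset: 20000
  - sat:
      subset: 50000
EOF

# Pass the custom config when training (corpus/dictionary paths are placeholders).
mfa train /data/corpus english_us_mfa model.zip --config_path train_config.yaml
```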
The largest training corpus I've used is the one behind the english_mfa model, which has 2 million utterances.
I will say that there are diminishing returns to training on larger datasets here, because almost all of the training is done with fewer than 150,000 utterances. I would probably recommend creating a subset of your data that is highest quality in terms of transcription accuracy, noise, etc., of around 1-2 million utterances, and using that to train a model that then gets used to align the rest. It also depends on your end goal, i.e., do you just need alignments, do you need a GMM-HMM model, or is this just a stepping stone to train a DNN model, etc.
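That two-step workflow needs no special API, just the standard `mfa train` and `mfa align` commands. A minimal sketch (the corpus paths and dictionary name here are assumptions for the example):

```bash
# 1) Train an acoustic model on the curated high-quality subset
#    (assumes the dictionary has already been downloaded, e.g. via
#    `mfa model download dictionary english_us_mfa`).
mfa train /data/corpus_subset english_us_mfa subset_model.zip

# 2) Align the remaining (or full) corpus with the model from step 1.
mfa align /data/corpus_rest english_us_mfa subset_model.zip /data/aligned_textgrids
```

The second command writes TextGrid alignments to the output directory, so no manual glue code is needed between the two steps.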
I find that if I have a dataset of over 10,000,000 wav clips, `mfa train` is very slow (over 7 days on modern CPU machines). So I want to know whether the best approach is to: 1) first split out a portion of the data, for example a few million records, for MFA training; 2) use the model trained by MFA to run alignment (inference) on the remaining data. If this is the right approach, does the official API currently support it, or do users need to perform these steps manually?