MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

[Question] What is the best practice for `mfa train` if I have a very large dataset? #769

Closed Jxu-Thu closed 4 months ago

Jxu-Thu commented 4 months ago

I find that if I have a dataset of over 10,000,000 wav clips, `mfa train` is very slow (over 7 days on modern CPU machines). So I want to know whether the best approach is to: 1) first carve out a portion of the data, for example a few million recordings, for MFA training; and 2) use the model trained by MFA to run inference on the remaining data. If this is necessary, does the official API currently support it, or do users need to perform these steps manually?

mmcauliffe commented 4 months ago

So typically only the final stage of speaker-adapted training uses the full dataset; you can see the default settings for training here: https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/configuration/acoustic_modeling.html#default-training-config-file.

The subsets used are:

  1. Monophone uses 10,000 utterances.
  2. Triphone, LDA, and first pass of SAT use 20,000 utterances.
  3. Second pass SAT and first pass pronunciation modeling use 50,000 utterances.
  4. Third pass SAT and second pass pronunciation modeling use 150,000 utterances.
  5. Fourth pass SAT uses all utterances.
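This schedule corresponds to the `subset` keys in the training config YAML referenced above. A trimmed sketch of what such a file might look like (block names and the `subset: 0` convention for "use all utterances" are from the linked docs as I remember them, so check them against your MFA version; all other parameters keep their defaults):

```yaml
# Hypothetical training config sketch; pass it via `mfa train --config_path config.yaml ...`.
# Only the subset keys are shown here.
training:
  - monophone:
      subset: 10000     # stage 1
  - triphone:
      subset: 20000     # stage 2
  - lda:
      subset: 20000
  - sat:
      subset: 20000
  - sat:
      subset: 50000     # stage 3, with first-pass pronunciation modeling
  - sat:
      subset: 150000    # stage 4, with second-pass pronunciation modeling
  - sat:
      subset: 0         # 0 = all utterances (final pass)
```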

The largest corpus that I've used for training is the one behind the english_mfa model, which has 2 million utterances.

I will say that there are diminishing returns to training on larger datasets here, because almost all of the training is done with fewer than 150,000 utterances. I would probably recommend creating a subset of around 1-2 million utterances that is highest quality in terms of transcription accuracy, noise, etc., using that to train a model, and then using that model to align the rest. It also depends on your end goal, i.e., do you just need alignments, do you need a GMM-HMM model, or is this just a stepping stone to train a DNN model, etc.
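There isn't a single built-in command for this train-on-a-subset-then-align-the-rest workflow, but it is just two standard MFA invocations run back to back. A sketch (all paths here are placeholders; the positional arguments follow the usual `mfa train` / `mfa align` signatures, which you should verify against your installed version):

```shell
# 1) Train an acoustic model on the curated high-quality subset
#    (corpus dir, pronunciation dictionary, output model path).
mfa train /data/subset_corpus /data/dictionary.dict /data/models/acoustic.zip

# 2) Align the full (or remaining) corpus with the trained model
#    (corpus dir, dictionary, acoustic model, output directory for TextGrids).
mfa align /data/full_corpus /data/dictionary.dict /data/models/acoustic.zip /data/alignments
```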