DanielSWolf closed this issue 3 years ago
You can; the training regime is largely based on Kaldi's recipes for building ASR systems, since forced alignment is one step in training them. In general, though, the architectures used in MFA are outdated compared to the current state of the art. Most Kaldi recipes set up a GMM-HMM system and then use its alignments to train a DNN acoustic model, which performs better. Forced alignment is a much easier task than large-vocabulary speech recognition, and most papers that use MFA treat it only as a preprocessing step, building an additional system from the resulting alignments.
With that said, I have been playing around with implementing some basic ASR functionality via the `mfa train_lm` and `mfa transcribe` commands, which use the trained GMM-HMM acoustic models. I'm still very much iterating on and tweaking them, and they're geared more towards offline transcription of larger corpora than online use cases, so they're likely not exactly what you're looking for, but they might be worth playing around with to see if they're useful for you.
Thanks for the explanation! I didn't realize the Kaldi recipes all used DNN architectures by now. I'll check out your `transcribe` command!
I want to perform both speech recognition and forced alignment. Both tasks require an acoustic model. My understanding is that the MFA `train` command creates the same kind of GMM-HMM model as the various Kaldi recipes.
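For concreteness, the two workflows starting from a single trained model might look roughly like this. This is a sketch with placeholder paths, and the argument order is recalled from the MFA 2.x command line rather than checked, so confirm with `mfa train --help` and `mfa align --help`:

```shell
# Train a GMM-HMM acoustic model from a corpus and pronunciation
# dictionary, saving the model for reuse (placeholder paths)
mfa train ~/corpus ~/models/dictionary.dict ~/models/acoustic_model.zip

# Reuse the same model for forced alignment, producing TextGrid files
mfa align ~/corpus ~/models/dictionary.dict \
    ~/models/acoustic_model.zip ~/corpus_aligned
```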