DanielSWolf closed this issue 3 years ago
You can; the training regime is largely based on Kaldi's recipes for building ASR systems, since forced alignment is one step in training them. In general, though, the architectures used in MFA are outdated compared to the current state of the art. Most Kaldi recipes set up a GMM-HMM system and then use its alignments to train a DNN acoustic model, which performs better. Forced alignment is a much easier task than large-vocabulary speech recognition, and most papers that use MFA treat it only as a preprocessing step, building an additional system from the resulting alignments.
With that said, I have been playing around with implementing some basic ASR functionality via the `mfa train_lm` and `mfa transcribe` commands, which use the trained GMM-HMM acoustic models. I'm still very much iterating on and tweaking them, and they're geared more towards offline transcription of larger corpora than online use cases, so they're likely not exactly what you're looking for, but they might be worth playing around with to see if they're useful for you.
Thanks for the explanation! I didn't realize the Kaldi recipes all used DNN architectures by now. I'll check out your `transcribe` command!
I want to perform both speech recognition and forced alignment. Both tasks require an acoustic model. My understanding is that the MFA `train` command creates the same kind of GMM-HMM model as the various Kaldi recipes.
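For concreteness, the two workflows starting from a single trained model might look roughly like this. This is a sketch with placeholder paths, and the argument order is recalled from the MFA 2.x command line rather than checked, so confirm with `mfa train --help` and `mfa align --help`:

```shell
# Train a GMM-HMM acoustic model from a corpus and pronunciation
# dictionary, saving the model for reuse (placeholder paths)
mfa train ~/corpus ~/models/dictionary.dict ~/models/acoustic_model.zip

# Reuse the same model for forced alignment, producing TextGrid files
mfa align ~/corpus ~/models/dictionary.dict \
    ~/models/acoustic_model.zip ~/corpus_aligned
```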