MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Settings for training on single-word utterances (as in lab speech) #690

Open rbennett24 opened 11 months ago

rbennett24 commented 11 months ago

**Is your feature request related to a problem? Please describe.**
I'm having trouble getting MFA to produce output TextGrids when training a new aligner on a (large) set of single-word utterances collected as laboratory speech. (I see from some Googling around that this may be a known issue: https://groups.google.com/g/mfa-users/c/G5VrtE24Vj0.) No problems arise during validation, but during training I get "No files were aligned, this likely indicates serious problems with the aligner" errors during the alignment part of the sat phase.

**Describe the solution you'd like**
Documentation of recommended settings for training on sets of single-word utterances, and/or an option to invoke such settings as the defaults when validating a corpus or doing model training + alignment.

**Describe alternatives you've considered**
I've tried to align the same data after combining the single-word utterances from each speaker into larger files (~6 minutes total each), but this doesn't work either, and the same issues arise. I assume that's because these files are too long? I still need to play around with combining them into a series of mid-sized files of 5-15 words each (for example).

Adjusting beam width does not resolve these issues (https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/configuration/global.html#global-options).
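For concreteness, the kind of invocation I've been trying looks something like this (paths are placeholders, and the config keys are my reading of the linked docs):

```bash
# config.yaml (placeholder values, widening the default beams):
#   beam: 100
#   retry_beam: 400
mfa train /path/to/single_word_corpus /path/to/lang.dict /path/to/output_model.zip \
    --config_path config.yaml
```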

I also see that other people have addressed this issue by deleting files in \Documents\MFA; this does not resolve the issue for me.

mmcauliffe commented 11 months ago

Can you try running mfa segment with the latest 3.0 alpha? That should break the transcription into segments that can be aligned, and lab speech should be pretty easy to segment with basic VAD (but you can use the --speechbrain flag for a better VAD if it doesn't work).

The latest version of mfa segment does an extra step of trying to align as much of the transcript as possible to the VAD-created segments, so it should generate a file that you can then use for mfa align. I might incorporate this as the default behavior for longer files in the future, but since it hasn't been tested thoroughly outside of my own use cases, I figure it's better to generate an intermediate TextGrid file that can be inspected and aligned as necessary.
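As a rough sketch of the workflow (placeholder paths, with english_us_arpa standing in for whatever dictionary and acoustic model you're using; double-check the argument order against the 3.0 docs):

```bash
# Segment the long files into alignable chunks; in 3.0 this uses a dictionary
# and acoustic model to match the transcript to the VAD-created segments.
mfa segment /path/to/long_files english_us_arpa english_us_arpa /path/to/segmented

# If the basic energy-based VAD misses boundaries, try SpeechBrain's VAD.
mfa segment /path/to/long_files english_us_arpa english_us_arpa /path/to/segmented --speechbrain
```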

rbennett24 commented 11 months ago

Thanks Michael. I did upgrade to v3.0.0a4, and that fixed the no-output problem: TextGrids are now being produced for the large set of single-word utterances. Still, the alignments are quite bad.

When you suggest running mfa segment, should I try that on the large set of extracted, single-word utterance .wav files, or on the small set of longer, reconstructed .wav files that are 5-10 minutes each and contain only the target single-word utterances concatenated with each other? Just checking.

mmcauliffe commented 11 months ago

The mfa segment call would be on the wav files that are 5-10 minutes long with .lab file transcriptions, and this will produce TextGrid files for them that should have intervals for each word in the original transcription. After sanity-checking them, you'd have to either make a new corpus with the wav/TextGrid pairs or use the generated TextGrid directory as the corpus directory and specify --audio_directory /path/to/original/corpus.
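So something along these lines (placeholder paths again):

```bash
# Align using the segment-generated TextGrids as the corpus, pointing
# --audio_directory back at the original wav files.
mfa align /path/to/segmented english_us_arpa english_us_arpa /path/to/aligned \
    --audio_directory /path/to/original/corpus
```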

If you then run mfa align on these generated files, it shouldn't be too bad, though if you can show exactly how the alignments are bad, that'd be helpful. It might be worthwhile to try the --speechbrain flag for segment generation, or it might be that the segments aren't getting enough silence padding, which can be controlled by setting --close_th to 1.0 or something higher (the default is 0.333 seconds, half of which is used to pad segments, so setting it to 1.0 would result in 0.5 seconds of padding).
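For example (placeholder paths, with english_us_arpa again standing in for your dictionary/model):

```bash
# SpeechBrain VAD with a larger close threshold, so each segment gets about
# 0.5 s of silence padding instead of the default ~0.17 s.
mfa segment /path/to/long_files english_us_arpa english_us_arpa /path/to/segmented \
    --speechbrain --close_th 1.0
```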

Let me know if that makes sense or if anything needs clarification; I'm still working on docs (and the actual API) for explaining all this.

That said, this might not be the root of your alignment quality problem, since those should be single-word utterances. If you don't have any silence in any of the files, you might want to specify --initial_silence_probability 0.0 to make sure it isn't trying to insert silence initially (again, mostly guessing at possible causes of the poor alignment).
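Something like (placeholder paths):

```bash
# Disable the initial silence state for files with no leading silence
# (again, a guess at the cause of the poor alignments).
mfa align /path/to/corpus /path/to/lang.dict /path/to/model.zip /path/to/aligned \
    --initial_silence_probability 0.0
```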

rbennett24 commented 11 months ago

Am I right that mfa segment requires a pretrained acoustic model? When I try to call it without one, I get an error; but of course the issue to begin with is that I'm having trouble training a model. On v3.0.0a4 I'm now getting divide-by-zero errors (ZeroDivisionError: float division by zero) when I try to train on these longer, re-concatenated files, even when playing with beam width, so I can't even train a bad model on those files.

I'll play around with some of the other suggestions in this thread as well, thanks.

mmcauliffe commented 11 months ago

Yes, it does require an acoustic model (sorry, I missed the part that you were trying to train a model, not just align). If it's not a language with a pretrained model available, you can try using mfa segment_vad, but the generated TextGrids will just have "speech" intervals, since it won't know what's in them. If it's a single word per utterance, though, you might be able to script something extra to put the words in.
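That is, something like this (placeholder paths; VAD-only, so no dictionary or acoustic model arguments):

```bash
# VAD-only segmentation; the output TextGrids contain bare "speech" intervals
# that a small script could relabel with each utterance's known word.
mfa segment_vad /path/to/long_files /path/to/vad_segmented
```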