MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/

Recovering words that get turned into [bracketed] #754

Closed · kalvinchang closed this issue 7 months ago

kalvinchang commented 7 months ago

Is your feature request related to a problem? Please describe.
Is there a way to recover words that get turned into [bracketed]?

I am using brackets to store the speaker id, which we need when segmenting our clips into single-speaker utterances. I am also using brackets to mark dysfluencies, but that information is lost in the aligned TextGrid.

Describe the solution you'd like
A way to recover words that get normalized into [bracketed], OR a way to pass annotations through MFA.

Describe alternatives you've considered
My current workaround is to not mark dysfluencies in brackets and to wrap the speaker id in //, which MFA ignores.
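
For illustration only, a hypothetical transcript snippet using that convention (the //…// speaker tokens and the wording are made up, not taken from the corpus):

```
//S1// so tell me a little about where you grew up
//S2// well I grew up um just outside of Columbus
```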

mmcauliffe commented 7 months ago

Speaker information should be specified either via the directory of the wav/lab file or via TextGrid tiers, rather than as a "word" in the transcription, since speaker information is used for calculating CMVN and feature-space transforms. See https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/corpus_structure.html for how the corpus should be structured, and then remove the speaker tag from the transcripts so that it doesn't insert an unneeded OOV.
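
For reference, a minimal sketch of the speaker-per-directory layout described on that page (speaker and file names here are placeholders), followed by a typical invocation assuming the pretrained `english_us_arpa` dictionary and acoustic model have already been fetched with `mfa model download`:

```
corpus/
├── speaker_a/
│   ├── recording1.wav
│   └── recording1.lab      # plain-text transcript for speaker_a's audio
└── speaker_b/
    ├── recording2.wav
    └── recording2.lab
```

```
mfa align corpus/ english_us_arpa english_us_arpa aligned_output/
```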

If you need the disfluencies or want to model them as subsequences of the following word, see https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/dictionary.html#modeling-cutoffs-and-hesitations. Let me know and reopen this if that doesn't match your use case.
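
Separately from the cutoff/hesitation modeling described on that page, common fillers can be kept from surfacing as OOVs simply by adding pronunciations for them to a copy of the dictionary and passing that file as the dictionary argument to `mfa align`. A minimal sketch in MFA's plain dictionary format (word, whitespace, space-separated phones), assuming an ARPA-style US English phone set; the specific entries are illustrative:

```
um      AH1 M
uh      AH1
hmm     HH AH0 M
```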

kalvinchang commented 7 months ago

I totally missed the dysfluencies part. Thank you

kalvinchang commented 7 months ago

As for the speaker information, in our use case it's not possible to segment the entire conversation by speaker a priori. Each clip is an interview; a transcript is provided for the whole clip, but there are no timestamps for when each speaker speaks. We're using MFA to get those timestamps so we can perform diarization. (We're using the Nationwide Speech Corpus.)
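
As a sketch of the downstream step (outside MFA itself), the word-level timestamps can be read back out of the aligned TextGrids with any TextGrid reader. This assumes the third-party `textgrid` Python package and simply iterates over whatever word tiers MFA wrote; the file path is a placeholder:

```python
# Sketch: collect word-level timestamps from an MFA output TextGrid
# for later speaker segmentation / diarization. Assumes `pip install textgrid`;
# the file path and output format are placeholders.
import textgrid

tg = textgrid.TextGrid.fromFile("aligned_output/interview_part1.TextGrid")

word_spans = []
for tier in tg.tiers:
    # MFA names word tiers "words" (or "<speaker> - words" when a file has several speakers)
    if "words" in tier.name:
        for interval in tier:
            if interval.mark:  # skip empty (silence) intervals
                word_spans.append((interval.minTime, interval.maxTime, interval.mark))

for start, end, word in word_spans:
    print(f"{start:.2f}\t{end:.2f}\t{word}")
```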