MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

--uses_speaker_adaptation off, but alignment still variable depending on set size #809

Open amo104 opened 1 month ago

amo104 commented 1 month ago

Hello,

I wanted some clarification on how MFA's alignment works. As I understand it, there's a two-pass alignment process, where the second pass utilizes per-speaker features. I'm trying to standardize my process through MFA so I can use the log-likelihood information to find errors in transcriptions and errors in speech productions. (I won't be needing the actual TextGrids for anything.) While aligning with --uses_speaker_adaptation off, --single_speaker on, and the same seed every run, I'm finding that the log-likelihoods are still variable.
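For concreteness, the command I'm running looks roughly like this (paths and the seed value are placeholders, and the exact boolean flag syntax may differ across MFA versions):

```bash
# Alignment with speaker adaptation off, a single pseudo-speaker, and a fixed
# seed; corpus, dictionary, output paths, and the seed value are placeholders.
mfa align ./corpus ./custom_dictionary.dict english_us_arpa ./aligned_out \
    --uses_speaker_adaptation false \
    --single_speaker \
    --seed 1234
```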

I ran a test on a wav set A, a wav set B, and a wav set C = A + B. It's unclear whether A/B or C is generally better or worse, but overall the log-likelihoods are somewhat variable when comparing the same file in one of the smaller sets against that file in the larger set. When aligning C twice, the output log-likelihoods are also marginally different, but they are stable to about five digits, which leads me to think this problem is related to the size of the data set.

[image: left, the results of sets A and B; right, the results of set C]

Is there anything more I could do to get outputs that are as consistent as possible, regardless of data set size? Or is this variance fundamental to how MFA works? For my purposes, getting the same result every time matters more than getting the best-fitting alignment, so I'm okay with turning off as many features as necessary. Thank you!

Specs: MFA 3.0.7, Windows 10 Education, 7,800 WAV files of pseudo-English non-words, using the english_us_arpa acoustic model and a custom dictionary

mmcauliffe commented 1 month ago

There are a couple of sources of variability that could be playing a role. The first is that feature generation uses dithering, which can be turned off via --dither 0. A larger source is likely that you have both sets of wav files in the same directory, correct? CMVN is still calculated at the speaker level, so adding files changes the CMVN stats. If you put each file in its own directory, MFA will calculate CMVN stats per file, which would then be consistent across datasets. If you already have them separated out, then dithering should be the only source of difference across runs.
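A minimal sketch of that restructuring, assuming a flat corpus_flat/ directory of wav files with matching .lab transcripts (directory names here are hypothetical):

```bash
# Move each wav (and its matching .lab transcript, if present) into its own
# directory so MFA treats every file as a separate speaker and computes CMVN
# stats per file rather than over the whole set.
mkdir -p corpus_split
for f in corpus_flat/*.wav; do
    base="$(basename "$f" .wav)"
    mkdir -p "corpus_split/$base"
    mv "$f" "corpus_split/$base/"
    [ -f "corpus_flat/$base.lab" ] && mv "corpus_flat/$base.lab" "corpus_split/$base/"
done
```

Aligning corpus_split with --dither 0 added to the command should then remove both sources of variability.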

amo104 commented 1 month ago

Great, it seems like implementing both of those changes gives me the behavior I want, thank you!! So for the future, instead of using --single_speaker, the goal would be to have MFA treat every file as a separate speaker (either by separating the files into directories or by passing a --speaker_characters value large enough that they all get split), correct?

mmcauliffe commented 1 month ago

Either splitting into directories or --speaker_characters 500 (or something obscenely large) would ensure the whole file name is used as the "speaker" (assuming the file names are unique); see the sketch below. I'll think about adding a flag for parsing the corpus as though every file is a unique speaker, since both the directory and --speaker_characters approaches are a bit roundabout, and it shouldn't be too much work. The only thing to decide is whether to add a boolean flag like --no_speakers, or to deprecate the current --single_speaker flag and add a --speaker_mode flag with options like "auto"/"directory"/"default" vs "single" vs "none".
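For reference, a sketch of the --speaker_characters variant (same placeholder paths as earlier in this thread): --speaker_characters takes the number of characters from the start of the file name to use as the speaker label, so any value longer than the longest file name makes the whole name the "speaker".

```bash
# With a character count longer than any file name, each file's full name
# becomes its "speaker", so per-speaker CMVN is effectively per-file.
# Paths are placeholders; file names must be unique across the corpus.
mfa align ./corpus_flat ./custom_dictionary.dict english_us_arpa ./aligned_out \
    --speaker_characters 500 \
    --dither 0
```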