MontrealCorpusTools / Montreal-Forced-Aligner

Command line utility for forced alignment using Kaldi
https://montrealcorpustools.github.io/Montreal-Forced-Aligner/
MIT License

Confidence Measure and Automatic Pronunciation Evaluation Using MFA #25

Open ziweizh opened 7 years ago

ziweizh commented 7 years ago

I am wondering if there is a way to use MFA to obtain phone-level alignment likelihoods. Say I have some audio files and transcriptions of non-native English speech. The sum of the MFA alignment likelihoods divided by the number of phones in the recognized words could then probably serve as a measure of deviation from the native reference model, since the default acoustic and language models used in MFA are trained on native English speech.
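
Roughly what I have in mind, as a sketch only (the per-phone likelihood values are hypothetical placeholders, since I don't yet know how to extract them from MFA):

```python
# Sketch of the scoring idea: average the per-phone alignment log likelihoods
# so that longer utterances aren't penalized just for containing more phones.
# The per-phone values here are hypothetical; whether MFA/Kaldi can expose
# them at all is what this issue is about.

def pronunciation_score(phone_log_likelihoods):
    """Mean per-phone alignment log likelihood for one utterance."""
    if not phone_log_likelihoods:
        raise ValueError("no phones to score")
    return sum(phone_log_likelihoods) / len(phone_log_likelihoods)

# Higher (less negative) averages would suggest speech closer to the native
# acoustic model used for alignment.
print(pronunciation_score([-4.2, -3.8, -7.1, -2.9]))
```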

mmcauliffe commented 7 years ago

Right, so I've looked into this a bit in the past, and the short answer is that there's no supported way to do it in Kaldi (see: https://groups.google.com/forum/#!topic/kaldi-help/yi9hyWxTPQQ, https://groups.google.com/forum/#!topic/kaldi-help/tBOrm7WBSf8, https://sourceforge.net/p/kaldi/discussion/1355348/thread/fd0f3c27/, etc).

The Kaldi maintainers obviously come at it more from an ASR perspective than a linguistics one, so it might be that phone-level log likelihoods would be informative for us, but it would require a fair bit of effort on my part to get working (though Dan Povey does outline a method that would get kinda close in one of those links).

tingyang01 commented 3 years ago

Hello, I also hope to get confidence scores from MFA, but I cannot get anything. Let me know how I can get the score.

DanielSWolf commented 2 years ago

I'd love to get confidence scores, too. In the 3rd link provided by @mmcauliffe, Daniel Povey writes:

If you want the total log-likelihoods over the whole utterance, for each path, it's very easy. Programs like nbest-to-linear will give you this. But the tools are not really designed to keep track of the per-frame or per-phone log-likelihoods.

While I'd love to get confidence scores per frame, per phone, or at least per word, getting a score per utterance would definitely be better than nothing for me!

Is that something that might be added to MFA?

(And once there is a way to get utterance-level confidence, I imagine that word-level confidence values might be just one hack away: after forced alignment, one could split the audio on word boundaries, then treat each audio fragment along with its transcript word as a new utterance. Calculating utterance-level confidence would then effectively yield word-level scores.)
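
A minimal sketch of that splitting step, assuming the word intervals have already been read out of the aligned TextGrid (the `word_intervals` argument and file names are hypothetical; TextGrid parsing itself isn't shown):

```python
# Hypothetical sketch: cut a wav into per-word clips using word boundaries
# from a forced alignment, so each clip plus its word could be re-scored as
# its own "utterance". Requires the soundfile package.
import os
import soundfile as sf

def split_on_words(wav_path, word_intervals, out_dir="word_clips"):
    """word_intervals: list of (word, start_sec, end_sec) tuples taken from
    the aligned TextGrid."""
    os.makedirs(out_dir, exist_ok=True)
    audio, sr = sf.read(wav_path)
    clips = []
    for i, (word, start, end) in enumerate(word_intervals):
        clip = audio[int(start * sr):int(end * sr)]
        out_path = os.path.join(out_dir, f"{i:04d}_{word}.wav")
        sf.write(out_path, clip, sr)
        clips.append((word, out_path))
    return clips

# Each (word, clip_path) pair would then be aligned and scored on its own to
# approximate a word-level confidence.
```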

DanielSWolf commented 2 years ago

This is the documentation of Kaldi's nbest-to-linear:

```
Takes as input lattices/n-bests which must be linear (single path); convert
from lattice to up to 4 archives containing transcriptions, alignments,
and acoustic and LM costs (note: use ark:/dev/null for unwanted outputs)
Usage: nbest-to-linear [options] <nbest-rspecifier> <alignments-wspecifier> [<transcriptions-wspecifier> [<lm-cost-wspecifier> [<ac-cost-wspecifier>]]]
 e.g.: lattice-to-nbest --n=10 ark:1.lats ark:- | \
   nbest-to-linear ark:1.lats ark,t:1.ali 'ark,t:|int2sym.pl -f 2- words.txt > text'
```

Unfortunately, I don't know how the files MFA creates map to the information required for this call. Is there a way to extract the required information from the output of MFA?

rcgale commented 2 years ago

Hi Daniel,

I'm trying out MFA as a way to more easily/reliably share results from some of my Kaldi models. As part of that effort, I was looking for the confidence scores myself. I really appreciate you linking the resources you found, since it helped me remember what some of the Kaldi binaries were called etc. Here's what I was able to find:

I like the optimism with your self-described hack for word-level scores, but keep in mind those computations would be a very crude approximation. The lattices used here are long sequences of conditional probabilities, and each word can be arrived at by an enormous number of paths, hence all the beam size and pruning hyperparameters in the configuration.

I had a nice chat with Mark Gales a few years ago at a conference; he was doing quite a bit of analysis on fine-grained confidence scores. He tried to answer some questions I had along these lines pertaining to Kaldi, and while he made it sound quite straightforward, I still haven't been able to make it happen myself. The CTM format (the same format MFA relies on) sometimes comes with a per-segment confidence score (lattice-to-ctm-conf provides it), but the numbers are terribly unreliable. (Lots of artificial 100% confidence from what I've seen, but I'm typically looking at phoneme-level segments.) What I understood from that conversation is that the numbers need to be normalized with a reliable denominator, which can be estimated (somehow) from the overall training set.
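
For what it's worth, the kind of normalization I took away from that conversation looks roughly like this. It's only a sketch, and the per-phone score tables are hypothetical inputs rather than anything MFA or Kaldi hands you directly:

```python
# Sketch: normalize raw per-phone scores against expectations estimated from
# a reference (training) corpus, so phones the model always scores low don't
# look like mispronunciations. All inputs are hypothetical.
from statistics import mean, stdev

def reference_stats(reference_scores):
    """reference_scores: {phone: [raw scores observed in the training set]}"""
    return {p: (mean(v), stdev(v))
            for p, v in reference_scores.items() if len(v) > 1}

def normalized_score(phone, raw_score, stats):
    """Z-score of a raw per-phone score relative to the reference corpus."""
    mu, sigma = stats[phone]
    return (raw_score - mu) / sigma if sigma else 0.0
```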

Hope there's something helpful in here, and I'd be curious to hear any developments.

EDIT: a quick note as I'm playing around with the per-utterance scores coming from alignment_log_likelihood.csv: you'll probably want to divide the number in that sheet by the duration of your segments to get a per-frame likelihood. That's working well for me, revealing all kinds of useful details (based on a quick listen to the best and worst segments in my data).
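
In case it saves someone a minute, here's roughly what I mean; the column names are guesses from my copy of the file and may differ between MFA versions:

```python
# Sketch: turn the per-utterance numbers in alignment_log_likelihood.csv into
# duration-normalized scores. Column names ("begin", "end", "log_likelihood")
# are assumptions and may not match every MFA version.
import pandas as pd

df = pd.read_csv("alignment_log_likelihood.csv")
df["duration"] = df["end"] - df["begin"]
# Per-second score; a per-frame score differs only by the (constant) frame
# rate, typically 100 frames per second.
df["log_likelihood_per_second"] = df["log_likelihood"] / df["duration"]

# Listening to the extremes is a quick sanity check on whether the scores
# track alignment quality.
print(df.sort_values("log_likelihood_per_second").head(10))
```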

DanielSWolf commented 2 years ago

@rcgale Thanks for your message!

Since I wrote my earlier posts, I have also found the alignment_log_likelihood.csv files. Unless I'm mistaken, the code that generates them is a rather recent addition to MFA. I haven't had a chance yet to dig deeper into the matter because I'm currently stuck cleaning up my training corpus. With my current corpus (which has a surprising number of transcription errors), the reported log likelihoods seem to be all over the place.

rcgale commented 2 years ago

I'm still looking too. Quick note: I was mistaken about normalizing the numbers; they are already normalized by duration within MFA. I think I'm just experiencing what you are, that they're all over the place.

li-henan commented 1 year ago

Dear friends, have you found a way to evaluate MFA? Thanks very much if you can reply.

li-henan commented 1 year ago

And have you also been able to get alignment_analysis.csv? Thanks very much if you can reply.

yzmyyff commented 5 months ago

I can't find alignment_log_likelihood.csv in MFA 3. Does anyone know how it was generated?