cmusphinx / pocketsphinx

A small speech recognizer

how to do pronunciation evaluation with pocketsphinx? #350

Closed YangangCao closed 1 year ago

YangangCao commented 1 year ago

Hi dear maintainer, I found this: https://cmusphinx.github.io/wiki/pocketsphinx_pronunciation_evaluation/ But pocketsphinx_continuous has disappeared. Can I use pocketsphinx_batch to achieve the same thing? Or can you give me an example of using pocketsphinx_batch? I don't even know how to set the input file. Thanks!

dhdaines commented 1 year ago

Hello, pocketsphinx_batch and pocketsphinx_continuous have been replaced with a program called pocketsphinx. For example, if you are in the top-level source directory and have compiled it with the output in build:

$ ./build/pocketsphinx single -hmm model/en-us/en-us -lm model/en-us/en-us.lm.bin -dict model/en-us/cmudict-en-us.dict test/data/cards/001.wav 
{"b":0.000,"d":1.090,"p":0.027,"t":"ten of clubs","w":[{"b":0.000,"d":0.150,"p":1.000,"t":"<s>"},{"b":0.150,"d":0.190,"p":0.275,"t":"ten"},{"b":0.340,"d":0.110,"p":0.962,"t":"of"},{"b":0.450,"d":0.510,"p":0.516,"t":"clubs"},{"b":0.960,"d":0.120,"p":1.000,"t":"</s>"}]}
YangangCao commented 1 year ago

Hi, thanks for your quick reply! I have tried pocketsphinx for pronunciation evaluation, but it does not work well. For example (this sentence is the correct transcript):

$ ./pocketsphinx -phone_align yes align 1.wav "these kids will be really disappointed if their trip gets cancelled"

{"b":0.000,"d":3.110,"p":1.000,"t":"these kids will be really disappointed if their trip gets cancelled","w":[{"b":0.000,"d":0.200,"p":0.990,"t":"<sil>","w":[{"b":0.000,"d":0.200,"p":0.990,"t":"SIL"}]},{"b":0.200,"d":0.230,"p":0.972,"t":"these","w":[{"b":0.200,"d":0.070,"p":0.994,"t":"DH"},{"b":0.270,"d":0.080,"p":0.992,"t":"IY"},{"b":0.350,"d":0.080,"p":0.986,"t":"Z"}]},{"b":0.430,"d":0.210,"p":0.952,"t":"kids","w":[{"b":0.430,"d":0.060,"p":0.981,"t":"K"},{"b":0.490,"d":0.080,"p":0.994,"t":"IH"},{"b":0.570,"d":0.040,"p":0.996,"t":"D"},{"b":0.610,"d":0.030,"p":0.981,"t":"Z"}]},{"b":0.640,"d":0.090,"p":0.963,"t":"will(2)","w":[{"b":0.640,"d":0.030,"p":0.973,"t":"W"},{"b":0.670,"d":0.030,"p":0.998,"t":"AH"},{"b":0.700,"d":0.030,"p":0.992,"t":"L"}]},{"b":0.730,"d":0.160,"p":0.974,"t":"be","w":[{"b":0.730,"d":0.040,"p":0.992,"t":"B"},{"b":0.770,"d":0.120,"p":0.982,"t":"IY"}]},{"b":0.890,"d":0.250,"p":0.957,"t":"really","w":[{"b":0.890,"d":0.070,"p":0.977,"t":"R"},{"b":0.960,"d":0.060,"p":0.997,"t":"IH"},{"b":1.020,"d":0.040,"p":0.989,"t":"L"},{"b":1.060,"d":0.080,"p":0.994,"t":"IY"}]},{"b":1.140,"d":0.600,"p":0.913,"t":"disappointed(2)","w":[{"b":1.140,"d":0.050,"p":0.992,"t":"D"},{"b":1.190,"d":0.040,"p":0.997,"t":"IH"},{"b":1.230,"d":0.080,"p":0.994,"t":"S"},{"b":1.310,"d":0.050,"p":0.996,"t":"AH"},{"b":1.360,"d":0.090,"p":0.995,"t":"P"},{"b":1.450,"d":0.150,"p":0.972,"t":"OY"},{"b":1.600,"d":0.030,"p":0.998,"t":"N"},{"b":1.630,"d":0.080,"p":0.980,"t":"IH"},{"b":1.710,"d":0.030,"p":0.985,"t":"D"}]},{"b":1.740,"d":0.120,"p":0.989,"t":"if","w":[{"b":1.740,"d":0.050,"p":0.994,"t":"IH"},{"b":1.790,"d":0.070,"p":0.995,"t":"F"}]},{"b":1.860,"d":0.120,"p":0.989,"t":"their","w":[{"b":1.860,"d":0.050,"p":0.996,"t":"DH"},{"b":1.910,"d":0.030,"p":0.997,"t":"EH"},{"b":1.940,"d":0.040,"p":0.996,"t":"R"}]},{"b":1.980,"d":0.240,"p":0.976,"t":"trip","w":[{"b":1.980,"d":0.080,"p":0.992,"t":"T"},{"b":2.060,"d":0.060,"p":0.996,"t":"R"},{"b":2.120,"d":0.040,"p":0.996,"t":"IH"},{"b":2.160,"d":0.060,"p":0.991,"t":"P"}]},{"b":2.220,"d":0.210,"p":0.961,"t":"gets","w":[{"b":2.220,"d":0.070,"p":0.984,"t":"G"},{"b":2.290,"d":0.060,"p":0.989,"t":"EH"},{"b":2.350,"d":0.040,"p":0.996,"t":"T"},{"b":2.390,"d":0.040,"p":0.992,"t":"S"}]},{"b":2.430,"d":0.670,"p":0.895,"t":"cancelled","w":[{"b":2.430,"d":0.070,"p":0.986,"t":"K"},{"b":2.500,"d":0.100,"p":0.994,"t":"AE"},{"b":2.600,"d":0.050,"p":0.997,"t":"N"},{"b":2.650,"d":0.070,"p":0.991,"t":"S"},{"b":2.720,"d":0.040,"p":0.997,"t":"AH"},{"b":2.760,"d":0.150,"p":0.985,"t":"L"},{"b":2.910,"d":0.190,"p":0.941,"t":"D"}]}]}

If I align the same audio file against a totally wrong sentence, I still get high "p" values, which is not what I want:

$ ./pocketsphinx -phone_align yes align 1.wav "hello thank you thank you very much"

{"b":0.000,"d":3.110,"p":1.000,"t":"hello thank you thank you very much","w":[{"b":0.000,"d":0.260,"p":0.956,"t":"<sil>","w":[{"b":0.000,"d":0.260,"p":0.956,"t":"SIL"}]},{"b":0.260,"d":0.330,"p":0.792,"t":"hello","w":[{"b":0.260,"d":0.140,"p":0.940,"t":"HH"},{"b":0.400,"d":0.120,"p":0.923,"t":"AH"},{"b":0.520,"d":0.030,"p":0.948,"t":"L"},{"b":0.550,"d":0.040,"p":0.963,"t":"OW"}]},{"b":0.590,"d":0.180,"p":0.910,"t":"thank","w":[{"b":0.590,"d":0.060,"p":0.982,"t":"TH"},{"b":0.650,"d":0.060,"p":0.955,"t":"AE"},{"b":0.710,"d":0.030,"p":0.984,"t":"NG"},{"b":0.740,"d":0.030,"p":0.986,"t":"K"}]},{"b":0.770,"d":0.140,"p":0.944,"t":"you","w":[{"b":0.770,"d":0.100,"p":0.963,"t":"Y"},{"b":0.870,"d":0.040,"p":0.980,"t":"UW"}]},{"b":0.910,"d":0.270,"p":0.880,"t":"thank","w":[{"b":0.910,"d":0.040,"p":0.963,"t":"TH"},{"b":0.950,"d":0.130,"p":0.940,"t":"AE"},{"b":1.080,"d":0.070,"p":0.986,"t":"NG"},{"b":1.150,"d":0.030,"p":0.986,"t":"K"}]},{"b":1.180,"d":0.060,"p":0.939,"t":"you","w":[{"b":1.180,"d":0.030,"p":0.974,"t":"Y"},{"b":1.210,"d":0.030,"p":0.964,"t":"UW"}]},{"b":1.240,"d":0.470,"p":0.677,"t":"very","w":[{"b":1.240,"d":0.220,"p":0.790,"t":"V"},{"b":1.460,"d":0.040,"p":0.940,"t":"EH"},{"b":1.500,"d":0.050,"p":0.961,"t":"R"},{"b":1.550,"d":0.160,"p":0.949,"t":"IY"}]},{"b":1.710,"d":0.450,"p":0.720,"t":"much","w":[{"b":1.710,"d":0.030,"p":0.978,"t":"M"},{"b":1.740,"d":0.050,"p":0.973,"t":"AH"},{"b":1.790,"d":0.370,"p":0.758,"t":"CH"}]},{"b":2.160,"d":0.940,"p":0.507,"t":"<sil>","w":[{"b":2.160,"d":0.940,"p":0.507,"t":"SIL"}]}]}

I am trying https://github.com/jsalsman/featex, which was mentioned in https://cmusphinx.github.io/wiki/pocketsphinx_pronunciation_evaluation/. What do you think of this repo? Or do you have any advice for making pocketsphinx more precise at this? Thanks very much!

jsalsman commented 1 year ago

Hi Yangang,

I'm the author of the page you mentioned and the repo to which you linked. Sadly both are out of date.

The "p" value is not a confidence score, and to the extent that it is, each one has its own different relative scale. The approach of trying to normalize them (including everything called "goodness of pronunciation" scores) is literally a dead end; the easiest way to understand why is to carefully read the part in https://www.isca-speech.org/archive/pdfs/interspeech_2015/loukina15_interspeech.pdf explaining that miscomprehensions are only due to pronunciation errors 14% of the time; the figure for how often such errors cause miscomprehensions is similarly very small. When you combine that with the fact that in the Common European Framework of Reference for Languages assessment criteria for "overall phonological control," intelligibility outweighs formally correct pronunciation at all levels, it is unavoidable that you want to measure intelligibility instead of any other measurement of pronunciation quality in any educational context except trying to train someone to exactly match one specific, invariable accent. If you do not, you are wasting the vast majority of the learner's time and will end up getting reviews such as you see at 2 minutes into https://www.youtube.com/watch?v=sKo6POdNyBI&t=120s -- Rosetta Stone's speech recognition lab director Barrett Davis apparently took my advice that intelligibility assessment is necessary to heart in 2018 when I was pitching him a contract; I am not sure whether or not I have yet convinced @lenzo-ka.

So, in the past I have recommended our more recent work, which you saw at https://arxiv.org/abs/1709.01713, and since February of this year you can use the excellent articulatory feature extraction system at https://github.com/articulatory/articulatory to supplement the model parameters at https://github.com/jsalsman/featex. However, this approach requires a relatively large quantity of learners' attempts at pronouncing words and phrases, along with blind transcriptions of those utterances. There is no avoiding this data collection task, although you can use blinded transcriptions made by the learners themselves, so it need not be expensive.

This is my current approach leveraging such learner transcriptions: https://i.ibb.co/ryTdVZ3/Screenshot-2023-06-06-7-19-00-AM.png

(diagram caption: Learners begin by answering questions in full sentences, then confirm the text of what they said. They then attempt to transcribe sentences spoken by other learners. The system identifies consequential mispronunciations from these transcriptions, repeating this process until sufficient data is gathered for generalization. The identified mispronunciations are then generalized by articulatory features over phonemes, diphones, syllables, and word(s), allowing for targeted remediation exercises. These exercises use natural spoken feedback, remixed to emphasize error locations. The system tracks the learner's progress throughout, recording their improvement and the areas that still need work. The process repeats, with the system selecting additional questions intended to elicit answers containing specific words that include the phones and segments in need of improvement, for transcription by other learners.)

I can say with certainty that this approach is far more effective per unit of time on task for intelligibility remediation than any current commercial or free product (such as Google Search's and Microsoft's pronunciation assessment and remediation tools), and than anything I have seen in the academic literature and patents, after a relatively exhaustive recent search.

I hope this helps. Please do not hesitate to follow up with more questions!

Best regards, Jim Salsman

YangangCao commented 1 year ago

I am very honored to receive your quick and precise reply, @jsalsman! I will read the material you recommend. Thanks,
I salute you.