Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK

Pronunciation Assessment Not Correctly Differentiating Between Certain Vowel Sound Phonemes #2431

Open calebeno opened 5 months ago

calebeno commented 5 months ago

Speech SDK log taken from a run that exhibits the reported issue.

AzureSpeechLogFile.txt

A stripped-down, simplified version of your source code that exhibits the issue. Or, preferably, try to reproduce the problem with one of the public samples in this repository (or a minimally modified version of it), and share the code.

As far as I can tell, the bug is not a product of my implementation. It is also reproducible in the online Speech Studio at https://speech.microsoft.com/portal/ under Pronunciation Assessment.

If relevant, a WAV file of your input audio.

I have included several examples of pronunciations of the word "lap". Each file is named with its IPA pronunciation. The correct pronunciation of the word "lap" is læp.

Azure Pronunciation Assessment Tests.zip

For assistance with decoding the IPA annotations: https://www.vocabulary.com/resources/ipa-pronunciation/

Additional information as shown below

Describe the bug

When using pronunciation assessment, the accuracy results do not align with what I would expect. For the purposes of this bug, I have provided test audio files for the word "lap", though similar behavior is noticeable with other words using different vowel sounds. Taking the word "lap" as the example on the Speech Studio test site, here are some results based on the audio files I have provided:

Note: We have our own scoring methodology that first compares the expected phoneme with the highest score in the NBest list. If those phonemes do not match, we score it as a mispronunciation. If they do match, there is an accuracy threshold that it must clear in order to be counted as a pass.
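The scoring rule described in this note can be sketched roughly as follows. This is only an illustration, not our production code: the function name `passes`, the threshold value of 60, and the choice to apply the threshold to the phoneme's accuracy score are all assumptions made for the example.

```python
from typing import Dict, List, Union

# Hypothetical threshold; the actual value is not stated in this issue.
ACCURACY_THRESHOLD = 60.0


def passes(
    expected: str,
    nbest: List[Dict[str, Union[str, float]]],
    accuracy: float,
    threshold: float = ACCURACY_THRESHOLD,
) -> bool:
    """Return True if the phoneme counts as correctly pronounced.

    Step 1: the highest-scoring NBest candidate must be the expected phoneme.
    Step 2: the accuracy score must clear the threshold (assumed here to be
            the phoneme-level accuracy score reported by the service).
    """
    if not nbest:
        return False
    # max() returns the first candidate on ties, so list order decides.
    top = max(nbest, key=lambda candidate: candidate["Score"])
    if top["Phoneme"] != expected:
        return False
    return accuracy >= threshold


# Values taken from the examples in this issue (only the listed NBest entries).
open_a = [{"Phoneme": "ɑ", "Score": 100}, {"Phoneme": "æ", "Score": 50}]
tied = [{"Phoneme": "æ", "Score": 100}, {"Phoneme": "ɛ", "Score": 100}]

print(passes("æ", open_a, accuracy=84))   # False: top candidate is "ɑ"
print(passes("æ", tied, accuracy=100))    # True: "æ" is listed first in the tie
```

Because `max()` keeps the first of several equally scored candidates, this sketch reproduces the order-dependent behavior described for the tied case.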

læp.wav: lap 100 (l 100, æ 100, p 100). This is exactly what I would expect, since it is the correct pronunciation.

lap 89 (l 100, æ 84, p 79). I would expect this to be slightly lower for accuracy. In the NBest list:

{
  "Phoneme": "ɑ",
  "Score": 100
},
...
{
  "Phoneme": "æ",
  "Score": 50
}

It correctly identifies the phoneme that is actually in the audio. I would expect this to be a mispronunciation. With our scoring, it would be.

lep.wav: lap 65 (l 82, æ 29, p 42). This correctly scores the æ very low, as I would expect, though the analysis code still doesn't score the word as a mispronunciation because the word score is 65. In our scoring method this would be counted as a mispronunciation.

lɪp.wav: lap 87 (l 93, æ 70, p 75). Here again, the accuracy score is fairly high, but the NBest list shows:

{
    "Phoneme": "ɪ",
    "Score": 100
},
{
    "Phoneme": "ɛ",
    "Score": 91
},
{
    "Phoneme": "æ",
    "Score": 20
},

Our system would count this as a mispronunciation, but I'm surprised at how high the accuracy score is for that phoneme (70).

lʌp.wav: lap 92 (l 98, æ 100, p 80)

{
    "Phoneme": "ʌ",
    "Score": 100
},
{
    "Phoneme": "æ",
    "Score": 66
}

Here's another example of the accuracy score being 100 for the phoneme when the highest-scoring phoneme is not the expected one.

lɛp.wav: lap 98 (l 100, æ 100, p 94)

{
    "Phoneme": "æ",
    "Score": 100
},
{
    "Phoneme": "ɛ",
    "Score": 100
},

This one is the most problematic. NBest records both "æ" and "ɛ" as 100. Because "æ" comes first in the list, our system marks this as correct. As with the others, I would expect "æ" to receive a lower score and "ɛ" to be at the top.
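One possible client-side guard against this tie case is to treat a result as ambiguous whenever the expected phoneme shares the top NBest score with another candidate. The sketch below is an assumption about how such a check could look; the helper names are hypothetical and it is not something the service or our current scoring does.

```python
from typing import Dict, List, Union

Candidate = Dict[str, Union[str, float]]


def top_phonemes(nbest: List[Candidate]) -> List[str]:
    """Return every phoneme that shares the highest NBest score."""
    if not nbest:
        return []
    best = max(candidate["Score"] for candidate in nbest)
    return [c["Phoneme"] for c in nbest if c["Score"] == best]


def is_ambiguous(expected: str, nbest: List[Candidate]) -> bool:
    """True if the expected phoneme ties for first place with another one."""
    top = top_phonemes(nbest)
    return expected in top and len(top) > 1


# NBest entries as listed above for lɛp.wav.
tied = [{"Phoneme": "æ", "Score": 100}, {"Phoneme": "ɛ", "Score": 100}]
print(top_phonemes(tied))        # ['æ', 'ɛ']
print(is_ambiguous("æ", tied))   # True
```

A result flagged this way could be sent for manual review instead of being marked correct purely because of list order.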

To Reproduce

Go to https://speech.microsoft.com/portal/ and open Pronunciation Assessment. Enter "lap" into the custom text field. Upload the different audio files I have provided (or do your own testing by alternating the vowel sounds between "l" and "p"). Inspect the JSON.

Expected behavior

See notes in the description. I would expect the phonetic analysis to differentiate vowel phonemes correctly, especially "æ" and "ɛ".

Version of the Cognitive Services Speech SDK

SDK for Unity 1.37.0, SDK for JavaScript 1.36.0

Platform, Operating System, and Programming Language

wangkenpu commented 5 months ago

@calebeno Thanks for your feedback regarding our services. Currently, NBestPhonemes and the word/phone accuracy scores come from different models. We've noticed the gap. We need some internal discussions to decide how to improve the score.

calebeno commented 5 months ago

@wangkenpu Thank you. It's good to know the reason behind the discrepancy. For now, NBestPhonemes seems to be the better evaluation option for our particular use case.

github-actions[bot] commented 4 months ago

This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.