Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Provide Custom Expected Phonemes In Pronunciation Assessment #2432

Open calebeno opened 3 weeks ago

calebeno commented 3 weeks ago

Is your feature request related to a problem? Please describe.

We are using pronunciation assessment for an educational app. A critical piece of our system is assessing students' pronunciation of various phonemes. After getting set up with Speech Studio using Pronunciation Assessment, we've been doing some testing and have run into a snag that a new feature would really help with: Providing custom expected pronunciation parameters for audio.

Let's take the word "Cal" for instance.

Azure Pronunciation Feature Request - Cal.zip

When you test these two, take a look at the pronunciation assessment result for the second phoneme of both:

kæl cal : 96 k 77 æ 100 l 100

{
    "Phoneme": "æ",
    "PronunciationAssessment": {
        "AccuracyScore": 100,
        "NBestPhonemes": [
            {
                "Phoneme": "æ",
                "Score": 100
            },
            {
                "Phoneme": "t",
                "Score": 94
            },
            {
                "Phoneme": "aʊ",
                "Score": 79
            },
            {
                "Phoneme": "k",
                "Score": 52
            },
            {
                "Phoneme": "l",
                "Score": 18
            }
        ]
    },
    "Offset": 4100000,
    "Duration": 1400000
},

kel cal : 100 k 100 eɪ 100 l 100

{
    "Phoneme": "eɪ",
    "PronunciationAssessment": {
        "AccuracyScore": 100,
        "NBestPhonemes": [
            {
                "Phoneme": "eɪ",
                "Score": 100
            },
            {
                "Phoneme": "k",
                "Score": 32
            },
            {
                "Phoneme": "ɛ",
                "Score": 13
            },
            {
                "Phoneme": "l",
                "Score": 6
            },
            {
                "Phoneme": "æ",
                "Score": 4
            }
        ]
    },
    "Offset": 5100000,
    "Duration": 1500000
},

The thing to notice here is that the expected phoneme and the related scores changed between the two pronunciations of the same word. Both are scored as correct because the underlying expectation of what a correct pronunciation is has changed. My assumption is that this happens because "Cal" is not a standard dictionary word, so there is room for deviation in what counts as a correct pronunciation.
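To make the adaptation concrete, here is a minimal sketch in plain Python (no Speech SDK required) that pulls the service's chosen phoneme and its NBest alternatives out of the first JSON fragment above:

```python
import json

# NBest entry for the second phoneme of "Cal", copied from the "kæl" result above.
result_json = """
{
    "Phoneme": "æ",
    "PronunciationAssessment": {
        "AccuracyScore": 100,
        "NBestPhonemes": [
            {"Phoneme": "æ", "Score": 100},
            {"Phoneme": "t", "Score": 94},
            {"Phoneme": "aʊ", "Score": 79},
            {"Phoneme": "k", "Score": 52},
            {"Phoneme": "l", "Score": 18}
        ]
    }
}
"""

entry = json.loads(result_json)
chosen = entry["Phoneme"]  # the phoneme the service settled on for this slot
nbest = {c["Phoneme"]: c["Score"]
         for c in entry["PronunciationAssessment"]["NBestPhonemes"]}

print(chosen, entry["PronunciationAssessment"]["AccuracyScore"])  # æ 100
print(nbest.get("eɪ"))  # None: "eɪ" never made the NBest list for this audio
```

Note that "eɪ" does not appear in the list at all for the "kæl" recording, so there is currently nothing to score a caller's own expectation against unless it happens to surface in the NBest candidates.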

Let's try a more realistic example using the word "bass", which has two valid pronunciations.

Azure Pronunciation Feature Request - Bass.zip

Here are the results:

bæs.wav bass : 100 b 100 æ 100 s 100

{
  "Phoneme": "æ",
  "PronunciationAssessment": {
      "AccuracyScore": 100,
      "NBestPhonemes": [
          {
              "Phoneme": "æ",
              "Score": 100
          },
          {
              "Phoneme": "ɑ",
              "Score": 21
          },
          {
              "Phoneme": "ə",
              "Score": 15
          },
          {
              "Phoneme": "ɛ",
              "Score": 14
          },
          {
              "Phoneme": "s",
              "Score": 12
          }
      ]
  },
  "Offset": 3500000,
  "Duration": 1500000
},

bes.wav bass : 100 b 100 eɪ 100 s 100

{
    "Phoneme": "eɪ",
    "PronunciationAssessment": {
        "AccuracyScore": 100,
        "NBestPhonemes": [
            {
                "Phoneme": "eɪ",
                "Score": 100
            },
            {
                "Phoneme": "b",
                "Score": 21
            },
            {
                "Phoneme": "s",
                "Score": 17
            },
            {
                "Phoneme": "æ",
                "Score": 5
            },
            {
                "Phoneme": "i",
                "Score": 3
            }
        ]
    },
    "Offset": 3600000,
    "Duration": 1700000
},

In both cases, the score adapts to the pronunciation.
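Until a first-class feature exists, one partial client-side workaround (assuming nBestPhonemeCount is set high enough that the intended phoneme appears in the list) is to re-score each position against a caller-supplied target phoneme instead of the service's adaptive choice. This is only a sketch; `score_against_target` is a hypothetical helper name, not an SDK API:

```python
def score_against_target(phoneme_result: dict, target: str) -> int:
    """Return the NBest score for the phoneme we *wanted*, rather than the
    one the service adapted to. Returns 0 if the target phoneme did not
    make the NBest list at all."""
    nbest = phoneme_result["PronunciationAssessment"]["NBestPhonemes"]
    for candidate in nbest:
        if candidate["Phoneme"] == target:
            return candidate["Score"]
    return 0

# Vowel slot from the bes.wav result above.
bes_vowel = {
    "Phoneme": "eɪ",
    "PronunciationAssessment": {
        "AccuracyScore": 100,
        "NBestPhonemes": [
            {"Phoneme": "eɪ", "Score": 100},
            {"Phoneme": "b", "Score": 21},
            {"Phoneme": "s", "Score": 17},
            {"Phoneme": "æ", "Score": 5},
            {"Phoneme": "i", "Score": 3},
        ],
    },
}

# If we expected the "bæs" pronunciation, this recording should score low:
print(score_against_target(bes_vowel, "æ"))   # 5, not the adaptive 100
print(score_against_target(bes_vowel, "eɪ"))  # 100
```

The obvious limitation, visible in the "Cal" example, is that the intended phoneme may be missing from the NBest list entirely, in which case this approach can only report 0 rather than a graded score.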

Describe the solution you'd like

What I would like to do is provide both a reference text and an expected pronunciation. For "bass" I could supply either "bæs" or "bes" (in IPA; the symbols would differ with SAPI). The pronunciation assessment would then be evaluated against my provided pronunciation instead of adapting to the audio under test.

This would also be beneficial for nonsense words or domain-specific words that we want to assess against a specific, expected pronunciation.
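For reference, the assessment is currently configured from reference text alone. Below is a minimal sketch of the phoneme-level configuration JSON (field names as documented for Pronunciation Assessment; the final `expectedPhonemes` field is purely hypothetical and only illustrates where the requested feature could slot in):

```python
import json

# Fields below match the documented Pronunciation Assessment config JSON.
config = {
    "referenceText": "bass",
    "gradingSystem": "HundredMark",
    "granularity": "Phoneme",
    "phonemeAlphabet": "IPA",   # IPA symbols, as in the results above
    "nBestPhonemeCount": 5,     # request 5 NBest candidates per phoneme
}

# Hypothetical addition this issue asks for -- NOT a real field today:
config["expectedPhonemes"] = "b eɪ s"

print(json.dumps(config, ensure_ascii=False))
```

With such a field, the service could anchor scoring to the caller's pronunciation rather than inferring one from the audio.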

wangkenpu commented 3 weeks ago

Hi @calebeno , thank you for your feedback regarding our services. I completely understand your request, and it’s valuable for domain-specific word assessment. Unfortunately, we do not currently support this feature. Our next-generation service is in development, and your request has been included in our roadmap. However, it’s challenging to determine when the new capability will be generally available.

calebeno commented 3 weeks ago

@wangkenpu Thank you for the reply! I am encouraged that it is on your roadmap. We are able to use Speech Studio for the time being with the service as-is. I look forward to enhancing our feature set when this is implemented.

github-actions[bot] commented 4 days ago

This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.