alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

How to use setGrammar() properly #1617

Closed Barkerww closed 3 months ago

Barkerww commented 3 months ago

I'm trying to create a list of words that can be combined to form the sentences I want. Part of the word lists is shown below:

private fun addUnit(): List<String> {
    return listOf(
        "a",
        "percent",
        "the",
        "minute",
        "minutes",
        "seconds",
        "second"
    )
}

private fun addPrep(): List<String> {
    return listOf(
        "to",
        "with",
        "for"
    )
}

However, I've found that using this list of words can cause issues for the recognizer. Specifically:

  1. The recognizer may struggle to distinguish homophones (words that sound the same but have different meanings) when using this list, such as "two" and "to".
  2. The recognizer may not recognize the pronunciation of the word "a" that I expect (/æ/ rather than /ə/).

I'm looking for a better approach to address these issues when using the setGrammar() method. Specifically, I'd like to:

  1. Improve the recognizer's ability to identify the most likely interpretation when it encounters homophones.
  2. Enhance the recognizer's ability to accurately recognize single-character words like "a".

The example results I'm seeing are:

// The result I want is "three two one"
recog result: {
    "alternatives": [
        {
            "confidence": 86.934860,
            "text": "one to three"
        },
        {
            "confidence": 86.554771,
            "text": "want to three"
        },
        {
            "confidence": 86.311020,
            "text": "one two three"
        }
    ]
}

recog result: {
    "alternatives": [
        {
            "confidence": 163.708984,
            "text": "to three one"
        },
        {
            "confidence": 163.085144,
            "text": "two three one"
        },
        {
            "confidence": 162.654785,
            "text": "to three want"
        }
    ]
}

Can you please suggest any techniques or approaches that could help me address these issues and improve the recognition accuracy? Many Thanks!!

nshmyrev commented 3 months ago

It looks like you are using a big model, which doesn't support grammars.

Barkerww commented 3 months ago

Oh sorry, I forgot to provide the details. The model I used is "vosk-model-small-en-us-0.15" on an Android 11 device.

nshmyrev commented 3 months ago

And what is your grammar in JSON form? "to" should probably not be there.

Barkerww commented 3 months ago

It looks like this: [ "a", "percent", "the", "minute", "minutes", "seconds", "second", "one", "two", "three", "four", "five", "to", "with", .... ] Is it bad practice to write the JSON string like this?

nshmyrev commented 3 months ago

You need to use phrases in the grammar, not separate words. "to" should be in context.
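
As a rough illustration of the phrase-based approach, here is a minimal Kotlin sketch that builds a grammar string from whole phrases instead of single words. The helper names `toGrammarJson` and `buildPhrases` and the phrase templates are hypothetical, chosen only from words mentioned in this thread. The sketch assumes the resulting string is what you would hand to `setGrammar()` or to the `Recognizer` constructor that takes a grammar argument.

```kotlin
// Hypothetical helper: serialises phrases into a JSON array string.
// Only backslashes and double quotes are escaped, which is enough for
// plain lowercase phrases like the ones used here.
fun toGrammarJson(phrases: List<String>): String =
    phrases.joinToString(prefix = "[", postfix = "]", separator = ", ") { phrase ->
        "\"" + phrase.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
    }

// Hypothetical phrase templates: "to" and "for" appear only inside longer
// phrases, so they are never listed as stand-alone grammar entries.
fun buildPhrases(): List<String> {
    val numbers = listOf("one", "two", "three", "four", "five")
    val durations = numbers.map { n ->
        if (n == "one") "for one minute" else "for $n minutes"
    }
    val levels = numbers.map { n -> "to $n percent" }
    // One phrase holding the number words, then the contextual phrases,
    // then "[unk]" for speech outside the grammar.
    return listOf(numbers.joinToString(" ")) + durations + levels + "[unk]"
}

fun main() {
    val grammar = toGrammarJson(buildPhrases())
    println(grammar)
    // Assumption: on Android this string would then be passed to the
    // recognizer, e.g. recognizer.setGrammar(grammar).
}
```

Because "to" only occurs in front of a number and "percent", the recognizer has context to separate it from "two". Whether this fully removes the confusion depends on the model and the audio.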

Barkerww commented 3 months ago

Great!! Thanks for the reply! Another question is about the pronunciation of "a": is there any way to let the model distinguish /æ/ and /ə/ in different situations?

nshmyrev commented 3 months ago

You can modify the model vocabulary as described in https://alphacephei.com/vosk/lm and introduce two words: a_1 with pronunciation AH and a_2 with pronunciation AE. After that you can use the new words in the grammar.
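
If the vocabulary is extended this way, the recognized text would presumably contain the variant tokens a_1 and a_2 rather than "a". A minimal Kotlin sketch for mapping them back after recognition might look like the following. The `variantToWord` map and the `normalizeVariants` helper are hypothetical and not part of the Vosk API.

```kotlin
// Hypothetical mapping from pronunciation-variant tokens back to the
// plain word they stand for.
val variantToWord = mapOf("a_1" to "a", "a_2" to "a")

// Replaces each variant token in a space-separated result string with its
// plain word; all other tokens are left unchanged.
fun normalizeVariants(text: String): String =
    text.split(" ")
        .filter { it.isNotEmpty() }
        .joinToString(" ") { token -> variantToWord[token] ?: token }

fun main() {
    println(normalizeVariants("for a_2 minute")) // prints "for a minute"
}
```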

Barkerww commented 3 months ago

Thank you so much, @nshmyrev. I'll start figuring out how to add the pronunciations to the model.