alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

How to get more "love" word into VOSK? #1184

Open jamesoliver1981 opened 1 year ago

jamesoliver1981 commented 1 year ago

Hi, weird title, I know. I'm trying to use VOSK on some tennis recordings where scores like "fifteen love" come up. Sadly, the model I am using is not great at picking up the "love" element, whether it comes before or after the other word. I have read that there are options to enhance word identification, but I don't know whether they will work here. (There are some docs on how to adjust this, but they looked a little beyond my capability, so I am posting this question first to get feedback.)

The reason I think this will NOT work is that I have built two VOSK models and simply changed the vocab. In the second, "love" is almost the only word in the custom dictionary, and there I can see that the point where it is picked up (the timestamp) falls in the middle of the prior word (i.e. "fifteen").

My screenshots are below.

Full grammar model output: "fifteen" is picked up between 23.82 and 24.208. [image]

Love grammar model output: "love" is picked up at 24.15 (i.e. in the middle of the above). [image]

My planned approach is to run the model twice, each time writing each word and the elements of its result entry into a table, so that I can reconstruct the phrase. The only challenge here is that it doubles the run time.
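
The two-pass idea described above could be sketched roughly as follows. This is only an illustration, not a tested pipeline: `merge_passes` is a hypothetical helper, and the word entries are hand-written to mirror the timestamps quoted in this post (the end time for "love" and the `conf` values are made up). The entries use the `word` / `start` / `end` / `conf` fields that VOSK reports per word.

```python
def merge_passes(*passes):
    """Combine word entries from several recognizer passes into one table,
    ordered by start time. Each pass is a (name, list_of_word_dicts) pair."""
    rows = []
    for pass_name, words in passes:
        for w in words:
            # Remember which pass produced the word, alongside its timing fields.
            rows.append({"source": pass_name, **w})
    rows.sort(key=lambda r: r["start"])
    return rows


# Hand-written stand-ins for the two model outputs (illustrative values only).
full_pass = [{"word": "fifteen", "start": 23.82, "end": 24.208, "conf": 1.0}]
love_pass = [{"word": "love", "start": 24.15, "end": 24.45, "conf": 1.0}]

table = merge_passes(("full", full_pass), ("love", love_pass))
phrase = " ".join(r["word"] for r in table)
print(phrase)
```

Ordering by start time is the simplest choice. It does not resolve overlaps such as "love" starting before "fifteen" has ended, so a real merge would probably need a rule for that case.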

My question is whether enhancing the language model, using a specific grammar, or increasing word probabilities will help resolve this issue. I have the same issue with "fifteen all", and there my workaround doesn't work because neither "all" nor a soundalike gets picked up by a separate model.

I can provide example sound clips if that helps you help me.

My code:

import json
import os
import pprint
import wave

from vosk import KaldiRecognizer, Model

def text_from_audio_v3(path, file, lang, location):
    os.chdir("D:/OneDrive/DataSci/Tennis/02_Preprocessing/Voice/VSOK/" + location)
    wf = wave.open(path + file, "rb")
    model = Model("model")

    if lang == "English":
#         rec = KaldiRecognizer(model, wf.getframerate(), '["love", "fifteen","thirty","forty","deuce", "mistake","winner","double", "fault", "second", "serve", "let","advantage", "my"]')
#         rec = KaldiRecognizer(model, wf.getframerate(), 
#                 '["love", "fifteen","thirty","forty","deuce", "mistake","winner","double fault", "second serve", "let"," my advantage","your advantage","all","game"]')
        rec = KaldiRecognizer(model, wf.getframerate(), 
                '["fifteen","love", "thirty","or","all", "forty", "deuce","juice", "game","mistake","winner","forced","second", "my advantage","your advantage", "his advantage"  ]')
# '["love","all", "or" ]')
#                                                             '["love", "fifteen","thirty","forty","deuce", "mistake","winner","forced","second", "my advantage","your advantage", "his advantage","all","each","game"  ]')
#     rec = KaldiRecognizer(model, wf.getframerate(), '["eins", "null", "fehler"]')
#     rec = KaldiRecognizer(model, wf.getframerate(), '["second", "serve", "love", "fifteen", "mistake", "thirty", "winner"]')
    results = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            print(rec.Result())

    results.append(json.loads(rec.FinalResult())['text'])
    pprint.pprint(results)
    return results

nshmyrev commented 1 year ago

You need to rebuild the graph. See https://alphacephei.com/vosk/lm

nshmyrev commented 1 year ago

And it helps if you provide audio files

jamesoliver1981 commented 1 year ago

The audio files can be found in this link
Files with the suffix "love" are the "love" examples, and those with "all" are the "all" examples.

jamesoliver1981 commented 1 year ago

There are many elements of the graph step that I don't understand, so I will come back to that in a second.
This is a JSON(ish) question: I am trying to read the breakdown of results to get the timing and probability of each word. If I remove "text" here, I get the full result. If I replace it with "result", I get the error "string indices must be integers".

PS I absolutely love this tool and fully appreciate your help in helping me use it correctly

    results = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            print(rec.Result())

    results.append(json.loads(rec.FinalResult())['text'])
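
For reference, a minimal sketch of reading the per-word breakdown. It assumes word output has been switched on with `rec.SetWords(True)`, and that the "string indices must be integers" error comes from indexing the raw JSON string instead of the parsed dict, which is only one possible cause. `word_rows` is a hypothetical helper, and the sample JSON is hand-written in the shape VOSK returns, not real recognizer output.

```python
import json


def word_rows(result_json):
    """Turn one VOSK result string into (word, start, end, conf) tuples."""
    payload = json.loads(result_json)  # parse first: the recognizer returns a str
    # "result" may be absent (word output not enabled, or nothing recognized),
    # so fall back to an empty list rather than raising KeyError.
    return [(w["word"], w["start"], w["end"], w["conf"])
            for w in payload.get("result", [])]


# With a real recognizer this would look like (not run here):
#   rec.SetWords(True)
#   ...
#   rows = word_rows(rec.Result())

# Hand-written sample in the same shape, for illustration only.
sample = json.dumps({
    "result": [{"conf": 1.0, "start": 23.82, "end": 24.208, "word": "fifteen"}],
    "text": "fifteen",
})
rows = word_rows(sample)
print(rows)
```

Collecting these rows for every `rec.Result()` inside the loop, as well as for `rec.FinalResult()`, would keep the timings for the whole file and not only the last segment.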

jamesoliver1981 commented 1 year ago

Re rebuilding the graph: which element in the link you shared are you suggesting I work with? There is no element that specifically says "rebuild the graph". Sorry if this is a dumb question, but I don't see it.