Cyborgscode / Personal-Voice-Assistent

Building a fully featured and localized voice assistant for Linux

Feature Request - Proposed keyword optimization Improvement to pva.py and start.sh #9

Open raithel opened 2 years ago

raithel commented 2 years ago

Hello, I really love this project and the work that you guys are doing here!

I would like to propose an optimization to the pva.py file in this project.

I downloaded and set up PVA on my machine and noticed that, in noisy environments, the Vosk speech recognition starts to bog down my machine (it seems to pin one of my CPU cores at nearly 100%).

When this happens, recognition can take from 30 seconds to over a minute. This seems to be because Vosk tries to make sense of all audio at all times.

So I propose that a change be made to the pva.py, and the start.sh files.

My apologies if I am presenting this in an incorrect format; I am relatively new to using Git.

First a new keyword flag should be added to the pva.py file.

parser.add_argument(
    '-k', '--keyword', type=str, help='keyword')

Then in the start.sh file, grab the keyword from the config file and pass it to pva.py:

KEYWORD=$(grep "^conf:.*keyword" /etc/pva/conf.d/* $HOME/.config/pva/conf.d/* $HOME/.config/pva/pva.conf | tail -n 1 | sed -e "s/^.*,//g" -e "s/\"//g")
### .......... rest of code in start.sh file ----------- ###
./pva.py -m $MODEL -k $KEYWORD >> $HOME/.var/log/pva.log

This keyword flag would be used for a second Vosk recognizer instance that is initialized to recognize only the specified keyword. This drastically reduces CPU overhead, since the recognizer only listens for that one word.

kywrd_rec = vosk.KaldiRecognizer(model, args.samplerate, '[ "'+keyword+'", "[unk]" ]')
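As a side note, the grammar argument here is just a JSON list, so building it with json.dumps avoids quoting problems if a keyword ever contains special characters. A minimal sketch, assuming the same model, args, and keyword variables as above:

import json
import vosk

# Restrict the recognizer to the keyword plus the unknown-word token "[unk]".
grammar = json.dumps([keyword, "[unk]"])
kywrd_rec = vosk.KaldiRecognizer(model, args.samplerate, grammar)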

This should then run in a loop surrounding the primary audio grab loop.

while True:
    data2 = q.get()
    if kywrd_rec.AcceptWaveform(data2):
        if kywrd_rec.Result()[14:-3] == keyword:
            print("[+] Keyword of ["+keyword+"] recognized!")
            sys.stdout.flush()
            ## send a random response to a TTS engine, so you know when you can give commands to pva.
            os.system("espeak '"+response_lst[randint(0,(len(response_lst) -1))]+"'")
            ## Begin full VTS
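Note that the [14:-3] slice depends on the exact layout of the JSON string Result() returns, which makes it brittle. Parsing the result with the json module is more robust. A minimal sketch, assuming the same kywrd_rec and keyword as above:

import json

# Result() returns a JSON string such as {"text": "carola"}.
res = json.loads(kywrd_rec.Result())
if res.get("text") == keyword:
    print("[+] Keyword of ["+keyword+"] recognized!")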

That way, Vosk only does full speech-to-text recognition after the keyword has been identified, reserving processing power for when actual commands are relayed via Vosk.

Here is a working proof of concept that I am currently using on my machine. NOTE: I implemented a quick hack to allow pva.py to provide auditory feedback via espeak, but this can be changed to use a different feedback method or TTS engine if needed.

pva.py:

#!/usr/bin/env python3

# This script is modified for PVA usage;
# it was originally shipped with the VOSK package from Alpha Cephei itself.
#
# Modified to have more efficient keyword recognition.
# Date: 2022-07-13
#
# TODO:
#  DONE - Add flag that accepts a keyword as an argument.
#  DONE - have start.sh pass the keyword into pva.py as a flag.
#  - Maybe have path to TTS program passed to pva.py as a flag instead of using straight espeak.
#
#

import argparse
import os
import queue
import sounddevice as sd
import vosk
import sys
### used for randomized feedback responses.
from random import randint

q = queue.Queue()
#keyword = "computer"

def int_or_str(text):
    """Helper function for argument parsing."""
    try:
        return int(text)
    except ValueError:
        return text

def callback(indata, frames, time, status):
    """This is called (from a separate thread) for each audio block."""
    if status:
        print(status, file=sys.stderr)
    q.put(bytes(indata))

parser = argparse.ArgumentParser(add_help=False)
parser.add_argument(
    '-l', '--list-devices', action='store_true',
    help='show list of audio devices and exit')
args, remaining = parser.parse_known_args()
if args.list_devices:
    print(sd.query_devices())
    parser.exit(0)
parser = argparse.ArgumentParser(
    description=__doc__,
    formatter_class=argparse.RawDescriptionHelpFormatter,
    parents=[parser])
parser.add_argument(
    '-m', '--model', type=str, metavar='MODEL_PATH',
    help='Path to the model')
parser.add_argument(
    '-d', '--device', type=int_or_str,
    help='input device (numeric ID or substring)')
parser.add_argument(
    '-r', '--samplerate', type=int, help='sampling rate')
## Added in flag to accept keyword from start.sh file
parser.add_argument(
    '-k', '--keyword', type=str, help='keyword')
args = parser.parse_args(remaining)

try:
    if args.model is None:
        args.model = "model"
    if not os.path.exists(args.model):
        print ("Please download a model for your language from https://alphacephei.com/vosk/models")
        print ("and unpack as 'model' in the current folder.")
        parser.exit(0)
    if args.samplerate is None:
        device_info = sd.query_devices(args.device, 'input')
        # soundfile expects an int, sounddevice provides a float:
        args.samplerate = int(device_info['default_samplerate'])
    if args.keyword is None:
        print("[!] No keyword specified, defaulting to carola as keyword!")
        args.keyword = "carola"

    keyword = args.keyword
    model = vosk.Model(args.model)

    with sd.RawInputStream(samplerate=args.samplerate, blocksize = 8000, device=args.device, dtype='int16',
                            channels=1, callback=callback):
            print('#' * 80)
            print('Press Ctrl+C to stop the recording')
            print('#' * 80)

            ## List of randomized responses to be used as feedback.
            ## That way you know when full VTS is available to say a command.
            response_lst = ["Yes sir?","Yes?","Your command?","what is it?"]

            rec = vosk.KaldiRecognizer(model, args.samplerate)

            kywrd_rec = vosk.KaldiRecognizer(model, args.samplerate, '[ "'+keyword+'", "[unk]" ]')
            while True:
                data2 = q.get()
                if kywrd_rec.AcceptWaveform(data2):
                    if kywrd_rec.Result()[14:-3] == keyword:
                        print("[+] Keyword of ["+keyword+"] recognized!")
                        sys.stdout.flush()
                        ## send a random response to a TTS engine, so you know when vosk is ready to receive full commands for pva.
                        os.system("espeak '"+response_lst[randint(0,(len(response_lst) -1))]+"'")
                        ## Begin full VTS
                        while True:
                            data = q.get()
                            if rec.AcceptWaveform(data):
                                # instead of printing the final sentence, we give the Result() JSON to pva
                                ## because we are identifying the keyword separately, we need to re-add it to the JSON for pva to use.
                                str = rec.Result().replace(' : "',' : "'+keyword+' ');
                                os.system( "java PVA '"+ str.replace("'","")  +"'");
                                break
                    else:
                        print(" [!] Got voice data, but keyword was not recognized!")
                        sys.stdout.flush()

except KeyboardInterrupt:
    print('\nDone')
    parser.exit(0)
except Exception as e:
    parser.exit(type(e).__name__ + ': ' + str(e))

start.sh:

#!/bin/bash

if [ "$USER" == "root" ]; then
        if [ ! -f /root/.config/pva.root.overwrite ]; then
                echo "please, do not run this as root, it does only work for desktopsessions!"
                exit;
        fi
fi

cd /usr/share/pva/

DATE=$(date -R)

echo "PVA starting $DATE" >> $HOME/.var/log/pva.log

if [ ! -e $HOME/.config/pva ]; then

        mkdir -p $HOME/.config/pva/conf.d

fi

if [ ! -e $HOME/.cache/pva ]; then

        mkdir -p $HOME/.cache/pva/audio

fi

if [ ! -f $HOME/.config/pva/conf.d/02-paths.conf ]; then

        echo "path:\"video\",\"$HOME/Videos\""    > $HOME/.config/pva/conf.d/02-paths.conf
        echo "path:\"pics\",\"$HOME/Pictures\""    >> $HOME/.config/pva/conf.d/02-paths.conf
        echo "path:\"music\",\"$HOME/Music\""    >> $HOME/.config/pva/conf.d/02-paths.conf
        echo "path:\"docs\",\"$HOME/Documents\"" >> $HOME/.config/pva/conf.d/02-paths.conf
fi

if [ ! -e $HOME/.var/log ]; then

        mkdir -p $HOME/.var/log

fi

MLANG=$(grep "^conf:.*lang_short" /etc/pva/conf.d/* $HOME/.config/pva/conf.d/* $HOME/.config/pva/pva.conf | tail -n 1 | sed -e "s/^.*,//g" -e "s/\"//g")
KEYWORD=$(grep "^conf:.*keyword" /etc/pva/conf.d/* $HOME/.config/pva/conf.d/* $HOME/.config/pva/pva.conf | tail -n 1 | sed -e "s/^.*,//g" -e "s/\"//g")
MODEL=$(ls -d *-$MLANG-*);

GREETING=$(grep "^conf:.*greeting" /etc/pva/conf.d/* $HOME/.config/pva/conf.d/* | tail -n 1 | sed -e "s/^.*,//g" -e "s/\"//g")
if [ "$GREETING" != "" ]; then
        SAY=$(grep say $HOME/.config/pva/pva.conf| sed -e "s/^.*,//g" -e "s/\"//g" -e "s/%VOICE/$MLANG/g" -e "s/x:x/ /g")

        if [ "$SAY" == "" ]; then
                SAY="/usr/local/sbin/say"
        fi
        $SAY "$GREETING"
fi

./pva.py -m $MODEL -k $KEYWORD >> $HOME/.var/log/pva.log

If you have any questions for me, or need any additional information on this matter, please feel free to contact me!

Thank you for your time and work on this awesome project!

Cyborgscode commented 2 years ago

Vosk runs at ~60% max on a Pinephone with a 4-core ARM CPU with the small-de model, so either you are running Vosk on an even less powerful CPU, or you use the large model. In both cases, I suggest addressing it at the hardware level.

Your solution has one major design flaw: it partly disables functionality ;)

"What is your name" won't work anymore, same as all of the reaction stuff and this includes the ability to answere the phone, which needs to have pre-supplied reactions, because callers do not know how to address the bot they are calling ;)

I suggest either:

But in no way will I limit this PVA to only keyword-based interactions, as I will get killed by my family if I do so :D They love to talk to their PCs :)

BTW: the main config at /etc/pva/conf.d/01-default.conf explicitly sets "carola" as the keyword, so you don't need to hardcode it in pva.py if you handle it correctly in "pva".

raithel commented 2 years ago

Thanks for the reply!

Here are the current specs that I am running PVA on:

OS: Ubuntu 18.04.6 LTS
vosk: 0.3.42
Language model: vosk-model-en-us-0.22-lgraph (128 MB model)

Hardware: ThinkPad E560
CPU: Intel Core i7-6500U (dual core with hyper-threading)
GPU: discrete AMD Radeon R7 M360

When running Vosk in separate tests, I get similar results, with Vosk getting bogged down by background noise (AC or a fan running in the background, etc.), even when I use the smallest 40 MB English model.

When it gets bogged down, one of my CPU cores gets pinned at 85%, and it can take from tens of seconds to minutes for the recognition to "catch up". This delay seems to accumulate as more time passes.

All that being said, I would hate to take away the ability to casually talk to your computer! ; )

However, I think having this functionality as an option is extremely useful, especially for laptops or devices used in noisy environments. I guess there is a reason smartphones and other voice-activated devices use similar keyword functionality for their voice assistants. : )

I believe the best option would be to have the two loops: one keyword-based loop and one conversational loop. The respective loop could be selected by something like the -k argument.

That way you would be able to toggle between conversational interaction and keyword interaction at startup.

Maybe adding a "hibernate" option to /etc/pva/conf.d/01-default.conf for toggling would make sense for this kind of functionality?

something like:

# toggle hibernating PVA until keyword is used. 
conf:"hibernate","yes"

where the config default would be no.

This could then be used by start.sh and passed to pva.py as a flag to allow for keyword functionality to be toggled at startup.
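(If implemented that way, the flag itself would be a one-liner in pva.py's argparse setup. A hypothetical sketch:)

## hypothetical: start in keyword-only mode when --hibernate is passed
parser.add_argument(
    '--hibernate', action='store_true',
    help='hibernate until the keyword is recognized')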

I will see if I can provide a patch for pva.py, start.sh, and /etc/pva/conf.d/01-default.conf that allows you to toggle this functionality using a flag set in the conf file.

With regards to hard-coding "carola" into pva.py: my apologies, I had intended to remove that from the version I pasted into my comment. I had put it there as a placeholder for logic to be used if the -k flag was not passed to pva.py. : )

Thanks again for your time and effort on this project!

Cyborgscode commented 2 years ago

I have a Surface Pro 4 with a 6th-gen i7 running into the same kind of problem. It also drains the battery, so I disabled PVA on that device too. See https://github.com/alphacep/vosk-api/issues/1059 and please add your sightings there; hopefully it gives them something to think about.

Regarding the config option:

That only works on (re)start of the pva script... hey.. wait.. it could technically also work at runtime, now that we have a pva task and an STT task. My Python is terrible; do you think you can rebuild pva.py into the two-loop version, depending on the option in

/etc/pva/conf.d/01-default.conf, overwritten by ~/.config/pva/conf.d/*.conf, overwritten by ~/.config/pva/pva.conf

This will work fine:

# toggle hibernating PVA until keyword is used. 
conf:"hibernate","yes"

I will add the needed other config options and functionality.

Cyborgscode commented 2 years ago

Add a watch inside the loop for changes in ~/.config/pva/pva.conf. That way I don't need to kill and restart the Python script.
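(One dependency-free way to do that is to poll the file's modification time from inside the loop. A minimal sketch:)

import os

CONF = os.path.expanduser("~/.config/pva/pva.conf")
last_mtime = os.stat(CONF).st_mtime

# ... inside the recognition loop:
mtime = os.stat(CONF).st_mtime
if mtime != last_mtime:
    last_mtime = mtime
    # re-read pva.conf here and react if the "hibernate" value changed

The proof of concept later in this thread uses exactly this st_mtime check.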

Cyborgscode commented 2 years ago

pva code & config prepared, it's up to you:


readconfig()

shallweend = false;
while ( ! shallweend ) {
       if ( hibernate ) {
                 init vosk (keyword-only recognizer)
                 while( true ) {
                       handle vosk
                       read in pva.conf -> hibernate
                       onchange -> break
                 }
                 close vosk
        } else {
                 init vosk (full recognizer)
                 while( true ) {
                       handle vosk
                       read in pva.conf -> hibernate
                       onchange -> break
                 }
                 close vosk
       }
}
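(In Python, that outline might translate roughly as follows. This is only a sketch; the stand-in functions are hypothetical placeholders for the real code:)

# hypothetical stand-ins for the real implementations:
def read_config_hibernate(): return False
def init_vosk(keyword_only): return None
def handle_vosk(rec): pass
def hibernate_changed(): return True
def close_vosk(rec): pass

def main_loop():
    """Sketch of the two-loop outline above."""
    shall_we_end = False
    while not shall_we_end:
        hibernate = read_config_hibernate()      # which mode are we in?
        rec = init_vosk(keyword_only=hibernate)  # keyword-only or full recognizer
        while True:
            handle_vosk(rec)                     # process one block of audio
            if hibernate_changed():              # pva.conf value flipped
                break                            # leave inner loop, re-init vosk
        close_vosk(rec)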
raithel commented 2 years ago

Wow, this looks great! I think that logic should work! :)

I am leaving for a weekend trip today, so I should be able to implement something on Monday. I'm pretty comfortable with Python, so this should not be too bad to implement. ; )

With regards to the performance issues, I will see if I can post my findings at the link that you mentioned (alphacep/vosk-api#1059). Thanks for bringing that open issue to my attention. : )

Finally, I have some minor proposals for a few other simple features. What would be the best way for me to propose these additional features?

I have already implemented some of these features in my personal install of PVA, but I think they could be useful to other people using PVA (like adding support for PVA to press keyboard shortcuts, scroll the mouse, etc.).

Thank you so much for your time and patience with me on this matter!

Cyborgscode commented 2 years ago

Open Issues as "Request: ..."

And for what purposes do you need such functionality? Maybe you want to check out the "Deepin" Desktop Environment?

https://www.deepin.org/en/

AFAIK, they have deep integration of apps into the desktop itself, to support e.g. assistants interacting with everything. I never tested it, but I have heard interesting ideas, especially if we want to integrate PVA into handicapped people's lives.

raithel commented 2 years ago

Okay, I think I have something that provides a basic proof of concept for switching between keyword-based recognition and full recognition on the fly. The example below watches the pva.conf file in your $HOME/.config/pva/ directory; if pva.conf gets modified, it checks whether the conf:"hibernate","no" value has been changed. If it has, it terminates the current execution loop function and switches to the other specified loop function.

Give the code below a try and let me know if it works for you. If not, let me know about any issues you run into with the pva.py code I have provided.

#!/usr/bin/env python3

# This script is modified for PVA usage;
# it was originally shipped with the VOSK package from Alpha Cephei itself.
#
# Modified to have more efficient keyword recognition.
# Date: 2022-07-13
#
# Modified to allow live toggling between the keyword recognition and full recognition loops.
# Date: 2022-07-19
#
# TODO:
#  DONE - Add flag that accepts a keyword as an argument.
#  DONE - have start.sh pass the keyword into pva.py as a flag.
#  - Maybe have path to TTS program passed to pva.py as a flag instead of using straight espeak.
#
#

import argparse
import os
import queue
import sounddevice as sd
import vosk
import sys
### used for randomized feedback responses.
from random import randint

q = queue.Queue()
#keyword = "computer"

### Specify the config file that pva.py will watch for changes.
HOME_DIR = os.environ.get("HOME")
config_file_path = HOME_DIR+"/.config/pva/pva.conf"

def int_or_str(text):
    """Helper function for argument parsing."""
    try:
        return int(text)
    except ValueError:
        return text

def callback(indata, frames, time, status):
    """This is called (from a separate thread) for each audio block."""
    if status:
        print(status, file=sys.stderr)
    q.put(bytes(indata))

parser = argparse.ArgumentParser(add_help=False)
parser.add_argument(
    '-l', '--list-devices', action='store_true',
    help='show list of audio devices and exit')
args, remaining = parser.parse_known_args()
if args.list_devices:
    print(sd.query_devices())
    parser.exit(0)
parser = argparse.ArgumentParser(
    description=__doc__,
    formatter_class=argparse.RawDescriptionHelpFormatter,
    parents=[parser])
parser.add_argument(
    '-m', '--model', type=str, metavar='MODEL_PATH',
    help='Path to the model')
parser.add_argument(
    '-d', '--device', type=int_or_str,
    help='input device (numeric ID or substring)')
parser.add_argument(
    '-r', '--samplerate', type=int, help='sampling rate')
## Added in flag to accept keyword from start.sh file
parser.add_argument(
    '-k', '--keyword', type=str, help='keyword')
args = parser.parse_args(remaining)

def check_config_hibernate(conf_file_path=config_file_path):
    """This function gets the status of the hibernate configuration option at the specified config file location. It will return True or False based on the option in the configured file."""
    try:
        file = open(conf_file_path,'rt')
        file_contents = file.readlines()
        file.close()
    except OSError:
        print("[!] Error! File at config location of: "+conf_file_path+" \n Could not be found or opened!")
        return False
    try:
        conf = [ l for l in file_contents if l.startswith('conf:') ]
        hibernate_status = [ l for l in conf if 'conf:"hibernate"' in l ][0].split(",")[1].strip('\n').strip('"').lower()
        if (hibernate_status == 'yes' or hibernate_status == 'true'):
            return True
        if (hibernate_status == 'no' or hibernate_status == 'false'):
            return False
    except:
        print('\n[!] could not find conf:"hibernate","no" parameter in '+conf_file_path+' defaulting to "hibernate","no"!\n')
        return False

try:
    if args.model is None:
        args.model = "model"
    if not os.path.exists(args.model):
        print ("Please download a model for your language from https://alphacephei.com/vosk/models")
        print ("and unpack as 'model' in the current folder.")
        parser.exit(0)
    if args.samplerate is None:
        device_info = sd.query_devices(args.device, 'input')
        # soundfile expects an int, sounddevice provides a float:
        args.samplerate = int(device_info['default_samplerate'])
    if args.keyword is None:
        print("[!] No keyword specified!")
#        args.keyword = "computer"

    keyword = args.keyword
    model = vosk.Model(args.model)

    with sd.RawInputStream(samplerate=args.samplerate, blocksize = 8000, device=args.device, dtype='int16',
                            channels=1, callback=callback):
            print('#' * 80)
            print('Press Ctrl+C to stop the recording')
            print('#' * 80)
            rec = vosk.KaldiRecognizer(model, args.samplerate)

            ## List of randomized responses to be used as feedback.
            ## That way you know when full VTS is available to say a command.
            response_lst = ["Yes sir?","Yes?","Your command?","what is it?"]

            kywrd_rec = vosk.KaldiRecognizer(model, args.samplerate, '[ "'+keyword+'", "[unk]" ]')

            def full_speech_recognition():
                """This function runs the speech recognition algorithim"""
                print("[+] Initalized full Speech recognition loop function.")
                sys.stdout.flush()
                config_mod_time = os.stat(config_file_path).st_mtime
                while True:
                    data = q.get()
                    if rec.AcceptWaveform(data):
                        # instead of printing the final sentence, we give the Result() JSON to pva
                        ## the keyword is only stripped in the keyword loop, so nothing needs to be re-added here
                        str = rec.Result() #.replace(' : "',' : "'+keyword+' ');
                        os.system( "java PVA '"+ str.replace("'","")  +"'");

                    if config_mod_time != os.stat(config_file_path).st_mtime:
                        print("[!] config has been modified! now checking if hibernate parameter has been changed.")
                        sys.stdout.flush()
                        if check_config_hibernate(config_file_path):
                            print(' [!!] conf:"hibernate" parameter has been set to yes! now switching to keyword recognition loop!!')
                            sys.stdout.flush()
                            break
                        else:
                            print(' [=] conf:"hibernate" parameter is still no, continuing full speech recognition model.')
                            sys.stdout.flush()
                            config_mod_time = os.stat(config_file_path).st_mtime

            def keyword_optimized_recognition(keyword):
                """Listens for only specified keyword, once keyword is recognized, then full recognition is initiated."""
                print("[+] Initialized keyword Speech recognition loop function.")
                sys.stdout.flush()
                ## List of randomized responses to be used as feedback.
                ## That way you know when full VTS is available to say a command.
                response_lst = ["Yes sir?","Yes?","Your command?","what is it?"]

                config_mod_time = os.stat(config_file_path).st_mtime
                while True:
                    data2 = q.get()
                    if kywrd_rec.AcceptWaveform(data2):
                        if kywrd_rec.Result()[14:-3] == keyword:
                            print("[+] Keyword of ["+keyword+"] recognized!")
                            sys.stdout.flush()
                            ## send a random response to a TTS engine, so you know when you can give commands to pva.
                            os.system("espeak '"+response_lst[randint(0,(len(response_lst) -1))]+"'")
                            ## Begin full VTS
                            while True:
                                data = q.get()
                                if rec.AcceptWaveform(data):
                                    # instead of printing the final sentence, we give the Result() JSON to pva
                                    ## because we identify the keyword separately, we re-add it to the JSON for pva to use.
                                    str = rec.Result().replace(' : "',' : "'+keyword+' ');
                                    os.system( "java PVA '"+ str.replace("'","")  +"'");
                                    break
                        else:
                            print(" [!] Got voice data, but keyword was not recognized!")
                            sys.stdout.flush()

                    if config_mod_time != os.stat(config_file_path).st_mtime:
                        print("[!] config has been modified! now checking if hibernate parameter has been changed.")
                        sys.stdout.flush()
                        if check_config_hibernate():
                            print(' [=] conf:"hibernate" parameter is still yes, continuing keyword voice recognition.')
                            sys.stdout.flush()
                            config_mod_time = os.stat(config_file_path).st_mtime
                        else:
                            print(' [!!] conf:"hibernate" parameter has been set to no, toggling to full speech recognition model.')
                            sys.stdout.flush()
                            break

            while True:
                if check_config_hibernate(config_file_path):
                    print("[+] Hibernation option in config has been set to yes. Now starting keyword optimized based recognition.")
                    sys.stdout.flush()
                    keyword_optimized_recognition(keyword)
                else:
                    print("[+] Hibernation option in config has been set to no. Now starting full recognition.")
                    sys.stdout.flush()
                    full_speech_recognition()

except KeyboardInterrupt:
    print('\nDone')
    parser.exit(0)
except Exception as e:
    parser.exit(type(e).__name__ + ': ' + str(e))

Hopefully this example provides the solution that you are looking for. If not, please feel free to let me know what kind of revisions you would like me to make.

As always, thank you for your time and patience with me on this matter!

Cyborgscode commented 2 years ago

Why did you add your own responses into the loop code? This means everyone would need to listen to espeak; it's always English, it's not optional, and it's outdated already, as we don't use

os.system( "java PVA '"+ str.replace("'","") +"'");

anymore. BTW, it's not helpful for the user if your loop response says "Yes, sir" and the final result is "I have no idea what you said". It's best to remove this from the pva.py loops completely; they are simply the supplier of the spoken word stream, nothing more.

Cyborgscode commented 2 years ago

Any progress?