Information gathering for voice assistant implementation via porcupine and vosk

hillbicks commented 2 years ago

Another issue from me!

I thought it would be a good idea to gather all the information that is needed for an offline voice assistant with this project using the LX06 as hardware. Information from this issue could then be the basis for a more detailed HowTo guide.

Here is what I figured out so far: 1) porcupine (wake word) detector is already implemented and working. the script porcupine_launcher is launched as a service. by default, porcupine listens to the wakeword "alexa" and stores the recorded speech in /tmp/stt.wav 2) notify sound file missing /usr/share/sound/wakeup2.mp3 3) Mute button already working, stops porcupine and activates/deactivates the LED. 4) The script uses stt.wav and sends it for STT to home assistant. I guess @duhow has a vosk instance already running as an addon within HA. 5) the response from the curl POST in 4) is saved asa text file and then played via tts_google on the speaker again. If you sometimes hear a asian sounding voice coming out of the speaker, it is actually tts_google saying the word error. Took to me way to long to figure this out. Basically everytime the wakeword is detected, this will be the response since the response file will be missing.

Next steps for me: I already set up a docker instance of vosk-server on a remote host, which is working fine with the test scripts provided by their repo. What I haven't figured out: How I can post the stt.wav file directly to the websocket instance of vosk and generate a text file in response to the posted input wav file. The examples provided by vosk rely on additional packages being installed on the client. In the case of python, we're missing asyncio and websockets. I think using curl would be the solution with the smallest footprint. Using the curl command from porcupine_launcher doesn't generate any output either.

Another thing I haven't quite figured out yet: I was expecting the analysed text response to be visible on the websocket instance of vosk, but so far there is absolute silence going on. Either I'm doing something wrong or I'm stupid. Not to say it can't be both.

That's it for the moment with my rambling thoughts, I'll update here once I make further progress. If anybody has further input, please feel free to contribute to this discussion.

duhow commented 2 years ago

Vosk server is websocket, and the way its author programmed it is a bit special. Thing is, you won't be able to send that file only with curl, since you need to continue receiving data while sending the file. websocat works but only on PC, trying the prebuilt binary with speakers does not work properly, connection hangs up. And I don't want to code another program in C for doing websocket connections. That's why I'm working in creating a Vosk custom_compontent for Home Assistant and using it as an STT provider, so that way I can send audio and get the result text. For debug purposes and until I decide how to process that text, I'm just repeating it locally with Google TTS. Ideally I can send it back to Home Assistant API conversation to trigger an Intent / Command. But once again, all those actions have to be coded manually. Almond would be ideal here, but only works in English - I'd like to use it in Spanish.

hillbicks commented 2 years ago

Interesting!

Have you looked at the functionality of rhasspy for the voice assistant? Since they offer an http/mqtt endpoint, it might be easier to just use that. There is also work being done on integrating vosk as a service within the rhasspy project.

I'll start experimenting with their API over the next couple of days, maybe that it's an option.

oh, btw. what language is tts_google using? BEcause the pronunciation of error sounds really weird :p

duhow commented 2 years ago

If there's a way to integrate Rhasspy into Home Assistant, then I'll may try it.

https://github.com/duhow/xiaoai-patch/blob/e54212d0249ede93860dc2342b5e655d837cd4bb/bin/tts_google#L4

hillbicks commented 2 years ago

Already available as an addon (third party) for HA.

https://github.com/synesthesiam/hassio-addons

Rhasspy also has a very good integration with HA when it comes to intents.

Rhasspy communicates with Home Assistant directly over its REST API. Specifically, Rhasspy intents are POST-ed to the /api/intents/handle endpoint.

You must add intent: to your Home Assistant configuration.yaml to enable the endpoint.

To get started, check out the built-in intents. You can trigger them by simply naming your Rhasspy intents the same:

[HassTurnOn]
turn on the (bedroom light){name}

[HassTurnOff]
turn off the (bedroom light){name}

If you have an entity named "bedroom light" in Home Assistant, you can now turn it on and off from Rhasspy!

Documentation is here

In addition, it is also straightforward to integrate it with nodered. Next chapter in the documentation.

hillbicks commented 2 years ago

Ok, so progess update. This is way easier than I thought and we basically already have everything in place for the basics.

With just two messages to mqtt and a little bit of logic in node red, I'm able to turn my office light on and off. Activate the hotword listener and then send the recorded message to rhasspy, listen to the rhasspy websocket via node red, convert the json payload and call the home assistant service.

I'll modify the porcupine_launcher script so that it works with my setup in order to see how good the wakeword detection with the hardware on the LX06 really is.

But so far, this looks promising. Will keep you posted.

EDIT: Well, as always it seems. Hotword detection is quite good when there's no other sound, even from a distance. But turn on the radio on the LX06 and the hotword detection basically gets unusable. You have to yell more or less and do it several times in order to "launch" porcupine. I modified the script to lower the volume to 10% before listening for the command and then restore it, so that works great. But the hotword detection leaves a lot of room for improvement.

hillbicks commented 2 years ago

Here is a summary of what I've done so far and the things that are not yet working:

The goal for me was to have an offline working voice assistant that could be integrated with Home Assistant and node red. With the basic setup of this github project, a couple of modifications of the porcupine_launcher script and a seperate rhasspy instance, I basically got what I wanted..

instead of vosk, I went with rhasspy. rhasspy can act as the brain of the voice recognition so speech analysis and intents don't have to run on less powerful devices like the LX06/LX01. rhasspy can be configured as the base or a satellite instance. Satellites normally are installed on a Pi zero (or something similar), but are still fully fledged rhasppy installs. But you don't actually need the rhasspy install on a satellite, porcupine and an mqtt client is actually sufficient.

The way it works:

porcupine is constantly listening on the LX06 for the defined wakeword. Once that is recognised, it will send a mqtt message to the rhasspy base that a hotword was detected.
the porcupine script will record x seconds after the wakeword and send the recorded wav file over mqtt.
rhasspy takes the wav file from mqtt, runs speech to text on the wav file
the resulting text is checked against the intent definition on rhasspy. "what time is it" gets tagged with GetTime for example
the result is available via websocket as json. It contains the tagged intent name and the words that were recognised (among other things).
node red is listening on the websocket from rhasspy. Then it is just "standard" node red logic, parse the json message into a payload message that you can work with and then comes the rest of the logic. For example, if intent name = GetTime, then get the current time and send it via TTS back to the speaker.

Instead of node red, you can work directly with the intent interface of HA if you prefer that, there are also scripts that pull in all devices trom HA into rhasspy. I highly recomment the rhasspy documentation

Short HowTo on how to get here:

Install rhasspy

configure it like this:

mqtt: external or internal, just make sure that you use the same credentials in the porcupine_launcher script
audio recording: disabled
wake word: hermes mqtt
speech to text: kaldi
intent recognition: fsticuffs
text to speech: disabled
audio playing: disabled
dialogue management: rhasspy
intent handling: disabled

replace rows 58 to 64 of the porcupine_launcher script (packages/porcupine/config/launcher) with something like this:

# lower volume on the speaker in case anything is currently playing
amixer set mysoftvol 10%
# publish a mqtt message for rhasspy
mosquitto_pub -h $MQTT_HOST -p 1883 -u $MQTT_USER -P $MQTT_PASS -i me --qos 1-t 'hermes/hotword/default/detected' -m '{"siteId": "default" ,"modelId":"null"}'
# start recording and save the file once the defined time is up
arecord -N -D$MIC -d $TIME -f S16_LE -c $CHANNEL -r $RATE /data/message.wav
# publish the wav file as an mqtt message
mosquitto_pub -h $MQTT_HOST -p 1883 -u $MQTT_USER -P $MQTT_PASS -i me --qos 1 -t 'hermes/audioServer/default/audioFrame' -f /data/message.wav
# set volume on the speaker  to 70%
amixer set mysoftvol 70%

add a websocket node in node red, add a new listener: ws://rhasspy.local:12101/api/events/intent and the json parser as the next node. add a debug node to see what the payload actually looks like.

As you can see, there is a lot of room for improvement in the modification of the porcupine script that should be done.

use more variables in the script
Save the current volume before setting to 10%, so that we can restore the value
There is weird behaviour, where porcupine will only start recording after a successful recording, if you say the wakeword 3 times. Or you have to wait a bit. Not sure yet why that is exactly.

~I haven't figured out yet how to get the name of the speaker that was voice activated, back into node red. If I set the siteID to something else than default, the payload from websocket still contains default. That would be useful in order to just say: turn on the light and based on the siteID you can implement logic in node red to turn on the lights in the room of the speaker.~

EDIT: siteId corresponds to the name of the satellite, but it has to be added to the rhasspy config as well. Then you can use it in node red as well.

Once that is done, the rest is just config work in rhasspy (adding intents, or sentences to recognise) and the corresponding flows in node red.

Maybe that is of some help to other users out there. If you have any questions or remarks, feel free to comment.

hillbicks commented 1 year ago

Hey @duhow

I wanted to check in again and see how things are going? Have you made any progress with vosk? I basically abandoned my work on this shortly after the last post, the wakeword detection with porcupine was really not ready for a daily use.

Not sure if you follow the home assistant news, but they declared 2023 the year of the voice and also hired the rhasspy dev to work on better integration, but also to work on rhasspy. It seems they want to integrate rhasspy further with home assistant.

I tried to build porcupine 2.1 yesterday (2.0 included some bugfixes and imrovements), but it failed. I'll look into it a bit more this afternoon.

duhow commented 1 year ago

Hi there! Happy new year!

No, I did not invest much time to improve the voice assistance thing...

Porcupine new version very likely require additional changes to the new additions, what I'm doing is to use the "demo" software with a patch to "detect wakeword and exit", so that next program can continue with the recording. Having any other software replacing Porcupine that fits in the flash memory is ok, but I have no idea. And unfortunately most of the active projects are based in Python...

Ideally what the project should do is using Home Assistant API to do STT, or sending the audio directly to Rhasspy via Hermes MQTT as stream, so Rhasspy can confirm when the voice message has finished. Still, somewhat hard to achieve with Bash only, don't want to "overkill" with a C app, and Micropython might not be enough.

hillbicks commented 1 year ago

Thanks and a happy new year to you too!

Ah, yes, now the patch makes sense. I guess it is probably best to wait and see in which direction they're taking it with the different components and then continue the journey here.

As a starting point the 1.9 version of porcupine is good enough I'd say. Let's see where they're taking it.

duhow commented 1 year ago

Update status:

As of Home Assistant new services for Whisper, new commit 34bf45fcd8f1d0a7c9a5f1a6b6e9bdf4568613cb will allow to use it as Speech-to-text. Note this STT is still somewhat inacurate.

ℹ️ Get more details in Voice Assistant docs to setup.

There's still a lot of work to do both in Home Assistant intents, STT service and internal speaker Intents (TBD), so I encourage people to take more seriously home-assistant/intents to provide full functionality.

Closing this one, but feel free to provide PRs for improvements or requests. 😃

duhow / xiaoai-patch

Information gathering for voice assistant implementation via porcupine and vosk #13