emexlabs / WearableIntelligenceSystem

Wearable computing software framework for intelligence augmentation research and applications. Easily build smart glasses apps, relying on built in voice command, speech recognition, computer vision, UI, sensors, smart phone connection, NLP, facial recognition, database, cloud connection, and more. This repo is in beta.

MIT License


ASR on ASP #11

Closed: CaydenPierce closed this issue 2 years ago

CaydenPierce commented 2 years ago

Part of the move away from streaming sensor data over the internet.

Relies on implementation of #10

ASR on Android

We need to be able to transcribe speech locally. ASR takes a lot of RAM, compute, and battery, so it's not realistic to do on the ASG. Streaming audio to the internet 8 hours a day, every day, takes too much data. This means it must happen on the ASP.
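To put a rough number on the data cost, here is a back-of-the-envelope sketch. The audio format (16 kHz, 16-bit, mono, uncompressed PCM) is an assumption and is not stated in this issue; a compressed codec would lower the figure considerably.

```python
# Rough data-volume estimate for streaming raw audio to a cloud ASR service.
# Assumed (not from this issue): 16 kHz, 16-bit, mono, uncompressed PCM.

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
CHANNELS = 1


def pcm_bytes(hours: float) -> int:
    """Bytes of uncompressed PCM produced by `hours` of continuous capture."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


daily = pcm_bytes(8)
print(f"per day:   {daily / 1e6:.1f} MB")       # per day:   921.6 MB
print(f"per month: {daily * 30 / 1e9:.2f} GB")  # per month: 27.65 GB
```

Under those assumptions, 8 hours a day comes to a little under 1 GB per day, which is likely more than a typical mobile data plan can absorb.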

After considerable research by yours truly, it seems that the best option for doing this on Android is Vosk: https://github.com/alphacep/vosk-api

TODO (after completion of #10)

[x] get Vosk Android API demo working and test locally on ASP: https://github.com/alphacep/vosk-android-demo
[x] pull Vosk Android libs into ASP app
[x] run Vosk on incoming audio stream from ASG and receive transcriptions
[x] make transcriptions available to rest of ASP application
[x] stream transcriptions to ASG
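The middle steps of this list (feed the incoming audio to Vosk, collect transcriptions, hand them on) can be sketched as below. This is a minimal Python illustration, not the ASP app's code: the ASP app would use the Android bindings, which expose analogous recognizer calls. `transcribe_stream` and `StubRecognizer` are hypothetical names; the stub exists only so the sketch runs without a model installed.

```python
import json
from typing import Iterable, Iterator, Tuple


def transcribe_stream(recognizer, chunks: Iterable[bytes]) -> Iterator[Tuple[str, str]]:
    """Yield ("partial" | "final", text) events for a stream of PCM chunks.

    `recognizer` is expected to behave like vosk.KaldiRecognizer:
    AcceptWaveform(bytes) is truthy at an utterance boundary, and
    Result() / PartialResult() / FinalResult() return JSON strings.
    """
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                yield ("final", text)
        else:
            partial = json.loads(recognizer.PartialResult()).get("partial", "")
            if partial:
                yield ("partial", partial)
    # Flush whatever is still buffered when the stream ends.
    text = json.loads(recognizer.FinalResult()).get("text", "")
    if text:
        yield ("final", text)


class StubRecognizer:
    """Hypothetical stand-in: each chunk is one word, b"|" ends an utterance."""

    def __init__(self):
        self._words = []

    def AcceptWaveform(self, data: bytes) -> bool:
        if data == b"|":
            return True
        self._words.append(data.decode())
        return False

    def Result(self) -> str:
        text, self._words = " ".join(self._words), []
        return json.dumps({"text": text})

    def PartialResult(self) -> str:
        return json.dumps({"partial": " ".join(self._words)})

    def FinalResult(self) -> str:
        return self.Result()


# With the real library (needs `pip install vosk` and an unpacked model):
#   from vosk import Model, KaldiRecognizer
#   rec = KaldiRecognizer(Model("path/to/model"), 16000)
for kind, text in transcribe_stream(StubRecognizer(), [b"hello", b"world", b"|"]):
    print(kind, text)
```

The "final" events are what would be made available to the rest of the ASP application and streamed back to the ASG; "partial" events are useful for a live display.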

nshmyrev commented 2 years ago

> ASR takes a lot of RAM, compute, and battery, so it's not realistic to do on the ASG.

It can be optimized. What are the available resources on ASG?

CaydenPierce commented 2 years ago

Hey @nshmyrev , thanks for checking this out.

ASG - Android Smart Glasses

The current ASG hardware is a Vuzix Blade with specs:

Quad Core ARM Cortex-A53
1GB RAM
Android 5.1.1

Important considerations:

battery life - this is paramount. Whichever is cheaper on battery (streaming audio over WiFi or BT vs. running ASR on-device) will probably win
ASR accuracy/WER - this is used for live conversations in noisy environments. We were using Google because DeepSpeech just wasn't accurate enough. We realize that Vosk achieves significantly better performance with larger models, and we hope to use larger models on the ASP (Android Smart Phone)

nshmyrev commented 2 years ago

> The current ASG hardware is a Vuzix Blade with specs:

Well, you definitely can run at least keyword activation on that. Something like https://github.com/ARM-software/ML-KWS-for-MCU should help and take very few resources. The rest depends on the app if you want to recognize just a few commands or more serious queries.

> ASR accuracy/WER - this is used for live conversations in noisy environments. We were using Google because DeepSpeech just wasn't accurate enough. We realize that Vosk achieves significantly better performance with larger models, and we hope to use larger models on the ASP (Android Smart Phone)

Ok, if you need help on this let me know. I wanted to work with Vuzix on that but they never responded to my queries somehow.

CaydenPierce commented 2 years ago

> Ok, if you need help on this let me know. I wanted to work with Vuzix on that but they never responded to my queries somehow.

Great, thanks, we could certainly use some help in terms of getting highest possible accuracy/WER.

Since we will be streaming audio from ASG to ASP either way, it makes sense battery-wise and compute-wise to do ASR on ASP.

I've seen incredible results with the vosk-model-en-us-0.22 and good results with the vosk-model-small-en-us-0.15. How reasonable would it be to get the larger model (or something in between), with its better accuracy, running on a modern Android smart phone?

We're also looking into better sensors - the microphone used has a drastic effect on the WER. Have you tried different mics and found any that are ideal?
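For comparing microphones, WER can be computed as the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is one standard way to do that; the lowercasing and whitespace tokenization are simplifying assumptions, and `word_error_rate` is a name chosen for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion (reference word missed)
                curr[j - 1] + 1,         # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn on the lights", "turn of the light"))  # 0.5
```

Recording the same spoken script through each candidate microphone and scoring the transcripts this way would give a like-for-like comparison.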

Happy to move this to an issue on the Vosk repo. Our first priority is getting the whole pipeline working, but we'll soon want to optimize.

Thanks @nshmyrev

CaydenPierce commented 2 years ago

Tested both vosk-model-en-us-0.22 and vosk-model-en-us-0.22-lgraph from https://alphacephei.com/vosk/models on Android, and both work! This is on my private fork which will be merged in the next few days.

The larger model vosk-model-en-us-0.22 wouldn't build with Gradle (OOM, even with an 8 GB build allowance), but the build works in Bazel. It takes 10 minutes for vosk-model-en-us-0.22 though, so we will want to make this something to download separately from the APK.

@nshmyrev
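Downloading the model separately from the APK could look roughly like the following. This is a Python sketch of the logic only (an Android implementation would be written in Java or Kotlin). `ensure_model` and `fetch_from_alphacephei` are hypothetical names, and two things are assumptions rather than facts from this issue: that the archive lives at `https://alphacephei.com/vosk/models/<name>.zip`, and that the zip contains a single top-level folder named after the model.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import Callable


def ensure_model(models_root: Path, model_name: str,
                 fetch_zip: Callable[[str, Path], None]) -> Path:
    """Return the unpacked model directory, fetching it on first use only."""
    model_dir = models_root / model_name
    if model_dir.is_dir():
        return model_dir  # already downloaded on an earlier run

    models_root.mkdir(parents=True, exist_ok=True)
    archive = models_root / f"{model_name}.zip"
    fetch_zip(model_name, archive)

    # Assumes a trusted archive with one top-level folder named `model_name`.
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(models_root)
    archive.unlink()  # the zip is no longer needed once unpacked

    if not model_dir.is_dir():
        raise RuntimeError(f"archive did not contain a '{model_name}' folder")
    return model_dir


def fetch_from_alphacephei(model_name: str, dest: Path) -> None:
    """Assumed download location; check the Vosk models page for the real link."""
    url = f"https://alphacephei.com/vosk/models/{model_name}.zip"
    urllib.request.urlretrieve(url, str(dest))


# Example (performs a large download, so it is left commented out):
#   ensure_model(Path("models"), "vosk-model-en-us-0.22-lgraph", fetch_from_alphacephei)
```

Passing the fetch function in as a parameter keeps the caching logic testable without network access.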

nshmyrev commented 2 years ago

> The larger model vosk-model-en-us-0.22 wouldn't build with Gradle (OOM, even with an 8 GB build allowance), but the build works in Bazel. It takes 10 minutes for vosk-model-en-us-0.22 though, so we will want to make this something to download separately from the APK.

Big model is certainly not for Android. The lgraph version should be ok.

CaydenPierce commented 2 years ago

Successful for all steps. Vosk is very high quality ASR, even with the small model.

A few things we need and will follow up with in future issues:

omnidirectional microphone that picks up far-field data - we want to be able to transcribe conversations, but right now we are only transcribing the user because the microphone on the Vuzix Blade is directional and near-field
custom vocabulary for wake words and commands
contextual ASR - using the context and the user's vocabulary to judge which ASR output is more likely among a probability distribution of outputs
English grammar ASR - using proper grammar / making linguistic sense to judge the likely ASR output among a probability distribution of outputs
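Two of these items can be sketched with features Vosk already exposes. As far as I know, the recognizer accepts a JSON list of phrases as a grammar (this appears to work only with the small or lgraph-style models, not the big static-graph one), and it can return n-best alternatives when `SetMaxAlternatives` is set. The helper names, the `bonus` weight, and the naive vocabulary-overlap scoring below are illustrative assumptions, not a tuned method.

```python
import json
from typing import Iterable, Set


def build_grammar(phrases: Iterable[str]) -> str:
    """JSON phrase list usable as a Vosk grammar; "[unk]" absorbs other speech."""
    unique = sorted({p.strip().lower() for p in phrases if p.strip()})
    return json.dumps(unique + ["[unk]"])


def pick_alternative(result_json: str, user_vocab: Set[str], bonus: float = 1.0) -> str:
    """Re-rank n-best output, adding `bonus` for each word found in `user_vocab`.

    Expects the shape Vosk returns when alternatives are enabled:
    {"alternatives": [{"text": "...", "confidence": <score>}, ...]}
    """
    alternatives = json.loads(result_json).get("alternatives", [])
    if not alternatives:
        return ""

    def score(alt: dict) -> float:
        hits = sum(word in user_vocab for word in alt["text"].split())
        return alt.get("confidence", 0.0) + bonus * hits

    return max(alternatives, key=score)["text"].strip()


# Usage with the real library would look something like:
#   rec = KaldiRecognizer(model, 16000, build_grammar(["hey computer", "take note"]))
#   rec.SetMaxAlternatives(5)
print(build_grammar(["Hey Computer", "take note"]))
```

The confidence values in the n-best list are recognizer scores rather than probabilities, so `bonus` would need tuning against real data. A language-model score could be added to the same `score` function to approach the grammar-aware item.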