emexlabs / WearableIntelligenceSystem

Wearable computing software framework for intelligence augmentation research and applications. Easily build smart glasses apps, relying on built in voice command, speech recognition, computer vision, UI, sensors, smart phone connection, NLP, facial recognition, database, cloud connection, and more. This repo is in beta.
MIT License

ASR on ASP #11

Closed CaydenPierce closed 2 years ago

CaydenPierce commented 2 years ago

Part of the move away from streaming sensor data over the internet.

Relies on implementation of #10

ASR on Android

We need to be able to transcribe speech locally. ASR takes a lot of RAM, compute, and battery, so it's not realistic to do on the ASG (Android Smart Glasses). Streaming audio to the internet 8 hours a day, every day, uses too much data. This means ASR must happen on the ASP (Android Smart Phone).
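For a rough sense of the data volume involved, here is a back-of-the-envelope sketch. It assumes uncompressed 16 kHz, 16-bit mono PCM, which is a common ASR input format; the thread does not state the actual audio format, and a compressed codec would lower these numbers.

```python
def pcm_bytes(hours, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Bytes of raw PCM audio produced in `hours` of continuous recording.

    The defaults (16 kHz, 16-bit, mono) are assumptions, not values from
    this thread.
    """
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


daily = pcm_bytes(8)
print(f"{daily / 1e6:.1f} MB per day")           # 921.6 MB per day
print(f"{daily * 30 / 1e9:.1f} GB per 30 days")  # 27.6 GB per 30 days
```

Under those assumptions, 8 hours a day comes to just under 1 GB per day, which would be heavy for a typical mobile data plan.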

After considerable research by yours truly, it seems that the best option for doing this on Android is Vosk: https://github.com/alphacep/vosk-api
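As a quick way to try Vosk before wiring it into the Android app, a minimal desktop sketch using the Vosk Python bindings might look like the following. It is not the project's implementation; the file and model paths are placeholders, and it assumes a 16-bit mono WAV file and an unpacked model directory.

```python
import json
import wave


def text_from_result(result_json: str) -> str:
    """Vosk returns results as JSON strings; pull out the transcript."""
    return json.loads(result_json).get("text", "")


def transcribe_wav(wav_path: str, model_dir: str) -> str:
    """Transcribe a 16-bit mono WAV file with an unpacked Vosk model."""
    # Imported here so text_from_result works even without vosk installed.
    from vosk import Model, KaldiRecognizer

    with wave.open(wav_path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        recognizer = KaldiRecognizer(Model(model_dir), wf.getframerate())
        pieces = []
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns True when an utterance has ended.
            if recognizer.AcceptWaveform(data):
                pieces.append(text_from_result(recognizer.Result()))
        pieces.append(text_from_result(recognizer.FinalResult()))
    return " ".join(p for p in pieces if p)
```

Usage would be along the lines of `transcribe_wav("sample.wav", "vosk-model-small-en-us-0.15")`, with both paths adjusted to wherever the audio and model actually live.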

TODO (after completion of #10)

nshmyrev commented 2 years ago

> ASR takes a lot of RAM, compute, and battery, so it's not realistic to do on the ASG.

It can be optimized. What are the available resources on ASG?

CaydenPierce commented 2 years ago

Hey @nshmyrev, thanks for checking this out.

ASG - Android Smart Glasses

The current ASG hardware is a Vuzix Blade with specs:

Important considerations:

nshmyrev commented 2 years ago

> The current ASG hardware is a Vuzix Blade with specs:

Well, you can definitely run at least keyword activation on that. Something like https://github.com/ARM-software/ML-KWS-for-MCU should help and take very few resources. The rest depends on the app: whether you want to recognize just a few commands or more serious queries.

> ASR accuracy/WER - this is used for live conversations in noisy environments - we were using Google because DeepSpeech just wasn't accurate enough - realize that Vosk is achieving significantly better performance with larger models, and hoping to use larger models on ASP (Android Smart Phone)

Ok, if you need help on this let me know. I wanted to work with Vuzix on that but they never responded to my queries somehow.

CaydenPierce commented 2 years ago

> Ok, if you need help on this let me know. I wanted to work with Vuzix on that but they never responded to my queries somehow.

Great, thanks, we could certainly use some help in terms of getting highest possible accuracy/WER.

Since we will be streaming audio from ASG to ASP either way, it makes sense battery-wise and compute-wise to do ASR on ASP.

I've seen incredible results with the vosk-model-en-us-0.22 and good results with the vosk-model-small-en-us-0.15. How reasonable would it be to get the larger model (or something in between), with its better accuracy, running on a modern Android smartphone?

We're also looking into better sensors - the microphone used has a drastic effect on the WER. Have you tried different mics and found any that are ideal?
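To compare models or microphones on equal footing, WER can be measured against a hand-made reference transcript. The sketch below uses the standard definition (word-level edit distance divided by the number of reference words); it is an illustration, not code from this repo, and it does no text normalization such as lowercasing or punctuation stripping.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn on the lights", "turn the light"))  # 0.5
```

Running the same recordings through each model and each microphone, then comparing the resulting WER values, gives a like-for-like view of which change helps most.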

Happy to move this to an issue on Vosk repo. First priority is getting the whole pipeline working, but we'll soon want to optimize.

Thanks @nshmyrev

CaydenPierce commented 2 years ago

Tested both vosk-model-en-us-0.22 and vosk-model-en-us-0.22-lgraph from https://alphacephei.com/vosk/models on Android, and both work! This is on my private fork which will be merged in the next few days.

The larger model vosk-model-en-us-0.22 wouldn't build with Gradle (OOM, even with an 8 GB build allowance), but the build works in Bazel. The build takes 10 minutes for vosk-model-en-us-0.22 though, so we will want to make the model something to download separately from the APK.
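The "download separately from the APK" idea amounts to fetching and unpacking the model on first use instead of bundling it at build time. A minimal desktop sketch of that pattern follows. The function name and parameters are hypothetical, and it assumes the archive is a zip containing a single top-level folder named after the model; an Android version would do the equivalent with the platform's own download and storage APIs.

```python
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def ensure_model(model_name: str, archive_url: str, models_dir: Path) -> Path:
    """Return the unpacked model directory, downloading it only if missing.

    Assumes the zip holds one top-level folder called `model_name`.
    """
    target = models_dir / model_name
    if target.is_dir():
        return target  # already fetched on an earlier run

    models_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "model.zip"
        urllib.request.urlretrieve(archive_url, str(archive))
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(models_dir)

    if not target.is_dir():
        raise RuntimeError(f"archive did not contain a '{model_name}' folder")
    return target
```

The caller supplies the archive URL for whichever model it wants; the first call pays the download cost and later calls return immediately.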

@nshmyrev

nshmyrev commented 2 years ago

> The larger model vosk-model-en-us-0.22 wouldn't build with Gradle (OOM, even with an 8 GB build allowance), but the build works in Bazel. The build takes 10 minutes for vosk-model-en-us-0.22 though, so we will want to make the model something to download separately from the APK.

Big model is certainly not for Android. The lgraph version should be ok.

CaydenPierce commented 2 years ago

Successful for all steps. Vosk is very high quality ASR, even with the small model.

A few things we need and will follow up with in future issues: