gexgd0419 / NaturalVoiceSAPIAdapter

Make Azure natural TTS voices accessible to any SAPI 5-compatible application.
MIT License

Any intention to make use of the embedded speech-to-text feature? #3

Open Meshwa428 opened 1 month ago

Meshwa428 commented 1 month ago

Microsoft Speech SDK also includes a speech-to-text feature, which is very good and has a very low WER (word error rate). You can try it by pressing Win+H on your keyboard; a streaming speech-to-text panel will pop up on your screen.

It even works offline

You can find some help here

Meshwa428 commented 1 month ago

Here is some sample code provided by Microsoft

Here

gexgd0419 commented 1 month ago

You mean exposing embedded speech recognition models as SAPI 5 speech recognition (SR) engines?

Actually I'm interested in that too. I extracted the key for the recognition models, and did some experiments on my system to prove that it does work.

However, implementing a custom SAPI SR engine is more difficult than implementing a custom SAPI TTS engine.

ISpTTSEngine only requires implementing two methods: Speak and GetOutputFormat, while ISpSREngine plus ISpSREngine2 requires a lot more.

SAPI SR engines usually support not only dictation (speech to any text), but also recognizing voice commands defined by grammars. Embedded speech does support recognizing voice commands via intent recognition, which I suspect is what the new Voice Access feature on Windows 11 is based on. However, the grammar system in SAPI seems more complex and flexible than what IntentRecognizer can provide, which means that translating from SAPI grammars to IntentRecognizer patterns can be difficult, or sometimes impossible without losing some information.

This might be easier if I could just implement the dictation part. However, support for voice command recognition is required if you want to use the SR engine with the built-in Speech Recognition feature in Windows. It can do dictation, but only when what the user says does not match any of the supported voice commands.

Now I wonder, how many apps are actually utilizing SAPI speech recognition engines?

Meshwa428 commented 1 month ago

Whoaa, that seems more complex than implementing TTS. But is there a way to bypass entering the API key to access the models?

I know where the STT model files are stored, so it might be possible to load them directly from there 🤔

And then just try to replicate the grammar code from their code and get the job done. These models run very well in 100 to 200 MB of RAM, which makes them well suited to boards like the Raspberry Pi, and their WER is also low, which makes them high-quality models.

Anyway, thanks for the insights 🙂.

gexgd0419 commented 1 month ago

I checked the documentation about pattern matching in intent recognition again.

The Speech SDK provides an embedded pattern matcher that you can use to recognize intents in a strict way. This is useful for when you need a quick offline solution.

So the pattern matching is completely offline? If the so-called "pattern matching" is just matching the recognized text, then I don't need to translate SAPI grammars to intent recognition patterns at all. I can use, for example, regular expressions to match the text.
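To illustrate the idea, here is a minimal Python sketch of matching recognized text against voice commands with regular expressions, falling back to dictation when nothing matches. It does not use the Speech SDK or SAPI at all: the command table, the `interpret` function, and the intent names are hypothetical, and real SAPI grammars can express more than these flat patterns.

```python
import re

# Hypothetical command table: each compiled pattern stands in for one
# voice-command rule. Named groups play the role of grammar "slots".
COMMANDS = [
    ("open_app", re.compile(r"^(?:open|start|launch) (?P<app>.+)$", re.IGNORECASE)),
    ("scroll", re.compile(r"^scroll (?P<direction>up|down)$", re.IGNORECASE)),
]


def normalize(text):
    """Strip punctuation and collapse whitespace before matching."""
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def interpret(recognized_text):
    """Return ("command", name, slots) or ("dictation", original_text, {})."""
    cleaned = normalize(recognized_text)
    for name, pattern in COMMANDS:
        match = pattern.match(cleaned)
        if match:
            return ("command", name, match.groupdict())
    # Nothing matched: treat the utterance as plain dictation.
    return ("dictation", recognized_text, {})
```

Commands are tried first and dictation is the fallback, which mirrors the behavior described earlier for the built-in Speech Recognition feature in Windows.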

Anyway, if you want to access the speech recognition models installed on your system, you can use the extracted keys in my source file. Keep in mind that the keys are not guaranteed to work forever.

Meshwa428 commented 1 month ago

Oh, thanks for the keys. By the way, I used the keys to run the models that came from Microsoft, and it seems that they are using different models here. Is there any way to download the speech recognition models provided by embedded speech?

I kind of want them so badly for my project 😅

I have been searching for months now but no success.

gexgd0419 commented 1 month ago

The keys are for the models that will be installed when you go to Windows Settings > Time & language > Language & region, open the language option for a language, then choose to install "Enhanced speech recognition". (By the way, the "Basic speech recognition" is for installing the older SAPI 5 speech recognition engines.)

You can get the paths for the installed models by using the following PowerShell:

Get-AppxPackage -Name MicrosoftWindows.Speech.* | Select-Object Name, InstallLocation

Or in code, use the WinRT API PackageManager.FindPackagesForUser to get a list of all installed packages, then find the packages whose ID starts with "MicrosoftWindows.Speech.".
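If calling WinRT directly is inconvenient, one alternative is to run the same PowerShell query from a script and parse its JSON output. The Python sketch below takes that route instead of calling PackageManager.FindPackagesForUser; the function names are hypothetical, and it assumes `powershell` is on the PATH of a Windows machine.

```python
import json
import subprocess

# Same query as the PowerShell one-liner, with JSON output for easy parsing.
PS_COMMAND = (
    "Get-AppxPackage -Name MicrosoftWindows.Speech.* | "
    "Select-Object Name, InstallLocation | ConvertTo-Json"
)


def parse_packages(json_text):
    """Turn ConvertTo-Json output into a {name: install_location} dict."""
    if not json_text.strip():
        return {}  # no matching packages installed
    data = json.loads(json_text)
    # ConvertTo-Json emits a bare object, not a list, for a single result.
    if isinstance(data, dict):
        data = [data]
    return {item["Name"]: item["InstallLocation"] for item in data}


def find_speech_packages():
    """Run the query (Windows only) and return the parsed result."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", PS_COMMAND],
        capture_output=True, text=True, check=True,
    )
    return parse_packages(result.stdout)
```

Calling `find_speech_packages()` on a system with speech packs installed should return a dictionary mapping each package name to its install folder.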

If you want a download link, you can find your installed "Speech Packs" in Microsoft Store's library. Then you can copy their Microsoft Store links. For example, here's the link to the English (US) speech pack. ("Speech Packs" are for speech recognition, not for Narrator's natural voices.)

If you want a more direct way to download, you can utilize something like store.rg-adguard.net to get direct download links to download the msix files, without requiring the Microsoft Store app or a Microsoft account. The downloaded msix files can be extracted to a folder just like a zip file, and then you should be able to use the folder as the model path.
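Since an msix file is a zip-based package, the extraction step can also be scripted. This is a minimal Python sketch using only the standard library; the function name and the paths in the usage line are hypothetical placeholders.

```python
import zipfile
from pathlib import Path


def extract_msix(msix_path, dest_dir):
    """Extract an msix package into dest_dir and return the folder path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # zipfile opens the package regardless of its file extension.
    with zipfile.ZipFile(msix_path) as archive:
        archive.extractall(dest)
    return dest
```

For example, `extract_msix("speech-pack.msix", "models/en-US")` would unpack the package into `models/en-US`, and that folder could then be tried as the model path.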

Finally, if you want to know the official way to get the offline models, see this documentation about embedded speech. It requires you to submit an application form, and if you are eligible, you can get the model files with your own keys.

Meshwa428 commented 1 month ago

Great 😃 thanks buddy 🙏, appreciate your help