Support speech commands using Web Speech API, a dense enough neural net, or a combination of both

Some time ago, an experimental implementation of the Web Speech API was added (some further additions here). It was later removed from the build because some components were not optimally written and had a negative impact on performance and code complexity. These issues were not grave, but the speech commands were not easy enough to use to warrant either leaving the issues unaddressed or putting in the work to solve them.

There are 2 main concerns:

The Web Speech API relies on a built-in browser feature that sends all your voice input over the wire to some barely documented service and back, leading to security, privacy, and performance concerns.
Recognition is still error prone (mostly due to ambiguity between similar-sounding words), and it's hard to make it reliably capture a specific subset of words. Even though the spec describes a mechanism to restrict recognition to a grammar, I found that no browser actually implements this.
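
Since grammar restriction can't be relied on, one possible workaround is to filter on the client: request several alternatives per result and prefer the one that contains the most known command words. The sketch below is only an illustration, not the code that was in this repo. The `Alternative` shape mirrors the `transcript` and `confidence` fields of `SpeechRecognitionAlternative`; `VOCABULARY`, `countKnownWords`, and `pickAlternative` are hypothetical names.

```typescript
// Mirrors the fields of SpeechRecognitionAlternative in the Web Speech API.
interface Alternative {
  transcript: string;
  confidence: number;
}

// Hypothetical command vocabulary, for illustration only.
const VOCABULARY = new Set(["open", "close", "undo", "redo"]);

// Count how many words of the transcript are known command words.
function countKnownWords(transcript: string, vocabulary: Set<string>): number {
  return transcript
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => vocabulary.has(word)).length;
}

// Pick the alternative with the most known words; break ties on confidence.
// Returns null when no alternative contains any known word.
function pickAlternative(
  alternatives: Alternative[],
  vocabulary: Set<string> = VOCABULARY,
): Alternative | null {
  let best: Alternative | null = null;
  let bestCount = 0;
  for (const alt of alternatives) {
    const count = countKnownWords(alt.transcript, vocabulary);
    if (count === 0) continue;
    const isBetter =
      best === null ||
      count > bestCount ||
      (count === bestCount && alt.confidence > best.confidence);
    if (isBetter) {
      best = alt;
      bestCount = count;
    }
  }
  return best;
}
```

In a browser, the alternatives would come from the entries of a `SpeechRecognitionResult`, with `maxAlternatives` set higher than the default of 1 on the recognition object.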

Even with these concerns, the results were not that bad. With some hacking and a deliberate choice of non-clashing command words, it was usable enough to be viable as an approach, at least for some subset of possible commands. Connecting this in a general way to how application state is handled in this repo went surprisingly well, and in theory you can control any state this way. Even handling multiple commands in a single sentence, while ignoring irrelevant words in between, went better than expected.
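
To illustrate the idea of several commands in one sentence with filler words ignored, here is a minimal sketch. It does not reproduce how this repo actually wires commands to state; the `State` and `Command` types, the `COMMANDS` map with its command words, and `applyTranscript` are all assumptions made up for the example.

```typescript
// Hypothetical state shape and command type, for illustration only.
type State = Record<string, unknown>;
type Command = (state: State) => State;

// Hypothetical, deliberately non-clashing command words.
const COMMANDS = new Map<string, Command>([
  ["dark", (s) => ({ ...s, theme: "dark" })],
  ["light", (s) => ({ ...s, theme: "light" })],
  ["fullscreen", (s) => ({ ...s, fullscreen: true })],
]);

// Walk the transcript word by word, applying every known command in order
// and silently skipping words that are not commands.
function applyTranscript(
  state: State,
  transcript: string,
  commands: Map<string, Command> = COMMANDS,
): State {
  let next = state;
  for (const word of transcript.toLowerCase().split(/[^a-z0-9]+/)) {
    const command = commands.get(word);
    if (command) next = command(next);
  }
  return next;
}

const result = applyTranscript(
  { theme: "light", fullscreen: false },
  "Please go dark and then fullscreen",
);
console.log(result);
```

Keeping commands as pure state-to-state functions means the recognizer (Web Speech API or a local network) only has to produce a transcript string, so it can be swapped without touching the command handling.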

Using a neural network

Both of these concerns can be addressed by using a local neural network instead, which can be loaded directly into the browser's memory (or perhaps into a service worker).

Using a network from an open source library

I found this library, which offers networks in multiple sizes; its demo currently has networks ranging from ~50MB to ~1GB: https://github.com/FL33TW00D/whisper-turbo.

The nice thing about it is that it uses the GPU, which the browser leaves under-utilized a lot of the time, though it should also be able to "stay out of the way" whenever the browser does need the GPU.

I don't know what work is involved in using such a network in a more complex use case, but at least on the demo site the results look very promising. The time it takes to load the biggest network was not a deal breaker, and some people report good results even with the smaller networks.

If the vocabulary of commands is small enough, it might be possible to use the smaller networks, which should definitely be fine memory-wise. Many complex browser apps use much more memory than this.

If using a large amount of memory doesn't cause too many problems, there may also be a benefit in using the larger ones.

Creating a custom neural network

There might be some benefit in training a network specifically for the commands included in this repo, which should result in a smaller size for similar performance. However, I have no experience with training such a network, let alone maintaining it over time.
