WebAssembly build (32-bit)

ccoreilly commented 3 years ago

Good morning! First of all, thank you very much for the effort you are putting into reverse engineering and open-sourcing the original snowboy library. Those are impressive skills!

I have wanted a snowboy WASM build running in the browser for a while, and now with the source code it seems it might be possible. I have made an attempt, which you can find in my fork of the project, but although it builds and runs detection on the audio data sent to it, it always returns silence (-2).

I put up a demo you can try out here. Unfortunately it only works in Chrome, as Firefox does not support retrieving user audio at 16000 Hz, and resampling in the browser would add another step where things might go wrong. For now I would like to focus on getting it to work.

In order to make it build I had to comment out several self_assert statements that assume a 64-bit architecture and I found an issue stating that only 32bit ARM is supported.

I assume thus that the WebAssembly build might not work due to some math assuming 64 bits? I am a bit "überfragt" (out of my depth) :) so I was hoping you could shed some light on the issue or the changes needed to make it work.

Thanks!

Thalhammer commented 3 years ago

WebAssembly is definitely something I would have looked into as well in the future. It honestly does not surprise me a lot that it did not work out of the box, because when reversing you have to deal with output that the compiler optimized based on certain assumptions about the target, many of which probably don't hold in the browser.

There are a couple of things I found:

Your fork is 4 commits behind. I did a lot of refactoring and bug fixes in those, so it would be good if you could retry it with the most up-to-date code.
Instead of using the microphone, record a short wave file and feed that in. In addition to making it easier to test, it also allows for repeatable testing (if it works once with a given wave file it should always produce the same result). Ideally, check whether the wave file gets correctly detected with the local code.
I only skimmed the audio code, but I noticed that you set AudioGain to 5. That's almost certainly wrong. An audio gain of 5 means the audio signal is multiplied by 5 before processing, so if it only works with such a high gain, that probably means there is an issue somewhere else that just gets mitigated by such a high gain.
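To illustrate why a gain of 5 is suspicious, here is a minimal sketch. It assumes the gain is a plain multiplication on 16-bit samples that saturates at the int16 limits; the library's actual gain stage may behave differently, and `applyGain` and `clippedFraction` are hypothetical helpers, not part of snowboy.

```typescript
// Hypothetical helper: multiply 16-bit PCM samples by a gain factor and
// saturate to the int16 range. This is an assumption about how a gain stage
// behaves in general, not the library's actual code.
function applyGain(samples: Int16Array, gain: number): Int16Array {
  return samples.map((s) =>
    Math.max(-32768, Math.min(32767, Math.round(s * gain))));
}

// Fraction of samples that sit at the int16 limits after applying the gain.
function clippedFraction(samples: Int16Array, gain: number): number {
  if (samples.length === 0) return 0;
  const clipped = applyGain(samples, gain).reduce(
    (count, s) => (s === 32767 || s === -32768 ? count + 1 : count), 0);
  return clipped / samples.length;
}
```

If a normal speech recording shows a large clipped fraction at gain 5, the signal reaching the detector is heavily distorted, which hints that the gain is covering up a problem earlier in the chain.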

> In order to make it build I had to comment out several self_assert statements that assume a 64-bit architecture

That's a result of the way I reversed the library. I had to make sure the layout of my code matched the layout used during compilation of the original code while I was mixing new and original code. However, this is no longer the case, and if you update to the latest commit you will notice they are gone.

> and I found an issue stating that only 32bit ARM is supported.

I am not sure why they decided not to support 32-bit x86, but I assume it was mostly a decision to cut down on support work, given that 32-bit x86 is effectively dead. I don't think there is anything in the code that's inherently 64-bit only. That being said, the current state of the code sometimes casts a pointer to a number and adds to it, which depends on the size of the pointed-to type. I don't think I came across any of those being 64-bit, but that might be a culprit. I have never compiled the code for 32-bit yet, but I certainly intend to support it in the future.

> I assume thus that the WebAssembly build might not work due to some math assuming 64 bits?

Most of the math is floating-point math anyway (mostly 32-bit float, but in some parts double).

The steps to go further in this direction would probably be:

Compiling the code and unit tests in 32-bit mode and making sure they produce the same results.
Actually implementing a test using live microphone input (or pieces of a wave file), because it might well be that detection works fine if the recording is supplied in one chunk but not if it's streamed (e.g. because I screwed up buffering somewhere).

Steps for debugging are probably dropping debug outputs into the pipeline and comparing the output between the browser and the "normal" build. I also need to build more unit tests, because I am like 99% sure there are many missed bugs lurking in the code. TBH I have never worked with wasm (apart from reading about it, liking it, and coding the obligatory hello world), so I don't really know what the typical way of debugging it would be.
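A streamed-versus-whole test along these lines could be sketched as follows. The `Detect` callback is a hypothetical stand-in for whatever wrapper is used around RunDetection, and treating a positive result code as "hotword detected" is an assumption made for this sketch.

```typescript
// Hypothetical detector signature: takes raw 16-bit PCM samples and returns a
// result code. It stands in for whatever wrapper is used around RunDetection.
type Detect = (chunk: Int16Array) => number;

// Feed `samples` to `detect` in pieces of `chunkSize` samples (the last piece
// may be shorter) and collect the result codes in order.
function runChunked(samples: Int16Array, chunkSize: number, detect: Detect): number[] {
  if (chunkSize <= 0 || Math.floor(chunkSize) !== chunkSize) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const results: number[] = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    const end = Math.min(start + chunkSize, samples.length);
    results.push(detect(samples.subarray(start, end)));
  }
  return results;
}

// True if any chunk produced a positive result code (assumed here to mean
// "hotword detected").
function detectedAny(results: number[]): boolean {
  return results.some((r) => r > 0);
}
```

Running the same recording once with `chunkSize` equal to the full length and once with a small value should give the same `detectedAny` answer if buffering inside the library is correct.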
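Comparing dumps from the two builds could look roughly like this. It is a sketch that assumes each build writes the same intermediate values (for example feature vectors) as a flat list of numbers; `firstMismatch` is a hypothetical helper, and the tolerance is there because floating-point results may differ slightly between targets.

```typescript
// Returns the index of the first value where two dumps differ by more than
// `tolerance`, or -1 if they agree. A length mismatch counts as a difference
// at the first missing index.
function firstMismatch(reference: number[], candidate: number[], tolerance: number): number {
  const common = Math.min(reference.length, candidate.length);
  for (let i = 0; i < common; i++) {
    // Number(...) keeps this valid under stricter index-access compiler settings.
    const diff = Math.abs(Number(reference[i]) - Number(candidate[i]));
    if (!(diff <= tolerance)) return i; // the negated test also catches NaN
  }
  return reference.length === candidate.length ? -1 : common;
}
```

The first index where the dumps diverge narrows down which pipeline stage behaves differently in the browser.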

I assume that to get it running in the browser you also need to cross-compile the LAPACK and BLAS libraries. Snowboy only uses cblas in some special cases, all of which are contained in the Matrix and Vector classes. I intend to add discrete implementations of them as a build option, making the dependency on BLAS optional. The default would still use BLAS, but disabling it would probably save a lot of executable size, which is good for both embedded and browser use.

I also hate that snowboy/snowman does IO directly to disk in many places (the Input and Output classes) deep inside the library, which makes it effectively impossible to use without filesystem support and prevents embedding the resources and model inside the executable. I don't have a good plan yet for how to change that without completely breaking API compatibility.

My guess would be that something inside the VAD is broken. If it assumes the input audio is all silence, it skips everything afterwards to save performance. But then again, there was a pretty major bug in Nnet, so maybe upgrading to the latest version already fixes it.

It's nice to see people actually liking it and intending to use it. Let's hope it gets into a usable state soon.

Sincerely, Thalhammer

Thalhammer commented 3 years ago

I just compiled the whole library for 32-bit (on Linux you can just install the 32-bit version of all libs and add -m32 to the compilation). To my surprise it actually worked fine for the most part. Some models failed to load because some of the IO was the wrong size, and the enroll tests failed because the resulting matrix hash was off by 1, which is probably just a result of some of the rounding being different, so nothing to worry about (except that I need to check my test cases). All of the detection worked out of the box. That said, I do compile against prebuilt cblas, so that might help.

ccoreilly commented 3 years ago

Thank you for the prompt reply and the hints.

I followed your advice to feed it a wav file and it actually works! It is thus an issue with how and what data I was passing to the library. I will look into it later on to see if I can get a working example with the microphone.

You can try it out here with the audio_samples from the repository.

I will merge the upstream changes and clean it up a bit, would you be interested in a PR? I don't have much WebAssembly experience either but I can look out for some testing frameworks for it and try to write some tests.

ccoreilly commented 3 years ago

I forgot to ask, are the wav headers expected by RunDetection or does it suffice to pass the audio data?

Thalhammer commented 3 years ago

> I forgot to ask, are the wav headers expected by RunDetection or does it suffice to pass the audio data?

It is actually expected that you only feed the data without headers. If you feed the headers it could screw up the detection (but probably won't). It's a relic of the original API and I honestly hate it, because it's quite misleading and pretty much useless. I will mark it as deprecated at some point or remove it outright.
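Getting at the raw samples of a wave file could be sketched like this. It assumes a RIFF/WAVE file holding 16-bit little-endian PCM and does not validate the "fmt " chunk; `tagAt` and `extractPcm16` are hypothetical helpers, not part of the library.

```typescript
// Read a 4-character chunk tag (e.g. "RIFF", "data") at the given byte offset.
function tagAt(view: DataView, offset: number): string {
  return String.fromCharCode(
    view.getUint8(offset), view.getUint8(offset + 1),
    view.getUint8(offset + 2), view.getUint8(offset + 3));
}

// Return the 16-bit samples of the "data" chunk of a RIFF/WAVE file, without
// any header bytes. Assumes 16-bit little-endian PCM; the "fmt " chunk is not
// validated in this sketch.
function extractPcm16(file: ArrayBuffer): Int16Array {
  const view = new DataView(file);
  if (file.byteLength < 12 || tagAt(view, 0) !== "RIFF" || tagAt(view, 8) !== "WAVE") {
    throw new Error("not a RIFF/WAVE file");
  }
  let offset = 12;
  while (offset + 8 <= file.byteLength) {
    const id = tagAt(view, offset);
    const size = view.getUint32(offset + 4, true);
    const body = offset + 8;
    if (id === "data") {
      // Tolerate a declared size that runs past the end of the file.
      const available = Math.min(size, file.byteLength - body);
      const count = Math.floor(available / 2);
      const samples = new Int16Array(count);
      for (let i = 0; i < count; i++) {
        samples[i] = view.getInt16(body + 2 * i, true);
      }
      return samples;
    }
    offset = body + size + (size % 2); // chunks are padded to even sizes
  }
  throw new Error("no data chunk found");
}
```

Walking the chunk list instead of skipping a fixed 44 bytes matters because wave files may carry extra chunks before the audio data.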

> I will merge the upstream changes and clean it up a bit, would you be interested in a PR? I don't have much WebAssembly experience either but I can look out for some testing frameworks for it and try to write some tests.

Of course, I'll happily take a PR as long as it's well formatted.

> It is thus an issue with how and what data I was passing to the library.

Either that or something to do with buffering inside the library. However, I did a quick unit test feeding it 1024-byte chunks and it seemed to work correctly in normal builds for the universal pipeline. The personal pipeline failed, but that might be a separate issue.

Just a quick guess: when using the microphone (and assuming the 0=x 1023=x values mean the sample value), the values seem to be all over the place even if you don't speak, when they should be pretty close to zero (absolute value).

EDIT: There's definitely some bug in the handling code, since for certain chunk sizes it outright segfaults ¯_(ツ)_/¯.
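A quick way to check this is to look at the level of a frame recorded while nobody is speaking. The sketch below assumes the frame is a Float32Array with samples in the range -1 to 1, as delivered by the Web Audio API; `frameStats` is a hypothetical helper.

```typescript
// Root-mean-square level and absolute peak of one audio frame. For a silent
// room both values should stay close to zero.
function frameStats(frame: Float32Array): { rms: number; peak: number } {
  if (frame.length === 0) return { rms: 0, peak: 0 };
  let sumSquares = 0;
  let peak = 0;
  frame.forEach((s) => {
    sumSquares += s * s;
    peak = Math.max(peak, Math.abs(s));
  });
  return { rms: Math.sqrt(sumSquares / frame.length), peak };
}
```

If the RMS of a silent frame is nowhere near zero, the samples are probably being misinterpreted (wrong type, scaling, or byte layout) before they reach the detector.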

Thalhammer commented 3 years ago

OK, so the library now handles chunked audio gracefully. However, I noticed during my testing that chunk sizes below ~4000 samples sometimes cause it to not recognize the hotword. I am still cleaning up and improving things, so make sure to regularly update your fork. I really like the idea of having a working WebAssembly build, as it would make for a really cool demo of the project.

Thalhammer commented 3 years ago

I did some more testing using WebAssembly and it seems to be working now. To my surprise the performance is actually pretty good (~4% CPU load on a single core of an AMD Ryzen 7 3800X). I took most of the changes from your work and transferred them into the current master branch, and also made some additional changes to the JS and the build system (it no longer requires having emscripten installed but builds inside the docker container).

I am no JS expert, so the following might be dumb; feel free to correct me. I removed the service worker and shared memory, which greatly improves the readability. I do see why it was there, but I think the shared memory screwed up the audio signal, causing it to fail. I currently use postMessage to send the audio from the audio processor to the main window and run detection there. I'd rather run the detection directly in the audio processor and only post something if the result changes (e.g. from silence to voice) instead of for every frame; however, I don't know how to import the snowboy_wasm.js file into audio-processor.js or push the created instance to it. If that's possible it should cut another percent or so from the CPU load (or about 25% of the current usage).

Also, from what I've read, calling into WebAssembly is rather expensive, so it might make sense to buffer a couple of frames and send them in one go. The buffer size of 128 samples provided to the audio processor is rather small and does not really bring any improvement over something like 1280, since the library does a fair amount of buffering internally anyway. In my PulseAudio examples I use a chunk size of 100 ms (1600 samples), which works quite well and feels instant.

Another thing is the conversion from the provided Float32Array to Int16 for the library. It might be worth exporting the other overloads of RunDetection, since the float overload should actually do exactly the same conversion.
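Buffering the 128-sample frames into larger chunks could be sketched roughly as below. `FrameAccumulator` and its callback are hypothetical names for illustration; the chunk size of 1600 samples in the usage note is taken from the PulseAudio example mentioned above.

```typescript
// Collects small audio frames and hands out fixed-size chunks, so the
// (comparatively expensive) call into WebAssembly happens less often.
class FrameAccumulator {
  private readonly buffer: Float32Array;
  private readonly onChunk: (chunk: Float32Array) => void;
  private filled = 0;

  constructor(chunkSize: number, onChunk: (chunk: Float32Array) => void) {
    if (chunkSize <= 0 || Math.floor(chunkSize) !== chunkSize) {
      throw new RangeError("chunkSize must be a positive integer");
    }
    this.buffer = new Float32Array(chunkSize);
    this.onChunk = onChunk;
  }

  // Append one frame; calls onChunk once for every chunk that becomes full.
  push(frame: Float32Array): void {
    let offset = 0;
    while (offset < frame.length) {
      const n = Math.min(frame.length - offset, this.buffer.length - this.filled);
      this.buffer.set(frame.subarray(offset, offset + n), this.filled);
      this.filled += n;
      offset += n;
      if (this.filled === this.buffer.length) {
        this.onChunk(this.buffer.slice()); // copy so the buffer can be reused
        this.filled = 0;
      }
    }
  }
}
```

With `new FrameAccumulator(1600, handleChunk)` and 128-sample frames, the callback fires on every 13th push (12.5 frames per chunk on average), and the leftover samples carry over into the next chunk.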

I am open to feedback on the thing :)

ccoreilly commented 3 years ago

That's great! I am sorry but I haven't had much time to invest in this lately.

I agree that using an AudioWorklet and a Worker with a SharedArrayBuffer might be overkill, but I don't think doing detection in the main thread is wise, as it could block the UI rendering. For the example it is completely fine, as there is not much UI rendering done, but in other applications it could be an issue. I think the best approach would be what you mentioned, doing everything in the AudioWorklet, but the only way I know to achieve this is to concatenate both files (snowboy_wasm.js and audio-processor.js), as it is not possible to use importScripts in an AudioWorklet nor to pass an instance through postMessage.

As for converting to Int, I also agree. I tried so many things while I couldn't get it to work that, once it did work, I left it as it is now, but initially I was using the float overload.
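For reference, the conversion done on the JS side can be sketched as below. Scaling by 32767 with clamping is one common convention and an assumption here; whether the library's float overload uses exactly the same scaling is not confirmed, and `floatToInt16` is a hypothetical helper.

```typescript
// Convert Web Audio samples (floats, nominally in -1..1) to 16-bit PCM.
// Values outside the nominal range are clamped instead of wrapping around.
function floatToInt16(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  input.forEach((s, i) => {
    const clamped = Math.max(-1, Math.min(1, s));
    out[i] = Math.round(clamped * 32767);
  });
  return out;
}
```

Clamping before the cast matters because assigning an out-of-range number to an Int16Array element wraps around, which would turn a loud sample into noise.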

I'll have a deeper look later. Thanks again for the great job you're doing with this library!

Thalhammer commented 3 years ago

So shortly after writing that I found out that apparently you can just import stuff into the audio worker. Take a look at this example, which does pretty much exactly what we need:

https://googlechromelabs.github.io/web-audio-samples/audio-worklet/design-pattern/wasm/

Not sure how recent the browser would need to be in order to support this, but it might be worth looking into.

EDIT: I played around a bit more with that approach and it seems to work in theory; however, you can't download files, and IndexedDB (which is used by emscripten for its FS) is not available. One could probably get it to work, but I don't think it's worth the effort right now, especially given that according to Chrome's dev tools the detection requires about 1.4% CPU (on my machine), so it's pretty much free. Since it also does not matter if detection gets delayed by UI actions, and the chunks are short enough to not delay the UI, there's probably little reason to put in much extra work to get it running inside the audio processor. However, this changes once I remove the IO from inside the library (which is on the todo list anyway), at which point I might revisit it.

Arnaudv6 commented 5 months ago

Context information:
The Snowman demo app works like a charm, and enrolling is a nice-to-have feature! Thanks!
Alas, I ended up spending too much time trying to make snowman work in our Angular app.
I hoped compiling snowman with a recent emscripten toolchain might help with JS dependencies...
But then I found bumblebee, which is turnkey on Angular... and left.
(Pico-voice/porcupine is a no-go for us because of the licence fees.)

Point is:
Just so you know, a few months after you gave us the snowman webasm build, kaldi merged this: https://github.com/kaldi-asr/kaldi/pull/4273 and referenced this build in their doc: https://github.com/kaldi-asr/kaldi?tab=readme-ov-file#web-assembly
with builds ready for download: https://gitlab.inria.fr/multispeech/kaldi.web/clapack-wasm
https://gitlab.inria.fr/multispeech/kaldi.web/kaldi-wasm/-/releases
(Their wiki is great too.)

sveinbjornt commented 5 months ago

Just wanted to point to this: https://github.com/musistudio/wasm-snowboy

I have run it in the browser and can confirm it works.

Thalhammer / snowman

WebAssembly build (32-bit) #8