Considering that this code already exists somewhere in the guts of the browser, it is pretty silly to compile it separately into WebAssembly. Unfortunately, there isn't actually any API to access it from JavaScript, so we are stuck having to do our own VAD for endpointing.
The problems with the WebRTC code used in PocketSphinx5 are:
Computation is done in fixed-point, so we have to convert back and forth between Float32 and 16-bit integer samples
WebAudio doesn't let us choose our buffer size, and neither does the VAD, so we have to implement a ring-buffer (we have to do this anyway, but...)
WebAudio can already do an FFT for us, more efficiently, but the AnalyserNode API is utter garbage designed only for making pretty pictures, so never mind
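To make the first two points concrete, here is a minimal sketch of the glue code they imply: a Float32-to-Int16 conversion and a ring buffer that re-chunks whatever block size WebAudio hands us into the fixed frames a VAD wants. The names `floatToInt16` and `FrameRingBuffer` are hypothetical and are not part of SoundSwallower, PocketSphinx or WebRTC; the default capacity is an arbitrary assumption.

```typescript
// Hypothetical helpers for illustration; not an existing API.

/** Convert Web Audio Float32 samples in [-1, 1] to 16-bit fixed-point. */
function floatToInt16(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping.
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}

/** Re-chunk arbitrarily sized input blocks into fixed-size frames. */
class FrameRingBuffer {
  private buf: Int16Array;
  private frameSize: number;
  private readPos = 0; // index of the oldest unread sample
  private count = 0;   // number of unread samples

  constructor(frameSize: number, capacityFrames = 8) {
    this.frameSize = frameSize;
    this.buf = new Int16Array(frameSize * capacityFrames);
  }

  /** Append samples; throws rather than silently dropping audio. */
  push(samples: Int16Array): void {
    if (this.count + samples.length > this.buf.length) {
      throw new Error("ring buffer overflow");
    }
    let writePos = (this.readPos + this.count) % this.buf.length;
    for (let i = 0; i < samples.length; i++) {
      this.buf[writePos] = samples[i];
      writePos = (writePos + 1) % this.buf.length;
    }
    this.count += samples.length;
  }

  /** Return the next complete frame, or null if not enough data yet. */
  pop(): Int16Array | null {
    if (this.count < this.frameSize) return null;
    const frame = new Int16Array(this.frameSize);
    for (let i = 0; i < this.frameSize; i++) {
      frame[i] = this.buf[(this.readPos + i) % this.buf.length];
    }
    this.readPos = (this.readPos + this.frameSize) % this.buf.length;
    this.count -= this.frameSize;
    return frame;
  }
}
```

The intended use would be to `push` each converted input block as it arrives and then call `pop` in a loop until it returns `null`, handing each frame to the VAD. For instance, 128-sample blocks could be fed in while 480-sample frames (30 ms at 16 kHz) come out; those sizes are only an example.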
For these reasons the ideal solution is, horror of horrors, something very much like the -remove_silence option in PocketSphinx that was the whole reason for creating SoundSwallower in the first place (because I was so seriously annoyed at it removing data from the input, making force-alignment useless). Of course, it has to be done in a way that makes endpointing optional and doesn't break the batch-mode API. So, specifically:
Encapsulate input features (MFCCs, but not necessarily) for the decoder
Create a fused feature extractor and endpointer which emits speech start/stop events and feature buffers, with timestamps
Internally we can either use the WebRTC method based on log-spectra or the PocketSphinx 5prealpha method.
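As a rough illustration of what such a fused front end might look like from the JavaScript side, here is a sketch under loud assumptions. Every name in it (`FusedFrontEnd`, `FrontEndEvent`, the option names) is hypothetical and is not the SoundSwallower API. The "features" are a single log-energy value per frame standing in for MFCCs, and the speech decision is a plain energy threshold with a hangover count, which is neither the WebRTC method nor the PocketSphinx 5prealpha one.

```typescript
// Hypothetical interface sketch; not the actual SoundSwallower API.
type FrontEndEvent =
  | { type: "speech_start"; time: number }
  | { type: "speech_end"; time: number }
  | { type: "features"; time: number; features: Float32Array };

interface FrontEndOptions {
  sampleRate: number;   // samples per second
  endpointing: boolean; // false = batch mode, every frame is passed through
  threshold: number;    // log-energy threshold (assumed, needs tuning)
  hangover: number;     // trailing non-speech frames before speech_end
}

class FusedFrontEnd {
  private opts: FrontEndOptions;
  private samplesSeen = 0;
  private inSpeech = false;
  private silentRun = 0;

  constructor(opts: FrontEndOptions) {
    this.opts = opts;
  }

  /** Stand-in for real feature extraction: one log-energy value per frame. */
  private extract(frame: Float32Array): Float32Array {
    let energy = 0;
    for (let i = 0; i < frame.length; i++) energy += frame[i] * frame[i];
    return new Float32Array([Math.log(energy / frame.length + 1e-10)]);
  }

  /** Process one frame; timestamps are in seconds from the start of input. */
  process(frame: Float32Array): FrontEndEvent[] {
    const events: FrontEndEvent[] = [];
    const time = this.samplesSeen / this.opts.sampleRate;
    this.samplesSeen += frame.length;
    const features = this.extract(frame);

    if (!this.opts.endpointing) {
      // Batch mode: nothing is ever removed from the input.
      events.push({ type: "features", time, features });
      return events;
    }
    const isSpeech = features[0] > this.opts.threshold;
    if (isSpeech) {
      if (!this.inSpeech) {
        this.inSpeech = true;
        events.push({ type: "speech_start", time });
      }
      this.silentRun = 0;
    } else if (this.inSpeech) {
      this.silentRun++;
    }
    if (this.inSpeech) {
      events.push({ type: "features", time, features });
      if (this.silentRun >= this.opts.hangover) {
        this.inSpeech = false;
        this.silentRun = 0;
        // The segment ends at the end of the current frame.
        const endTime = this.samplesSeen / this.opts.sampleRate;
        events.push({ type: "speech_end", time: endTime });
      }
    }
    return events;
  }
}
```

Because every event carries a timestamp relative to the start of the input, a caller doing force-alignment can still map results back to the original audio even when leading and trailing silence is not sent to the decoder. With `endpointing: false` the same object degenerates to a plain feature extractor, which is one way to keep the batch-mode API intact.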