dwyl / image-classifier

🖼️ Classify images and extract data from or describe their contents using machine learning

optimization: Improve the audio pipeline latency #80

Closed ndrean closed 3 months ago

ndrean commented 4 months ago

The code takes a lot of care to downsize the image to improve the ML latency.

But we (I) did not treat the audio with much care. This could be a next step to improve the latency.

For example, converting from stereo to mono and lowering the sampling rate to 16kHz.

Using ffmpeg:
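For example, something along these lines, where the file names are placeholders: `-ac 1` downmixes to a single channel and `-ar 16000` resamples to 16kHz:

    ffmpeg -i input.wav -ac 1 -ar 16000 output.wav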

Perhaps JavaScript offers some solutions. Something to explore.

LuchoTurtle commented 4 months ago

Interesting! I wouldn't have thought of that, awesome!

I don't know what impact reducing stereo to mono and lowering the sampling rate will have on the audio, or whether it will "lose information" before it's fed into the model, but it's a really interesting topic 👀

ndrean commented 3 months ago

@LuchoTurtle

So of course you can do this in JavaScript with the Web Audio API.

Some motivation here.

TL;DR:

The code uses a tiny library (import toWav from "audiobuffer-to-wav") to convert the decoded AudioBuffer into a WAV file.

// Add "stop" event handler for when the recording stops.
          mediaRecorder.addEventListener("stop", async () => {
            const audioBlob = new Blob(audioChunks);
            // update the source of the Audio tag for the user to listen to his audio
            audioElement.src = URL.createObjectURL(audioBlob);

            // create an AudioContext with a sampleRate of 16000
            const audioContext = new AudioContext({ sampleRate: 16000 });

            // async read the Blob as ArrayBuffer to feed the "decodeAudioData"
            const arrayBuffer = await audioBlob.arrayBuffer();
            // decodes the ArrayBuffer into the AudioContext format
            const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
            // converts the AudioBuffer into a WAV format
            const wavBuffer = toWav(audioBuffer);
            // builds a Blob to pass to the Phoenix.JS.upload
            const wavBlob = new Blob([wavBuffer], { type: "audio/wav" });
            // upload to the server via a chanel with the built-in Phoenix.JS.upload
            _this.upload("speech", [wavBlob]);
            //  close the MediaRecorder instance
            mediaRecorder.stop();
            // cleanups
            audioChunks = [];
            recordButton.classList.remove(...pulseGreen);
            recordButton.classList.add(...blue);
          });
        });
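For context, the snippet above runs inside a Phoenix LiveView hook (that is what `_this.upload` refers to). A minimal sketch of the surrounding setup; the element IDs and class lists are illustrative, not taken from the actual code:

    const Audio = {
      mounted() {
        const _this = this; // LiveView hook context, provides this.upload()
        const recordButton = document.getElementById("record");
        const audioElement = document.getElementById("audio");
        // illustrative Tailwind class lists for the button states
        const pulseGreen = ["animate-pulse", "bg-green-500"];
        const blue = ["bg-blue-500"];
        let audioChunks = [];

        recordButton.addEventListener("click", async () => {
          recordButton.classList.remove(...blue);
          recordButton.classList.add(...pulseGreen);
          const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
          const mediaRecorder = new MediaRecorder(stream);
          mediaRecorder.addEventListener("dataavailable", (e) => audioChunks.push(e.data));
          mediaRecorder.start();
          // ...register the "stop" handler shown above, then call
          // mediaRecorder.stop() when the user ends the recording...
        });
      },
    };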

‼️ Note that this was already solved by the Bumblebee team: https://github.com/elixir-nx/bumblebee/blob/main/examples/phoenix/speech_to_text.exs

Note that they take care of the chip endianness and average over the (mic) channels. I think we can safely use the code above and skip these cases, because (I believe) almost all PC chips are little-endian.
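For reference, a minimal sketch of the two things being skipped; the helper name is illustrative, not taken from the Bumblebee example:

    // true on virtually all consumer CPUs (x86, and ARM as usually configured)
    const littleEndian = new Uint8Array(new Uint32Array([1]).buffer)[0] === 1;

    // downmix an AudioBuffer to mono by averaging its channels
    function averageChannels(audioBuffer) {
      const n = audioBuffer.numberOfChannels;
      const mono = new Float32Array(audioBuffer.length);
      for (let c = 0; c < n; c++) {
        const data = audioBuffer.getChannelData(c);
        for (let i = 0; i < mono.length; i++) mono[i] += data[i] / n;
      }
      return mono;
    }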

I can do a PR with the code above if you want, but it's up to you whether to use instead the full code given in the Bumblebee example above.

LuchoTurtle commented 3 months ago

It seems like a good PR. You can open one if you want to; additions are always welcome! I think this is an adequate optimization, as it will take some load off the server :D

I'm working on a small PR to fix the deployment issues and the volume not being used (the path was wrong), so models will always be loaded from it without having to be re-downloaded after the app goes inactive. We were wrong about volumes wiping out data; they don't. :)

ndrean commented 3 months ago

OK. #81

Note that I did not push the optimisation further by converting the WAV into an MP3.

I believe this is out of scope for this application, as the main idea is to suggest possible optimisations for reducing the latency of ML processes.

Producing a complete mono 16kHz-sampled MP3 audio file (size-optimized for ML) would require using lamejs.
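For the record, a rough sketch of what that would look like, assuming the usual lamejs encoder API; the Float32-to-Int16 conversion and function names are illustrative:

    import lamejs from "lamejs";

    // convert Float32 samples in [-1, 1] to 16-bit PCM, as lamejs expects
    function toInt16(float32) {
      const int16 = new Int16Array(float32.length);
      for (let i = 0; i < float32.length; i++) {
        const s = Math.max(-1, Math.min(1, float32[i]));
        int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
      }
      return int16;
    }

    function encodeMp3(audioBuffer) {
      const samples = toInt16(audioBuffer.getChannelData(0)); // mono
      const encoder = new lamejs.Mp3Encoder(1, 16000, 128); // 1 channel, 16kHz, 128kbps
      const chunks = [];
      const body = encoder.encodeBuffer(samples);
      if (body.length > 0) chunks.push(body);
      const tail = encoder.flush();
      if (tail.length > 0) chunks.push(tail);
      return new Blob(chunks, { type: "audio/mpeg" });
    }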

LuchoTurtle commented 3 months ago

Closed with the merge of #81

ndrean commented 3 months ago

@LuchoTurtle

Check this: https://github.com/openai/whisper/discussions/870

We do it client-side, so we save some server CPU, which is not bad. I suggest we keep this approach.