sevenlayercookie opened 3 weeks ago
I currently pass all audio files to FFMpeg for several reasons:

- FFMpeg detects the file type not just based on the file extension, but also based on content, so if the file extension is `wav` but it's actually an mp3, it may still handle it
- I ask FFMpeg to encode the result directly in the floating point format I use internally for processing (32-bit float, non-interleaved). Although I can do the conversion in JavaScript, that would be less memory efficient at the moment - but I'll improve that in future versions (see the sketch after this list)
- FFMpeg can load files larger than 4 GiB (large ArrayBuffers are only supported in Node 22+). That limitation is already fixed in the new version I'm working on, which supports effectively arbitrary file read and write lengths
- I actually pass the input (encoded as wave) from memory to `whisper.cpp` using stdin, not as a file
- There is preprocessing and analysis I may do to the audio before passing it to whisper.cpp, like sample rate conversion (`whisper.cpp` requires 16 kHz mono, and the vast majority of real-world inputs aren't in that format!), voice activity detection (which can improve results significantly and shorten processing time for transcription), denoising, voice isolation, etc., so I can't just pass `whisper.cpp` a file
- The API for recognition operations gives back the original audio as raw samples, as part of the return value. The CLI can also transcode the original to one or more output types
- Recognition operations are also involved in several alignment engines, where having the original audio may be necessary even after the recognition is done
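To give a rough idea of the decode step (this is just an illustrative sketch using standard ffmpeg CLI flags, not Echogarden's actual code): spawn ffmpeg, have it write raw 16 kHz mono 32-bit float PCM to stdout, and collect the samples into a Float32Array.

```ts
import { spawn } from 'node:child_process'

// Minimal sketch, assuming only standard ffmpeg CLI flags (not Echogarden's
// actual code): decode an arbitrary input file to raw 16 kHz mono 32-bit
// float PCM on stdout and collect it into a Float32Array.
function decodeToFloat32Samples(inputPath: string): Promise<Float32Array> {
	return new Promise((resolve, reject) => {
		const ffmpegProcess = spawn('ffmpeg', [
			'-i', inputPath,   // ffmpeg detects the real format from content
			'-f', 'f32le',     // raw little-endian 32-bit float samples
			'-ac', '1',        // downmix to mono (whisper.cpp needs mono)
			'-ar', '16000',    // resample to 16 kHz (whisper.cpp needs 16 kHz)
			'pipe:1',          // write the raw samples to stdout
		])

		const stdoutChunks: Buffer[] = []

		ffmpegProcess.stdout.on('data', (chunk: Buffer) => stdoutChunks.push(chunk))
		ffmpegProcess.on('error', reject)

		ffmpegProcess.on('close', exitCode => {
			if (exitCode !== 0) {
				reject(new Error(`ffmpeg exited with code ${exitCode}`))
				return
			}

			// Copy into a fresh, aligned buffer before viewing the bytes as floats
			const rawBytes = Buffer.concat(stdoutChunks)
			resolve(new Float32Array(new Uint8Array(rawBytes).buffer))
		})
	})
}
```

With mono output the interleaved and non-interleaved layouts happen to coincide; for multichannel audio there would be an extra de-interleaving step.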
There are complex pipelines of operations here. Currently, the processing is done mostly in-memory. Often there are actually multiple copies of the same waveform in memory, at different sample rates or processing stages.
I have a wave encoder/decoder I wrote, which is used extensively internally. In the latest (unreleased) version, I made it more portable so it can run on more runtimes, like Web / Deno / Bun, etc. If I used it for wave files, it could save a little time compared to using ffmpeg, though not much relative to the duration of the whole recognition operation. I'd also need to enhance it with more streaming operations, format conversions, etc. before it would be usable enough for this.
In the future I'll work on reducing memory use by using a streaming approach in more places, but that would only be truly significant, in practice, for very large waveforms, like 3 hours or more. That would be an overall project in itself, not limited to only `whisper.cpp`.
I didn't realize all that was going into it! Makes sense to have it loaded in memory for all of that. Unfortunately, on my Raspberry Pi 4 with 8 GB of RAM, a 1.3 GB (11-hour) WAV file crashes the entire system at the ffmpeg step, so I've been breaking files down into pieces and running echogarden in a loop over all the pieces. Of course, the ffmpeg time is nothing compared to the transcription time, but it was just a thought. The streaming approach would be nice to solve this.
Getting everything streaming, either from disk (less likely) or, say, from a single in-memory copy of the audio in a compact form like 16-bit interleaved, would be great, but not very convenient to work with and pretty challenging to actually implement.
It's not going to make things faster; most likely it would be a bit slower. But the memory requirement would be reduced significantly for very long audio (for short audio there wouldn't be much difference).
There is some complexity in sample-rate conversion. For example, if I have an input that is 48000 Hz but need 16000 Hz for processing (common for recognition), a streaming approach could just look up a particular block of the audio and convert it as needed. The problem is that sample rate conversion usually needs a few more samples beyond that block (usually before it), so it would need to extract some extra samples outside the block, which adds quite a bit of complexity.
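To make the boundary issue concrete, here's a minimal sketch (illustrative only, with invented names, and using simple linear interpolation rather than a proper windowed-sinc filter). Even this trivial resampler has to carry one sample of history between blocks; a real converter's longer filter needs correspondingly more history or lookahead, which is where the complexity comes from.

```ts
// Illustrative sketch only (invented names; linear interpolation instead of a
// proper resampling filter): a streaming downsampler that carries one sample
// of history between blocks, because an output sample near a block edge can
// fall between the previous block's last input sample and the current block's
// first one.
class StreamingLinearResampler {
	private readPosition = 0   // fractional position, in input samples
	private lastSample = 0     // final sample of the previous block
	private hasHistory = false

	// step = inputRate / outputRate, e.g. 48000 / 16000 = 3
	constructor(private step: number) {}

	processBlock(block: Float32Array): Float32Array {
		// Prepend the carried-over sample so interpolation can cross the edge
		const historyLength = this.hasHistory ? 1 : 0
		const input = new Float32Array(historyLength + block.length)

		if (this.hasHistory) {
			input[0] = this.lastSample
		}

		input.set(block, historyLength)

		const outputSamples: number[] = []

		// Emit output samples while both interpolation neighbors are available
		while (this.readPosition + 1 < input.length) {
			const index = Math.floor(this.readPosition)
			const fraction = this.readPosition - index

			outputSamples.push(input[index] * (1 - fraction) + input[index + 1] * fraction)

			this.readPosition += this.step
		}

		// Carry the last sample over and rebase the position for the next block
		this.lastSample = input[input.length - 1]
		this.hasHistory = true
		this.readPosition -= input.length - 1

		return Float32Array.from(outputSamples)
	}
}
```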
It would require a lot of effort. The upcoming version actually takes some steps toward that, both for reducing memory requirements and for improving portability of the code to Web / Deno / Bun. Internal usage of Node `Buffer` objects has been eliminated, for example, which took many hours of work, as I needed to "polyfill" things like UTF-8, UTF-16, Base64 and Hex encodings as well - it's all done now, though.
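As an example of what that kind of replacement work looks like (a simplified sketch, not the actual implementation): hex encoding over a plain `Uint8Array`, with no Node-specific `Buffer` methods, so it runs anywhere.

```ts
// Simplified sketch (not the project's actual code): hex encoding/decoding
// over plain Uint8Array, avoiding Node-only helpers like buf.toString('hex')
// so the same code runs on Web / Deno / Bun.
function encodeHex(bytes: Uint8Array): string {
	let result = ''

	for (const byte of bytes) {
		result += byte.toString(16).padStart(2, '0')
	}

	return result
}

function decodeHex(hex: string): Uint8Array {
	const bytes = new Uint8Array(hex.length / 2)

	for (let i = 0; i < bytes.length; i++) {
		bytes[i] = parseInt(hex.substring(i * 2, i * 2 + 2), 16)
	}

	return bytes
}
```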
I also removed reliance on Node.js streams almost completely, so I now have custom methods to read from and write to disk incrementally, which could help here. There is a lot of other work that needs to be done; it's an ongoing project.
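A minimal sketch of the incremental-read idea, using only the standard fs/promises API (Echogarden's actual internal methods may differ):

```ts
import { open } from 'node:fs/promises'

// Minimal sketch using the standard fs/promises API (not Echogarden's actual
// internal methods): read a file chunk by chunk, without Node.js streams.
async function* readFileInChunks(filePath: string, chunkSize = 2 ** 20) {
	const fileHandle = await open(filePath, 'r')

	try {
		const chunkBuffer = new Uint8Array(chunkSize)
		let filePosition = 0

		while (true) {
			const { bytesRead } = await fileHandle.read(chunkBuffer, 0, chunkSize, filePosition)

			if (bytesRead === 0) {
				break
			}

			filePosition += bytesRead

			// Note: the buffer is reused; consumers must copy a chunk to keep it
			yield chunkBuffer.subarray(0, bytesRead)
		}
	}
	finally {
		await fileHandle.close()
	}
}
```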
Interesting that you're actually running it on a Raspberry Pi, which uses Linux for ARM - not a platform I've ever tested on or officially supported, but there's no real reason why it shouldn't work.
I wrote a brand new audio I/O addon (a precompiled C++ addon via N-API) as well, which is intended to completely remove the dependence on `sox` (currently problematic on macOS, as it's really buggy there). `sox` also requires a full in-memory copy of the audio to play it. The new I/O-based player streams from JavaScript memory directly to the audio hardware, and performs format conversions with no memory overhead at all, so things are already going in this direction. I haven't tested or considered compiling the player for ALSA on ARM. I'll see about that.
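Just to illustrate the design (these names are invented for the example, not the addon's real API), the player follows a pull model: the native side asks JavaScript for one small block at a time, so no second full copy of the audio is ever made.

```ts
// Hypothetical binding surface, for illustration only (invented names, not
// the addon's real API): a pull-based player where the native side requests
// one small block at a time from JavaScript memory.
interface NativeAudioPlayer {
	// Starts playback. `onNeedBlock` fills `target` with the next samples and
	// returns the number of sample frames written; returning 0 ends playback.
	start(
		sampleRate: number,
		channelCount: number,
		onNeedBlock: (target: Float32Array) => number
	): void

	stop(): void
}
```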
For transcription, I prefer to convert my source audio to a whisper-compatible .wav first and then run echogarden. However, when transcribing with echogarden, ffmpeg is called regardless of the source audio format (i.e. even when the source audio is already in a compatible format).

Could you implement a check that looks first to see if the source audio is in a compatible format before performing a transcode?

That, or include a flag that can disable ffmpeg altogether.
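Something along these lines might work as the check (just a sketch: it assumes the canonical 44-byte WAV header layout with the `fmt ` chunk at a fixed offset, which not every WAV file follows, and assumes "compatible" means 16 kHz mono 16-bit integer PCM):

```ts
import { open } from 'node:fs/promises'

// Sketch of the requested check (not an existing Echogarden feature): parse
// just the WAV header and skip the ffmpeg transcode when the file is already
// 16 kHz mono 16-bit integer PCM. Assumes the canonical header layout, where
// the 'fmt ' chunk starts at byte 12.
async function isWhisperCompatibleWav(filePath: string): Promise<boolean> {
	const fileHandle = await open(filePath, 'r')

	try {
		const header = new Uint8Array(44)
		const { bytesRead } = await fileHandle.read(header, 0, 44, 0)

		if (bytesRead < 44) {
			return false
		}

		const view = new DataView(header.buffer)
		const tag = (offset: number) =>
			String.fromCharCode(...header.subarray(offset, offset + 4))

		if (tag(0) !== 'RIFF' || tag(8) !== 'WAVE' || tag(12) !== 'fmt ') {
			return false
		}

		const audioFormat = view.getUint16(20, true)    // 1 = integer PCM
		const channelCount = view.getUint16(22, true)
		const sampleRate = view.getUint32(24, true)
		const bitsPerSample = view.getUint16(34, true)

		return audioFormat === 1 && channelCount === 1 &&
			sampleRate === 16000 && bitsPerSample === 16
	}
	finally {
		await fileHandle.close()
	}
}
```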