echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.
GNU General Public License v3.0

GPU acceleration + word level timestamps #62

smoores-dev commented 3 months ago

I just wanted to double-check something that I'm seeing while reading the docs.

It looks like there are two Whisper-based recognition engines:

whisper, which is a TypeScript implementation of Whisper that seems to support word-level timestamps, but only supports GPU acceleration via DirectML, which is Windows-only.

whisper.cpp, which uses the whisper.cpp CLI. That CLI has CUDA support (but maybe not OpenCL support anymore? It's not 100% clear), but maybe doesn't have word-level timestamps? It's not 100% clear to me whether the enableDTW flag enables word-level (or maybe just token-level?) timestamps or just improved segment-level timestamps. It also seems like echogarden doesn't expose whisper.cpp's max-len flag, which enables its built-in "experimental" word-level timestamps.

Basically, what I'm trying to figure out is whether there's any way to get both word-level timestamps and CUDA support (and ideally OpenCL, but that's optional, I suppose!) with Whisper transcription on Linux.

rotemdan commented 3 months ago

The whisper engine uses ONNX exports of the core encoder and decoder inference models, based on the reference OpenAI Python implementation. Everything else, including audio preprocessing, feature extraction, tokenization, etc., is done in TypeScript and WebAssembly.

Support for DirectML (a Windows-only technology) was added to the ONNX runtime for Node.js (onnxruntime-node) during the last few months. In the latest version, they also added support for CUDA, on Linux only. I did add it as an option in the latest version (called the cuda provider), but the documentation wasn't fully updated to reflect that (partly because I didn't get to test that it works correctly on Linux).

ONNX GPU support is currently unstable for some models, though. It also performs poorly for decoder models like Whisper's (and some LLMs), so it's enabled by default only for the Whisper encoder model (where it does improve performance significantly when a compatible GPU is available on Windows).
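
For reference, provider selection in onnxruntime-node looks roughly like this (a minimal standalone sketch, not echogarden's actual loading code; the model path is a placeholder):

    import { InferenceSession } from 'onnxruntime-node'

    // 'dml' is only available on Windows, 'cuda' only on Linux;
    // 'cpu' is the safe fallback that is always available
    const provider = process.platform === 'win32' ? 'dml' : 'cuda'

    const encoderSession = await InferenceSession.create('whisper-encoder.onnx', {
        // Requested execution providers are tried in order
        executionProviders: [provider, 'cpu'],
    })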

The whisper.cpp engine invokes the main executable of a whisper.cpp build. It does produce word-level timestamps by default, though they are not very accurate. The enableDTW option maps to an experimental internal whisper.cpp feature that was added in recent versions (it's called --dtw and is not related to --max-len). I have no part in it; it was written in C++ and contributed by the whisper.cpp community. From my tests a few months ago, it was too buggy to be usable (though I managed to work around some of its bugs) and not accurate enough for me to recommend (it had a consistent lag of 100 ms or more, if I recall correctly).

The whisper.cpp engine just calls a command-line executable, so you can specify any executable path you want in whisperCpp.executablePath, including builds with support for OpenCL or CoreML.

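For example, with the Node.js API, pointing at a custom build would look something like this (a sketch; the audio and executable paths are placeholders, and everything beyond the option names mentioned above is illustrative):

    import * as Echogarden from 'echogarden'

    const result = await Echogarden.recognize('speech.wav', {
        engine: 'whisper.cpp',

        whisperCpp: {
            // Placeholder path to a custom build, e.g. one compiled with OpenCL
            executablePath: '/opt/whisper.cpp/main',

            enableGPU: true,

            // Opt in to the experimental DTW-based token timestamps
            enableDTW: false,
        },
    })

    // The returned timeline includes the word events extracted from
    // whisper.cpp's output
    console.log(result.transcript, result.timeline)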

I didn't remember exactly how --max-len was passed, but the code shows this logic:

        // whisper.cpp uses the GPU by default, so pass --no-gpu
        // unless GPU support was explicitly enabled
        if (!options.enableGPU) {
            args.push(
                '--no-gpu'
            )
        }

        if (options.enableDTW) {
            args.push(
                // --max-len 0 is always passed, which keeps whisper.cpp's
                // word-level timestamps enabled
                '--max-len',
                '0',

                // Enable the experimental DTW-based token timestamps;
                // --dtw expects the model name in dotted form, like 'base.en'
                '--dtw',
                modelName.replaceAll('-', '.'),
            )
        } else {
            args.push(
                '--max-len',
                '0',
            )
        }

Basically, it always passes --max-len 0 and extracts the word events from the output it gives, so the "experimental" whisper.cpp word timestamps are always enabled (the recent DTW feature, --dtw, is not related to them).

Actually, I also convert the tokens returned by whisper.cpp to text in TypeScript (using the tiktoken library), which allows me to work around some issues with whisper.cpp's native de-tokenized output and ensure it works correctly (this path shares some tokenizer code with the TypeScript whisper engine).
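
A rough sketch of that idea (not echogarden's actual code; it assumes the English-only models' GPT-2-style byte-level BPE vocabulary, which the tiktoken package exposes as 'gpt2'):

    import { get_encoding } from 'tiktoken'

    // Decode raw Whisper token IDs to text in TypeScript, rather than relying
    // on whisper.cpp's own de-tokenized text output
    function tokensToText(tokenIds: number[]): string {
        const encoding = get_encoding('gpt2')

        try {
            // decode() returns the raw UTF-8 bytes for the token sequence;
            // decoding the whole sequence at once avoids splitting multi-byte
            // characters that can span token boundaries
            const bytes = encoding.decode(new Uint32Array(tokenIds))
            return new TextDecoder().decode(bytes)
        } finally {
            encoding.free()
        }
    }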