Open smoores-dev opened 3 months ago
The `whisper` engine uses an ONNX export of the core encoder and decoder inference model, based on the reference OpenAI Python implementation. Everything else, including audio preprocessing, feature extraction, tokenization, etc., is done in TypeScript and WebAssembly.
Support for DirectML (a Windows-only technology) was added to the ONNX runtime for Node.js (`onnxruntime-node`) during the last few months. In the latest version, they also added support for CUDA, on Linux only. I did add it as an option in the latest version (the `cuda` provider), but the documentation wasn't fully updated to reflect that (partially because I didn't get to test that it works correctly on Linux).

The ONNX GPU support is currently unstable for some models, though. It also performs poorly for decoder models like Whisper's (and some LLMs), so it's enabled by default only for the Whisper encoder model (and that does improve performance significantly when a compatible GPU is available on Windows).
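For reference, here's a minimal sketch of how you might select the engine and a GPU provider from the Node.js API. The `encoderProvider` option name and the exact option layout are assumptions based on the description above (not confirmed API names), so check the options reference for the real spelling:

```ts
import * as Echogarden from 'echogarden'

// Sketch only: 'encoderProvider' and the exact option layout are assumptions
// based on the description above, not confirmed API names.
const result = await Echogarden.recognize('speech.wav', {
	engine: 'whisper',
	whisper: {
		model: 'base',
		// Request the CUDA execution provider for the encoder only, since
		// GPU inference currently performs poorly for the decoder.
		encoderProvider: 'cuda',
	},
})

console.log(result.transcript)
```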
The `whisper.cpp` engine calls the `main` executable of a `whisper.cpp` binary. It does have word-level timestamps by default, though they are not very accurate. The `enableDTW` option enables an experimental internal `whisper.cpp` feature that was added in recent versions (it's called `--dtw` and is not related to `--max-len`). I have no part in it; it was written in C++ and contributed by the `whisper.cpp` community. From my tests a few months ago, it was too buggy to be usable (though I managed to work around some of its bugs) and not accurate enough for me to recommend (it had a consistent lag of 100 ms or more, if I recall correctly).
The `whisper.cpp` engine just calls a command-line executable. You can specify any executable path you want in `whisperCpp.executablePath`, including builds with support for OpenCL or CoreML:
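For example, a minimal sketch using the Node.js `recognize` API (the executable path and model name are placeholders; only `whisperCpp.executablePath`, `enableGPU` and `enableDTW` appear above, the rest of the option layout is assumed):

```ts
import * as Echogarden from 'echogarden'

const result = await Echogarden.recognize('speech.wav', {
	engine: 'whisper.cpp',
	whisperCpp: {
		model: 'base',
		// Point this at any whisper.cpp build, e.g. one compiled with
		// CUDA, OpenCL or CoreML support (placeholder path).
		executablePath: '/path/to/whisper.cpp/main',
		enableGPU: true,
	},
})

console.log(result.transcript)
```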
I didn't remember exactly how `--max-len` was passed, but the code shows this logic:
```ts
// Disable whisper.cpp's GPU usage unless explicitly enabled
if (!options.enableGPU) {
	args.push('--no-gpu')
}

// Always pass '--max-len 0'; when DTW is enabled, also pass '--dtw'
// with the model name in whisper.cpp's naming convention ('base.en', etc.)
if (options.enableDTW) {
	args.push(
		'--max-len', '0',
		'--dtw', modelName.replaceAll('-', '.'),
	)
} else {
	args.push(
		'--max-len', '0',
	)
}
```
Basically, it always passes `--max-len 0` and extracts the word events from the result it gives. So the "experimental" `whisper.cpp` word timestamps are always enabled (whisper.cpp's recent DTW feature, `--dtw`, is not related).
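To make that concrete, the resulting invocation looks roughly like the sketch below (paths are placeholders, and the real argument list includes more options, such as output-format flags, that are omitted here):

```ts
import { spawn } from 'node:child_process'

// Roughly the command the engine ends up running (sketch; paths are placeholders).
const args = [
	'-m', 'models/ggml-base.bin',  // whisper.cpp model file
	'-f', 'audio.wav',             // input audio
	'--max-len', '0',              // always passed by the engine
	'--no-gpu',                    // only added when enableGPU is false
]

const proc = spawn('/path/to/whisper.cpp/main', args, { stdio: 'inherit' })
proc.on('exit', code => console.log(`whisper.cpp exited with code ${code}`))
```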
Actually, I'm also doing the conversion of the tokens returned by `whisper.cpp` in TypeScript (using the `tiktoken` library), which allows me to work around some issues with the native de-tokenized output returned by `whisper.cpp` and ensure it works correctly (it shares some tokenizer code with the TypeScript implementation).
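As a rough illustration of that decoding step, here's a sketch using the `tiktoken` npm package. It assumes a GPT-2-style encoding, which is only an approximation of Whisper's actual tokenizer (the multilingual models use a different vocabulary, and the real implementation loads Whisper-specific token tables):

```ts
import { get_encoding } from 'tiktoken'

// Decode raw token IDs (as returned per-token by whisper.cpp) back into text.
// 'gpt2' is an approximation here; Whisper's tokenizer uses its own token tables.
function decodeTokens(tokenIds: number[]): string {
	const encoding = get_encoding('gpt2')

	try {
		// decode() returns the raw UTF-8 bytes for the token sequence
		const bytes = encoding.decode(new Uint32Array(tokenIds))

		return new TextDecoder().decode(bytes)
	} finally {
		// The WASM-backed encoding must be freed explicitly
		encoding.free()
	}
}
```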
I just wanted to double-check something that I'm seeing while reading the docs. It looks like there are two Whisper-based recognition engines:

- `whisper`, which is a TypeScript implementation of Whisper that seems to support word-level timestamps, but only supports GPU via DirectML, which is Windows-only.
- `whisper.cpp`, which uses the whisper.cpp CLI, which has CUDA support (but maybe not OpenCL support anymore? It's not 100% clear), but maybe doesn't have word-level timestamps? It's not 100% clear to me whether the `enableDTW` flag enables word-level (or maybe just token-level?) timestamps or just improved segment-level timestamps; it also seems like echogarden doesn't expose whisper.cpp's `max-len` flag, which enables the built-in "experimental" word-level timestamps.

Basically, what I'm trying to figure out is whether there's any way to have both word-level timestamps and CUDA support (and ideally OpenCL, but that's optional, I suppose!) with Whisper transcription on Linux.