argmaxinc / WhisperKit

On-device Inference of Whisper Speech Recognition Models for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License
2.85k stars, 235 forks

Incorrect word timestamp when using VAD #160

Open · finnvoor opened this issue 3 weeks ago

finnvoor commented 3 weeks ago

Transcribing the attached audio with base.en, wordTimestamps: true, and chunkingStrategy: .vad produces a word " [ Silence ]" with an incorrect timestamp (offset by ~30s from the surrounding words). I'm not sure whether this is caused by VAD itself or only happens with this particular audio when VAD chunking is enabled.

let pipe = try await WhisperKit(model: "base.en")
let result = try await pipe.transcribe(
    audioPath: "~/Downloads/Detail_20240606094159.m4a",
    decodeOptions: DecodingOptions(
        skipSpecialTokens: true,
        wordTimestamps: true,
        chunkingStrategy: .vad
    )
)
print(result.segments
    .compactMap(\.words)
    .flatMap { $0 }
    .map { "\($0.word), \($0.start), \($0.end)" }
    .joined(separator: "\n"))

outputs:

...
 scoop, 39.96, 40.36
 them, 40.559998, 40.62
 out., 40.62, 40.84
 [ Silence ], 70.36, 70.36
 Now,, 46.44, 46.5
 with, 46.859997, 46.899998
 some, 46.899998, 47.019997
...
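
To make the out-of-order entry easier to spot in longer transcripts, here is a minimal sketch that flags any word whose start time is later than the start time of the word that follows it. `TimedWord` and `indicesBeforeBackwardJump(in:)` are hypothetical names made up for this illustration, not WhisperKit API, and the sample data is just the seven lines printed above.

```swift
// Hypothetical stand-in for the (word, start, end) values printed above.
// This is not WhisperKit's own word timing type.
struct TimedWord {
    let word: String
    let start: Float
    let end: Float
}

/// Returns the index of every word that is immediately followed by a word
/// with an earlier start time, i.e. the word just before a backward jump.
func indicesBeforeBackwardJump(in words: [TimedWord]) -> [Int] {
    guard words.count > 1 else { return [] }
    return (0..<(words.count - 1)).filter { words[$0].start > words[$0 + 1].start }
}

// The excerpt from the output above, copied as-is.
let sample = [
    TimedWord(word: " scoop", start: 39.96, end: 40.36),
    TimedWord(word: " them", start: 40.559998, end: 40.62),
    TimedWord(word: " out.", start: 40.62, end: 40.84),
    TimedWord(word: " [ Silence ]", start: 70.36, end: 70.36),
    TimedWord(word: " Now,", start: 46.44, end: 46.5),
    TimedWord(word: " with", start: 46.859997, end: 46.899998),
    TimedWord(word: " some", start: 46.899998, end: 47.019997),
]

let suspects = indicesBeforeBackwardJump(in: sample)
for index in suspects {
    print("suspect word at index \(index): \(sample[index].word), start \(sample[index].start)")
}
```

For the excerpt above this flags only the " [ Silence ]" entry. The check cannot tell which side of a backward jump is actually wrong, so it is only a way to locate candidates, not a fix.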

Detail_20240606094159.m4a.zip

ZachNagengast commented 1 week ago

Hi @finnvoor, apologies for the delay. We are tracking these recent VAD issues and will roll the fixes into the next release, which will focus on correctness, memory, and energy use optimizations.