The following audio, when transcribed using base.en with wordTimestamps: true and chunkingStrategy: .vad, outputs a word " [ Silence ]" with an incorrect timestamp (~30s offset from the surrounding words). I'm not sure whether this is VAD related in general or only occurs in this audio with VAD chunking enabled.
Hi @finnvoor, apologies for the delay. We are tracking these recent VAD issues and will be rolling fixes for them into the next release, which will focus on correctness, memory, and energy use optimizations.
outputs:
Detail_20240606094159.m4a.zip