Incorrect word timestamp when using VAD

Transcribing the attached audio with base.en, wordTimestamps: true, and chunkingStrategy: .vad produces a word " [ Silence ]" with an incorrect timestamp (roughly 30 s later than the surrounding words). I'm not sure whether this is a general VAD problem or only occurs with this particular audio when VAD chunking is enabled.

let pipe = try await WhisperKit(model: "base.en")
let result = try await pipe.transcribe(
    audioPath: "~/Downloads/Detail_20240606094159.m4a",
    decodeOptions: DecodingOptions(
        skipSpecialTokens: true,
        wordTimestamps: true,
        chunkingStrategy: .vad
    )
)
print(result.segments
    .compactMap(\.words)
    .flatMap { $0 }
    .map { "\($0.word), \($0.start), \($0.end)" }
    .joined(separator: "\n"))

outputs:

...
 scoop, 39.96, 40.36
 them, 40.559998, 40.62
 out., 40.62, 40.84
 [ Silence ], 70.36, 70.36
 Now,, 46.44, 46.5
 with, 46.859997, 46.899998
 some, 46.899998, 47.019997
...
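
To make the problem easier to spot in longer transcripts, here is a minimal sketch that flags any word whose start time is later than the start of the word that follows it. `TimedWord` and `outOfOrderIndices` are hypothetical names used only for illustration and are not part of WhisperKit; the sample values are copied from the output above.

```swift
// Hypothetical stand-in for the word timing values printed above;
// this is not WhisperKit's own type.
struct TimedWord {
    let word: String
    let start: Float
    let end: Float
}

/// Returns the indices of words that start later than the word following them,
/// i.e. the places where the sequence of start times jumps backwards.
func outOfOrderIndices(in words: [TimedWord]) -> [Int] {
    guard words.count > 1 else { return [] }
    return (0..<(words.count - 1)).filter { words[$0].start > words[$0 + 1].start }
}

// Sample taken from the output in this report.
let sample = [
    TimedWord(word: " scoop", start: 39.96, end: 40.36),
    TimedWord(word: " them", start: 40.559998, end: 40.62),
    TimedWord(word: " out.", start: 40.62, end: 40.84),
    TimedWord(word: " [ Silence ]", start: 70.36, end: 70.36),
    TimedWord(word: " Now,", start: 46.44, end: 46.5),
    TimedWord(word: " with", start: 46.859997, end: 46.899998),
    TimedWord(word: " some", start: 46.899998, end: 47.019997),
]

let flagged = outOfOrderIndices(in: sample)
for index in flagged {
    let w = sample[index]
    print("Out of order: \(w.word), \(w.start), \(w.end)")
}
```

On this sample the check flags only the " [ Silence ]" entry. In a real run, the same check could presumably be applied to the flattened word list from the snippet at the top, after mapping each word to its text, start, and end.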

Detail_20240606094159.m4a.zip

argmaxinc / WhisperKit #160