argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License
3.17k stars 268 forks

Incorrect timestamps (0.5sec off) #105

Closed (finnvoor closed this 5 months ago)

finnvoor commented 5 months ago

The timings of segments/words are sometimes inaccurate. When the attached audio is transcribed (we’re using base.en, but it seems to happen with larger models too), a lot of the segments have start times ~0.5sec after their actual start times. In the example, the word “Like” in “Like before…” should begin at 13.6s but WhisperKit is giving us 14.12s. This is happening for 5 out of the 10 segments in this audio.

I noticed that the segment text contains a timestamp token with the accurate time (<|13.60|>), but the segment and its first word are reported as starting at 14.12 instead.

WhisperKit.TranscriptionSegment(….start: 14.12, end: 21.7, text: "<|13.04|><|13.60|> Like before, this balloon is still filled with mostly hydrogen. However, this time, about one third of it is oxygen.<|21.28|>", ..., words: Optional([WhisperKit.WordTiming(word: " Like", tokens: [4525], start: 14.12, end: 14.44, probability: 0.8)

When word timestamps are disabled, the segment gets a start time of 13.04, which doesn't account for all the silence.
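To illustrate the observation about the timing tokens, here is a minimal Python sketch that pulls the leading timestamp tokens out of a segment's text and uses the last one as a candidate start time. The helper names (`leading_timestamps`, `speech_start_from_tokens`) are hypothetical and not part of WhisperKit, and the sketch assumes the token values are already on the same timeline as the segment's `start` (Whisper's timestamp tokens are relative to the current 30-second window, so longer audio would need a window offset added).

```python
import re

# Whisper-style timestamp tokens look like <|13.60|> inside the decoded text.
TIMESTAMP_TOKEN = re.compile(r"<\|(\d+(?:\.\d+)?)\|>")


def leading_timestamps(text: str) -> list:
    """Return the timestamp tokens that appear before the first spoken text."""
    times = []
    pos = 0
    while True:
        match = TIMESTAMP_TOKEN.match(text, pos)  # anchored at pos
        if match is None:
            break
        times.append(float(match.group(1)))
        pos = match.end()
    return times


def speech_start_from_tokens(text: str, fallback: float) -> float:
    """Hypothetical helper: prefer the last leading timestamp token as the start."""
    times = leading_timestamps(text)
    return times[-1] if times else fallback


segment_text = (
    "<|13.04|><|13.60|> Like before, this balloon is still filled with mostly "
    "hydrogen. However, this time, about one third of it is oxygen.<|21.28|>"
)
print(speech_start_from_tokens(segment_text, fallback=14.12))  # 13.6
```

For the segment above, the two leading tokens are 13.04 and 13.60; the second one matches the start time measured by hand.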

out.m4a.zip

atiorh commented 5 months ago

Thanks for the report @finnvoor! We started relying on the accuracy of word timestamps in streaming mode too. This is important, so we will triage and address it.

atiorh commented 5 months ago

Low-hanging fruits:

ZachNagengast commented 5 months ago

Quick update, I've identified the issue and am putting together a patch for this now.

atiorh commented 5 months ago

@finnvoor Please confirm that this fixes your issue 🙏

finnvoor commented 5 months ago

@ZachNagengast @atiorh gave it a quick test and the start times seem much more precise, thanks for the quick improvement.

It does seem like this has made the end times of words/segments slightly worse though. Previously, the end times would sometimes include some silence (be too late), but they never seemed to include any of the last word, so were good for splitting after a word/sentence. Now it seems like it accounts for silence at the end better, but seems to go a bit too far and includes the end of the word. In the same example at ~4s the word "gas" used to end at 4.06, now ends at 3.62, but should end at ~3.8.

[Screenshot: Logic Pro - Untitled - Tracks]

We'll continue to test it a bit more today.

ZachNagengast commented 5 months ago

I see, good to know. We might be able to improve this with some VAD (shifting the end time to the last point where the sound level was above a threshold). However, this is also the same endpoint that openai/whisper gives for its word timestamps, so it might be a model issue, or it may need a bit more massaging to get perfect. There are many such so-called "hacks" in the main repo that could be improved.
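As a rough illustration of the VAD idea (moving the end time to the last point where the sound level was above a threshold), here is a minimal energy-based sketch in Python. It is not WhisperKit or openai/whisper code: the function name, the 10 ms frame size, and the 0.02 RMS threshold are all assumptions chosen for the example, and the search window is assumed to run from the word's start to the next word's start.

```python
def last_loud_time(samples, sample_rate, start, search_end,
                   threshold=0.02, frame_ms=10):
    """Return the end (in seconds) of the last frame in [start, search_end]
    whose RMS level exceeds `threshold`.

    `samples` is a sequence of floats in [-1, 1]. If no frame is loud enough,
    `search_end` is returned unchanged so the caller keeps its own estimate.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    first = int(start * sample_rate)
    last = min(len(samples), int(search_end * sample_rate))
    loud_end = None
    for begin in range(first, last, frame_len):
        frame = samples[begin:min(begin + frame_len, last)]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms > threshold:
            loud_end = min(begin + frame_len, last)
    if loud_end is None:
        return search_end
    return loud_end / sample_rate


# Synthetic check: 0.38 s of signal followed by silence, sampled at 1 kHz.
sample_rate = 1000
samples = [0.5] * 380 + [0.0] * 620
print(last_loud_time(samples, sample_rate, start=0.0, search_end=1.0))  # 0.38
```

A real implementation would likely need smoothing and a noise-adaptive threshold; a fixed threshold like this one only behaves well on clean recordings.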

For detail: the end time previously fell that far past the audio because we were including the punctuation token ".", which has non-zero length, as part of the word's end time. The fix removed that time entirely, so the word now ends exactly where the model thinks "gas" ends, before the punctuation. The next step may be to consider some middle ground where the punctuation counts for some time but not the full token, because it's not a spoken word. Open to ideas here too!
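One way to picture the "middle ground" is a weight applied to the duration of trailing punctuation tokens. The Python sketch below is only an illustration under assumptions: `TokenTiming`, `word_end`, and the 0.5 weight are hypothetical, and the 3.30 start time for " gas" is made up (the 3.62 and 4.06 end times come from the example discussed above).

```python
from dataclasses import dataclass


@dataclass
class TokenTiming:
    text: str
    start: float
    end: float


PUNCTUATION = set(".,!?;:")


def is_punctuation(text: str) -> bool:
    stripped = text.strip()
    return bool(stripped) and all(ch in PUNCTUATION for ch in stripped)


def word_end(tokens, punctuation_weight=0.5):
    """End time of a word, counting only part of its trailing punctuation.

    punctuation_weight=0 ignores punctuation time (behavior after the fix);
    punctuation_weight=1 counts it fully (behavior before the fix).
    """
    end = tokens[0].start
    trailing = 0.0
    for tok in tokens:
        if is_punctuation(tok.text):
            trailing += tok.end - tok.start
        else:
            end = tok.end
            trailing = 0.0  # only punctuation after the last spoken token counts
    return end + punctuation_weight * trailing


gas = [TokenTiming(" gas", 3.30, 3.62), TokenTiming(".", 3.62, 4.06)]
print(round(word_end(gas, punctuation_weight=0.0), 2))  # 3.62
print(round(word_end(gas, punctuation_weight=0.5), 2))  # 3.84
print(round(word_end(gas, punctuation_weight=1.0), 2))  # 4.06
```

With a weight of 0.5 the end lands near the hand-measured ~3.8 for this one word, but a single data point says nothing about whether a fixed weight generalizes.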

finnvoor commented 5 months ago

Got it, figured eventually we'd run into model limits. I think in our case I'll try just adding a small offset to the end since it seems pretty consistent, and in general adding silence is better than cutting words. VAD would be really nice but sounds a bit tricky to implement.
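A minimal sketch of the fixed-offset workaround, in Python for brevity: each word's end is pushed later by a small constant, but never past the next word's start or the end of the audio. The function name, the tuple layout, the 0.15 s offset, and the second word in the example are all assumptions for illustration.

```python
def pad_word_ends(words, offset=0.15, audio_duration=None):
    """Add `offset` seconds to each word's end time without overlapping the
    next word or running past the end of the audio.

    `words` is a list of (text, start, end) tuples sorted by start time.
    """
    padded = []
    for i, (text, start, end) in enumerate(words):
        # The next word's start (or the audio length) caps how far we can pad.
        limit = words[i + 1][1] if i + 1 < len(words) else audio_duration
        new_end = end + offset
        if limit is not None:
            new_end = min(new_end, limit)
        # Never move an end time earlier than the original estimate.
        padded.append((text, start, max(end, new_end)))
    return padded


words = [(" gas", 3.30, 3.62), (" Next", 3.70, 4.00)]
for text, start, end in pad_word_ends(words, offset=0.15, audio_duration=4.10):
    print(text, start, round(end, 2))
```

Here " gas" is capped at the next word's start (3.70) rather than receiving the full offset, and the last word is capped at the audio duration.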