Closed · finnvoor closed this 5 months ago
Thanks for the report @finnvoor! We started relying on the accuracy of word timestamps in streaming mode too. This is important, so we will triage and address it.
Quick update, I've identified the issue and am putting together a patch for this now.
@finnvoor Please confirm that this fixes your issue 🙏
@ZachNagengast @atiorh gave it a quick test and the start times seem much more precise, thanks for the quick improvement.
It does seem like this has made the end times of words/segments slightly worse, though. Previously, the end times would sometimes include some silence (be too late), but they never seemed to cut into the last word, so they were good for splitting after a word/sentence. Now it accounts for trailing silence better, but seems to go a bit too far and cuts into the end of the word. In the same example at ~4s, the word "gas" used to end at 4.06 and now ends at 3.62, but should end at ~3.8.
We'll continue to test it a bit more today.
I see, good to know. We might be able to improve this with some VAD (shift the end time to the last point where the sound level was above a threshold). However, this is also the same endpoint that openai/whisper gives for its word timestamps, so it might be a model issue, or it may need a bit more massaging to get perfect. There are many such so-called "hacks" in the main repo that could be improved.
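The VAD idea could be sketched roughly like this (a minimal Python illustration, not WhisperKit's actual Swift code; the frame size and energy threshold are assumed values that would need tuning):

```python
def rms(frame):
    # Root-mean-square energy of one frame of samples.
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def trim_end_to_voice(samples, sample_rate, start_s, end_s,
                      frame_ms=20, threshold=0.01):
    """Shift a word's end time back to the last frame in [start_s, end_s]
    whose energy exceeds the silence threshold. Returns start_s if the
    whole span is silent."""
    frame_len = int(sample_rate * frame_ms / 1000)
    start_i = int(start_s * sample_rate)
    end_i = min(int(end_s * sample_rate), len(samples))
    new_end = start_s
    for i in range(start_i, end_i, frame_len):
        frame = samples[i:i + frame_len]
        if frame and rms(frame) > threshold:
            new_end = min((i + len(frame)) / sample_rate, end_s)
    return new_end
```

An RMS gate like this is crude compared to a real VAD model, but it's cheap and only needs to refine an endpoint the model has already placed within a few hundred milliseconds.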
For detail: the reason the end time previously extended that far past the audio is that we were including the punctuation token ".", which has non-zero length, as part of the word's end time. The fix removed that time entirely, so the word now ends exactly where the model thinks "gas" ends, before the punctuation. A next step may be to consider some middle ground where the punctuation counts for some of its duration but not the full token, since it's not a spoken word. Open to ideas here too!
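One way to express that middle ground (a hypothetical Python sketch, not the actual fix; the 0.25 weight is an arbitrary assumption, and the token times below just mirror the "gas" example from this thread):

```python
PUNCTUATION = {".", ",", "!", "?", ";", ":"}

def merged_end(tokens, punct_weight=0.25):
    """tokens: list of (text, start_s, end_s) for one word plus any trailing
    punctuation tokens. A trailing punctuation token contributes only a
    fraction of its duration to the word's end time, instead of all of it
    (the old behavior) or none (the current fix)."""
    end = tokens[0][2]
    for text, start, stop in tokens[1:]:
        if text in PUNCTUATION:
            end = max(end, start + punct_weight * (stop - start))
        else:
            end = stop
    return end
```

With a 0.25 weight, "gas" at 3.20-3.62 followed by "." at 3.62-4.06 would end around 3.73, closer to the ~3.8 target than either extreme.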
Got it, figured eventually we'd run into model limits. I think in our case I'll try just adding a small offset to the end since it seems pretty consistent, and in general adding silence is better than cutting words. VAD would be really nice but sounds a bit tricky to implement.
The timings of segments/words are sometimes inaccurate. When the attached audio is transcribed (we’re using base.en, but it seems to happen with larger models too), a lot of the segments have start times ~0.5sec after their actual start times. In the example, the word “Like” in “Like before…” should begin at 13.6s but WhisperKit is giving us 14.12s. This is happening for 5 out of the 10 segments in this audio.
I noticed that the segment contains a timing token with an accurate time of 13.6, but it uses 14.12 instead.
When word timestamps are disabled, the segment gets a start time of 13.04, which doesn't account for all the silence.
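For context on the timing tokens mentioned above: openai/whisper encodes timestamps as special tokens in 0.02 s steps above a timestamp-begin token id, so a time like 13.6 corresponds to a fixed offset from that id. A rough Python sketch of the conversion (the id 50363 is what I believe the English-only tokenizer uses for `<|0.00|>`; treat it as an assumption, since the multilingual tokenizer differs):

```python
TIMESTAMP_BEGIN = 50363  # assumed id of <|0.00|> for the English-only tokenizer
TIME_PRECISION = 0.02    # seconds per timestamp-token step

def timestamp_token_to_seconds(token_id, segment_offset_s=0.0):
    """Convert a Whisper timestamp token id to absolute seconds,
    given the segment's offset into the full audio."""
    if token_id < TIMESTAMP_BEGIN:
        raise ValueError("not a timestamp token")
    return segment_offset_s + (token_id - TIMESTAMP_BEGIN) * TIME_PRECISION
```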
out.m4a.zip