argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License
3.17k stars, 268 forks

Some timing tokens are included in word timestamps #82

Closed (finnvoor closed 6 months ago)

finnvoor commented 6 months ago

When filtering out special tokens in addWordTimestamps, the timing token isn't removed from the word text of word timings that consist of a timing token followed by a hyphen. WordTiming.tokens correctly contains just [532], but WordTiming.word is "<|0.00|> -". This seems to occur most often when multiple people are talking over each other in a recording; I guess it's Whisper's way of trying to label speakers.
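To illustrate the mismatch described above, here is a minimal sketch of a workaround that strips timing-token text from a word string after the fact. It is not WhisperKit's actual filtering code: `WordTimingSketch`, `timingTokenPattern`, and `strippingTimingTokens(from:)` are hypothetical names, and the regex assumes timing tokens always look like `<|0.00|>`.

```swift
import Foundation

// Hypothetical stand-in for WhisperKit's WordTiming, reduced to the two
// fields discussed in this issue.
struct WordTimingSketch {
    var word: String
    var tokens: [Int]
}

// Assumed shape of a Whisper timing token in decoded text, e.g. "<|0.00|>" or "<|29.60|>".
let timingTokenPattern = "<\\|\\d+\\.\\d+\\|>"

// Removes every timing-token substring from the word text, leaving the rest untouched.
func strippingTimingTokens(from word: String) -> String {
    word.replacingOccurrences(of: timingTokenPattern,
                              with: "",
                              options: .regularExpression)
}

// The case from this issue: tokens is already correct, but word still carries the marker.
let leaked = WordTimingSketch(word: "<|0.00|> -", tokens: [532])
let cleaned = strippingTimingTokens(from: leaked.word)
print(cleaned) // prints " -"
```

Cleaning the string like this only hides the symptom; the proper fix would presumably keep WordTiming.word and WordTiming.tokens consistent inside addWordTimestamps itself.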

ZachNagengast commented 6 months ago

This seems like a bug in the token filtering. I will look into it, and if you have any audio files that reproduce it, that would be helpful as well!

finnvoor commented 6 months ago

Here's an example that has some "<|0.00|> [" words at 0.0, 29.6, 43.02, 45.6, 367.6, and 397.6. Transcribed using base.en, skipSpecialTokens: true, wordTimestamps: true.

Detail_20240327143246.m4a.zip
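For anyone checking their own output for the same problem, a rough sketch of a scan over word timings might look like the following. `TimedWord` and `wordsWithLeakedTimingTokens` are hypothetical names rather than WhisperKit API, and the sample array is hand-written for illustration, not real output from the attached file.

```swift
import Foundation

// Hypothetical, simplified record; WhisperKit's real WordTiming has more fields.
struct TimedWord {
    let word: String
    let start: Double
}

// Returns the words whose text still contains a timing token such as "<|0.00|>".
func wordsWithLeakedTimingTokens(_ words: [TimedWord]) -> [TimedWord] {
    let pattern = "<\\|\\d+\\.\\d+\\|>"
    return words.filter {
        $0.word.range(of: pattern, options: .regularExpression) != nil
    }
}

// Hand-written sample data for illustration, not real transcription output.
let sample = [
    TimedWord(word: "<|0.00|> [", start: 0.0),
    TimedWord(word: " Hello", start: 0.4),
    TimedWord(word: "<|0.00|> [", start: 29.6),
]

let leakedWords = wordsWithLeakedTimingTokens(sample)
for w in leakedWords {
    print("leaked timing token at \(w.start): \(w.word)")
}
```

With skipSpecialTokens: true, the expectation is that this scan returns an empty array once the filtering bug is fixed.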

ZachNagengast commented 6 months ago

Interesting, I'm able to replicate this, thanks! Also a great example for testing in general 👍