Closed ZachNagengast closed 7 months ago
It seems like skipSpecialTokens isn't respected when using word-level timestamps; I still get a bunch of <|22.42|> timing tokens. Not sure if this is intentional or not.
The word timings seem to be one word behind where they should be. I could be doing something wrong, but just matching the WordTiming start and end to video + audio, the timing seems to be super precise but exactly 1 word behind. You can see it in this video if you unmute and watch the highlighted words.
https://github.com/argmaxinc/WhisperKit/assets/8284016/e73a2ce6-2760-499f-8a37-9bc3443ddcc9
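For readers who want to reproduce this kind of check, here is a minimal sketch of matching word timings against a playback position. The `TimedWord` struct and `activeWordIndex` function are hypothetical stand-ins for illustration, not WhisperKit's actual `WordTiming` type or API, and the sample timings are made up.

```swift
import Foundation

// Hypothetical stand-in for a word timing; NOT WhisperKit's actual `WordTiming` type.
struct TimedWord {
    let word: String
    let start: Double  // seconds from the beginning of the audio
    let end: Double    // seconds from the beginning of the audio
}

/// Returns the index of the word whose [start, end) interval contains `time`,
/// or nil when the playback position falls in a gap between words.
func activeWordIndex(in words: [TimedWord], at time: Double) -> Int? {
    words.firstIndex { time >= $0.start && time < $0.end }
}

// Made-up sample data for illustration only.
let sampleWords = [
    TimedWord(word: "Hello", start: 0.00, end: 0.40),
    TimedWord(word: "world", start: 0.45, end: 0.90),
]

if let index = activeWordIndex(in: sampleWords, at: 0.5) {
    print("Highlight:", sampleWords[index].word)
}
```

If the highlighted word is consistently the one before the word being spoken, the timings themselves are likely shifted by one entry rather than the lookup being wrong.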
@finnvoor Thanks for giving this an early look, these comments are super helpful. I'm really impressed you were able to get that video output working, thanks for sharing - fascinating to see it with your overlay.
> It seems like skipSpecialTokens isn't respected when using word-level timestamps; I still get a bunch of <|22.42|> timing tokens. Not sure if this is intentional or not.
Open question - I agree that skipSpecialTokens should remove them, but also wondering if they should just be removed by default? I.e. perhaps someone wants special tokens in the text responses, but not in the word timings.
> The word timings seem to be one word behind where they should be. I could be doing something wrong, but just matching the WordTiming start and end to video + audio, the timing seems to be super precise but exactly 1 word behind. You can see it in this video if you unmute and watch the highlighted words.
This appears to be correct; I will investigate. As mentioned in another comment, the punctuation for contractions is also a little off because it's being combined with the next word instead of the current word. I'll continue to refine this, but I suspect these are related. Will report back soon.
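To illustrate the contraction issue, here is a rough sketch of attaching a fragment such as `'t` to the preceding word rather than the following one. This is an assumption about the general shape of such a heuristic, not the fix that landed in this PR; `WordSpan` and `mergingTrailingFragments` are hypothetical names, and the sample timings are made up.

```swift
import Foundation

// Hypothetical word span used only for this sketch.
struct WordSpan: Equatable {
    var text: String
    var start: Double
    var end: Double
}

/// Appends fragments that begin with an apostrophe or trailing punctuation
/// to the previous word and extends that word's end time to cover them.
func mergingTrailingFragments(_ spans: [WordSpan]) -> [WordSpan] {
    let trailing: Set<Character> = ["'", ".", ",", "!", "?", ":", ";"]
    var merged: [WordSpan] = []
    for span in spans {
        if let first = span.text.first, trailing.contains(first), !merged.isEmpty {
            // Fragment belongs to the word that came before it.
            merged[merged.count - 1].text += span.text
            merged[merged.count - 1].end = span.end
        } else {
            merged.append(span)
        }
    }
    return merged
}

// Made-up sample data; word tokens carry a leading space, fragments do not.
let rawSpans = [
    WordSpan(text: " I", start: 0.0, end: 0.1),
    WordSpan(text: " don", start: 0.1, end: 0.3),
    WordSpan(text: "'t", start: 0.3, end: 0.4),
    WordSpan(text: " know", start: 0.4, end: 0.7),
    WordSpan(text: ".", start: 0.7, end: 0.7),
]
let mergedSpans = mergingTrailingFragments(rawSpans)
print(mergedSpans.map { $0.text })
```

Merging backward keeps the contraction's timing inside the word it belongs to, which is the behavior described as missing above.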
This is amazing @finnvoor, thanks for the review!
@finnvoor Just pushed a fix for some of the issues you reported.
Here's a short clip of the properly aligned word subtitles. As suspected, they were off by one previously.
This latest commit should also handle contractions much better.
https://github.com/argmaxinc/WhisperKit/assets/1981179/0cf435a8-3843-4146-9284-21503979aa58
Based on your feedback (and anyone else's) I will also adjust how the special tokens are handled in the word timestamps too.
looks much better now!
https://github.com/argmaxinc/WhisperKit/assets/8284016/3bfc1b79-8e01-4e2b-bd14-ecd86ca49d57
I can't imagine there's much use for having word-level timings for special tokens, since they aren't really associated with time in audio. I think every use of Whisper I've seen has filtered out special tokens anyway.
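Filtering of this kind can be sketched as follows. This is a text-based illustration that assumes special tokens are rendered in the `<|...|>` form seen above; an actual implementation might instead compare token IDs against the tokenizer's special-token range. `TimedToken`, `isSpecialToken`, and `removingSpecialTokens` are hypothetical names, not part of WhisperKit.

```swift
import Foundation

// Hypothetical timed token used only for this sketch.
struct TimedToken {
    let text: String
    let start: Double
    let end: Double
}

/// Treats any token rendered as `<|...|>` (for example `<|22.42|>`) as special.
/// Assumption: special tokens always appear in this textual form.
func isSpecialToken(_ text: String) -> Bool {
    let trimmed = text.trimmingCharacters(in: .whitespaces)
    return trimmed.hasPrefix("<|") && trimmed.hasSuffix("|>")
}

/// Drops special tokens so only spoken words keep timings.
func removingSpecialTokens(_ tokens: [TimedToken]) -> [TimedToken] {
    tokens.filter { !isSpecialToken($0.text) }
}

// Made-up sample data for illustration only.
let rawTokens = [
    TimedToken(text: "<|22.42|>", start: 22.42, end: 22.42),
    TimedToken(text: " Hello", start: 22.42, end: 22.80),
    TimedToken(text: " world", start: 22.80, end: 23.20),
    TimedToken(text: "<|endoftext|>", start: 23.20, end: 23.20),
]
let spokenTokens = removingSpecialTokens(rawTokens)
print(spokenTokens.map { $0.text })
```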
@finnvoor Thanks for the feedback, the word timings will no longer include special tokens. I also added some of the heuristics from the OpenAI reference repo. I did notice that large-v3 was giving some pretty off results (lots of zero-length words) but v2 was fine, so that's something to keep in mind.
This is intended as an initial "functional" PR. Example code and usage guidelines will be coming as a fast-follow. Will also update the default models on huggingface to have the appropriate outputs. In the meantime, you can use this CLI script to test out the flow:
Download tiny.en (the only model with alignment weights currently)
Transcribe with the --word-timestamps flag
Outputs the following JSON:
https://gist.github.com/ZachNagengast/f36a751bc68a3b5f2c41ada8bcc33746
Resolves #2