argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License

Support Word Timestamps #38

Closed ZachNagengast closed 7 months ago

ZachNagengast commented 7 months ago

This is intended as an initial "functional" PR. Example code and usage guidelines will be coming as a fast-follow. I will also update the default models on Hugging Face to have the appropriate outputs. In the meantime, you can use this CLI script to test out the flow (a rough API-level sketch follows the example output below):

  1. Download tiny.en (currently the only model with alignment weights)

    make download-model MODEL=tiny.en
  2. Transcribe with --word-timestamps flag

    swift run transcribe --word-timestamps \
    --model-path "Models/whisperkit-coreml/openai_whisper-tiny.en" \
    --audio-path ~/Downloads/ted_60.wav \
    --report \
    --report-path ~/Downloads \
    --verbose

    This outputs the following JSON:

https://gist.github.com/ZachNagengast/f36a751bc68a3b5f2c41ada8bcc33746
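
For anyone wanting to try this from the library rather than the CLI, here is a minimal sketch of what the equivalent call could look like. The parameter and property names used here (modelFolder, wordTimestamps, segments, words) are assumptions based on this PR and may differ from the final API:

    import WhisperKit

    // Minimal sketch only: names below are assumed from this PR, not final.
    func printWordTimings() async throws {
        // Point at the same model folder used in the CLI steps above.
        let pipe = try await WhisperKit(modelFolder: "Models/whisperkit-coreml/openai_whisper-tiny.en")

        var options = DecodingOptions()
        options.wordTimestamps = true  // library-side equivalent of --word-timestamps

        let result = try await pipe.transcribe(audioPath: "ted_60.wav", decodeOptions: options)

        for segment in result?.segments ?? [] {
            for word in segment.words ?? [] {
                print("\(word.word) [\(word.start) - \(word.end)]")
            }
        }
    }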

Resolves #2

finnvoor commented 7 months ago

It seems like skipSpecialTokens isn't respected when using word-level timestamps; I still get a bunch of <|22.42|> timing tokens. Not sure if this is intentional or not.
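
A caller-side workaround for now could be to drop any timing whose text looks like a special token. A quick sketch, assuming WordTiming exposes the word text as a word string property:

    import Foundation
    import WhisperKit

    // Drop word timings whose text is a special token such as <|22.42|> or
    // <|endoftext|>. Assumes WordTiming has a `word` string property.
    func stripSpecialTokens(from words: [WordTiming]) -> [WordTiming] {
        words.filter { timing in
            let text = timing.word.trimmingCharacters(in: .whitespaces)
            return !(text.hasPrefix("<|") && text.hasSuffix("|>"))
        }
    }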

finnvoor commented 7 months ago

The word timings seem to be one word behind where they should be. I could be doing something wrong, but just matching the WordTiming start and end to the video and audio, the timing seems super precise yet exactly one word behind. You can see it in this video if you unmute and watch the highlighted words.

https://github.com/argmaxinc/WhisperKit/assets/8284016/e73a2ce6-2760-499f-8a37-9bc3443ddcc9

ZachNagengast commented 7 months ago

@finnvoor Thanks for giving this an early look; these comments are super helpful. I'm really impressed you were able to get that video output working. Thanks for sharing; it's fascinating to see it with your overlay.

> It seems like skipSpecialTokens isn't respected when using word-level timestamps; I still get a bunch of <|22.42|> timing tokens. Not sure if this is intentional or not.

Open question: I agree that skipSpecialTokens should remove them, but I'm also wondering whether they should just be removed by default. That is, perhaps someone wants special tokens in the text response but not in the word timings.

> The word timings seem to be one word behind where they should be. I could be doing something wrong, but just matching the WordTiming start and end to the video and audio, the timing seems super precise yet exactly one word behind. You can see it in this video if you unmute and watch the highlighted words.

This appears to be correct; I will investigate. As mentioned in another comment, the punctuation for contractions is also a little off because it's being combined with the next word instead of the current one. I suspect these issues are related and will continue to refine this. Will report back soon.
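
For context, the general idea for the contraction fix is to fold suffix pieces back into the preceding word instead of letting them attach to the following one. A rough illustrative sketch of that idea (using a stand-in struct rather than the real WordTiming, and not the actual patch):

    import Foundation

    // Stand-in for WhisperKit's WordTiming so the sketch doesn't depend on
    // that type's exact fields or mutability.
    struct Word {
        var word: String
        var start: Float
        var end: Float
    }

    // Fold suffix pieces like "'t" or trailing punctuation into the preceding
    // word rather than the following one.
    func mergeSuffixes(_ words: [Word]) -> [Word] {
        var merged: [Word] = []
        for piece in words {
            let trimmed = piece.word.trimmingCharacters(in: .whitespaces)
            let isSuffix = trimmed.hasPrefix("'") || [",", ".", "?", "!"].contains(trimmed)
            if isSuffix, var previous = merged.popLast() {
                previous.word += trimmed
                previous.end = piece.end  // extend the previous word over the suffix
                merged.append(previous)
            } else {
                merged.append(piece)
            }
        }
        return merged
    }

With this, a sequence like "I", "don", "'t", "know" collapses to "I", "don't", "know", with the contraction keeping the earlier word's start time.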

atiorh commented 7 months ago

This is amazing @finnvoor, thanks for the review!

ZachNagengast commented 7 months ago

@finnvoor Just pushed a fix for some of the issues you reported.

Here's a short clip of the properly aligned word subtitles; as suspected, they were one word off previously.

This latest commit should also handle contractions much better.

https://github.com/argmaxinc/WhisperKit/assets/1981179/0cf435a8-3843-4146-9284-21503979aa58

Based on your feedback (and anyone else's), I will also adjust how special tokens are handled in the word timestamps.

finnvoor commented 7 months ago

Looks much better now!

https://github.com/argmaxinc/WhisperKit/assets/8284016/3bfc1b79-8e01-4e2b-bd14-ecd86ca49d57

I can't imagine there's much use for having word-level timings for special tokens, since they aren't really associated with a time in the audio. Every use of Whisper I've seen has filtered out special tokens anyway.

ZachNagengast commented 7 months ago

@finnvoor Thanks for the feedback; these will no longer include special tokens. I also added some of the heuristics from the OpenAI reference repo. I did notice that large-v3 was giving some pretty off results (lots of zero-length words) while v2 was fine, so that's something to keep in mind.
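
For reference, the heuristics are in the spirit of the reference implementation's duration constraints: clamp words that run much longer than the median word duration and give zero-length words a small nominal span. An illustrative sketch of that idea (constants are placeholders, and this returns adjusted spans rather than reproducing the actual WhisperKit code):

    import WhisperKit

    // Illustrative only: constrain word durations against the median, in the
    // spirit of the OpenAI reference repo. Returns adjusted (start, end) pairs
    // instead of modifying WordTiming, whose fields are assumed to be Floats.
    func constrainedSpans(for words: [WordTiming]) -> [(start: Float, end: Float)] {
        let durations = words.map { $0.end - $0.start }.filter { $0 > 0 }.sorted()
        guard !durations.isEmpty else {
            return words.map { (start: $0.start, end: $0.end) }
        }

        let median = durations[durations.count / 2]
        let maxDuration = median * 2

        return words.map { word -> (start: Float, end: Float) in
            let duration = word.end - word.start
            if duration > maxDuration {
                // Truncate unusually long words instead of letting them swallow a pause.
                return (start: word.start, end: word.start + maxDuration)
            } else if duration <= 0 {
                // Give zero-length words a small visible span.
                return (start: word.start, end: word.start + min(0.1, median))
            }
            return (start: word.start, end: word.end)
        }
    }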