argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License

Improve token timestamps and language detection #114

Closed by ZachNagengast 5 months ago

ZachNagengast commented 5 months ago

This addresses a couple of issues:

  1. Word-level timestamps were slightly off, as noticed in #105.
  2. Language detection was not easy to use in conjunction with prefill or prompt tokens, as noticed by Diirge on Discord.

The word timestamps still do not use a median filter, but they line up quite well without it. With these changes, the main remaining differences are in where words start; most of the word endings are perfectly in line.
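For context on the median filter mentioned above, here is a minimal sketch of what such a smoothing step does to a 1-D sequence of weights before alignment. It is not WhisperKit code and not a copy of the openai/whisper implementation; the function name `median_filter`, the default `width`, and the reflect-padding choice are all assumptions made for illustration.

```python
from statistics import median


def median_filter(values, width=3):
    """Replace each sample with the median of a window centred on it.

    Hypothetical helper for illustration only; `width` must be odd.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    pad = width // 2
    values = list(values)
    # Nothing to smooth, or too short to reflect-pad: return unchanged.
    if pad == 0 or len(values) <= pad:
        return values
    # Reflect padding: mirror the neighbours of each edge sample
    # (the edge sample itself is not repeated).
    left = values[pad:0:-1]
    right = values[-2:-pad - 2:-1]
    padded = left + values + right
    # Slide a window of `width` samples and take its median.
    return [median(padded[i:i + width]) for i in range(len(values))]


# A single-frame spike is removed, while a monotonic ramp keeps its shape.
spike = median_filter([0.1, 0.1, 0.9, 0.1, 0.1], width=3)
ramp = median_filter([1, 2, 3, 4, 5], width=3)
```

A filter like this suppresses isolated spikes, which can shift where a word boundary is placed, so skipping it is one plausible source of small differences in word start times.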

Here are some comparisons using the audio provided in #105 (top is ours, bottom is from HEAD of the openai/whisper Python repo):

WhisperKit better starting point:

image

OpenAI better starting point:

image

Will continue to refine these over time. Thanks @finnvoor for finding this and providing a great example to replicate.