Improve token timestamps and language detection

This addresses a couple of issues:

Word-level timestamps were slightly off, as noticed in #105.
Language detection was not easy to use in conjunction with prefill or prompt tokens, as noticed by Diirge on Discord.

The word timestamps still do not use a median filter, but they line up quite well without it. With these changes, the main remaining differences are in when words start; most of the word endings are perfectly in line.
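For context on what is being skipped: a median filter replaces each value in a sequence with the median of its neighbors, which suppresses isolated spikes before alignment. The sketch below is a generic 1-D version in Python, not the WhisperKit or openai/whisper implementation. The function name `median_filter`, the default `width` of 7, and the reflect padding at the edges are all assumptions made for illustration.

```python
from statistics import median


def median_filter(values, width=7):
    """Smooth a 1-D sequence with a sliding median of odd `width`.

    Hypothetical helper for illustration only.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    pad = width // 2
    # Nothing to smooth, or too short to reflect-pad: return a copy.
    if pad == 0 or len(values) <= pad:
        return list(values)
    # Reflect-pad (without repeating the edge value) so the output
    # has the same length as the input.
    left = list(values[pad:0:-1])
    right = list(values[-2:-pad - 2:-1])
    padded = left + list(values) + right
    # Take the median of each window centered on an original element.
    return [median(padded[i:i + width]) for i in range(len(values))]


# A single spike is removed; a monotonic ramp is mostly preserved.
spike = median_filter([0, 0, 9, 0, 0], width=3)
ramp = median_filter([1, 2, 3, 4, 5], width=3)
```

A filter like this would presumably be applied along the time axis of the attention weights before words are aligned to frames, which is why its absence mostly shows up as small shifts in word start times.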

Here are some comparisons using the audio provided in #105 (top is ours, bottom is from HEAD of the openai/whisper Python repo):

WhisperKit better starting point:

OpenAI better starting point:

Will continue to refine these over time. Thanks @finnvoor for finding this and providing a great example to replicate.
