argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License

Segment order regression since 10mb chunking #163

Status: Closed (iandundas closed 1 week ago)

iandundas commented 1 month ago

Since https://github.com/argmaxinc/WhisperKit/pull/158 was merged, we're seeing segments being delivered in the wrong order, including in the example app.

Before, at 8fcfadbe37 (correct): [screenshot]
After, at 25a07498 (incorrect): [screenshot]

Settings: [screenshot]

Sample file: http://172.104.253.215/atp-7-min-clip.m4a

Full transcripts:

Full correct transcript Full incorrect transcript

ZachNagengast commented 1 month ago

Looking at this shortly. Do you have any sense of which parts specifically changed between the two? That might give a clue.

iandundas commented 1 month ago

I don't have a great handle on it; it seems completely reordered, and some segments are missing.

For example, in the correct transcription the word "easter" occurs once:

[WhisperKit] [Segment 115] [474.04 --> 476.70] So, you know Easter just happened in.

Whilst in the bad transcription it appears four times:

[screenshot: CleanShot 2024-06-13 at 12 49 36@2x]

Meanwhile, the first line of the good transcription contains:

[WhisperKit] [Segment 0] [0.00 --> 30.00] Do you have also just finishing listening to the hot pockets episode?

This doesn't appear in the bad transcription at all.
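
For anyone wanting to repeat this comparison without eyeballing the logs, here is a rough Python sketch. It assumes both transcripts are plain-text logs in the `[WhisperKit] [Segment N] [start --> end] text` shape quoted above; the helper names (`parse_segments`, `count_word`, `missing_texts`) are made up for illustration and are not part of WhisperKit.

```python
import re
from typing import List, NamedTuple

# Matches the log shape quoted in this thread, e.g.
# [WhisperKit] [Segment 115] [474.04 --> 476.70] So, you know Easter just happened in.
# Assumption: every segment line in the log follows this shape.
SEGMENT_RE = re.compile(
    r"\[Segment (\d+)\]\s*\[(\d+(?:\.\d+)?) --> (\d+(?:\.\d+)?)\]\s*(.*)"
)


class Segment(NamedTuple):
    index: int
    start: float
    end: float
    text: str


def parse_segments(log: str) -> List[Segment]:
    """Extract every segment line from a log; other lines are skipped."""
    segments = []
    for line in log.splitlines():
        m = SEGMENT_RE.search(line)
        if m:
            segments.append(
                Segment(int(m[1]), float(m[2]), float(m[3]), m[4].strip())
            )
    return segments


def count_word(segments: List[Segment], word: str) -> int:
    """Count case-insensitive whole-word occurrences across all segment texts."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return sum(len(pattern.findall(s.text)) for s in segments)


def missing_texts(good: List[Segment], bad: List[Segment]) -> List[Segment]:
    """Segments from `good` whose exact text never appears in `bad`."""
    bad_texts = {s.text for s in bad}
    return [s for s in good if s.text not in bad_texts]


# Usage, assuming the two transcripts are saved locally under these names:
# good = parse_segments(open("good.txt").read())
# bad = parse_segments(open("bad.txt").read())
# print(count_word(good, "easter"), count_word(bad, "easter"))
# for s in missing_texts(good, bad):
#     print("missing:", s.index, s.text)
```

Exact text matching is deliberately strict: a segment whose wording shifted slightly between runs is reported as missing, so treat the output as a starting point rather than a verdict.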

iandundas commented 1 month ago

Attachments: good.txt, bad.txt
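
To pin down where the ordering breaks, one option is to scan a transcript for segments that start earlier than the segment delivered just before them. The Python below is only a sketch under the assumption that the attached files use the `[Segment N] [start --> end]` log shape quoted earlier in the thread; `parse_times` and `out_of_order` are hypothetical helper names, not WhisperKit API.

```python
import re
from typing import List, Tuple

# Assumption: segment lines look like
# [WhisperKit] [Segment 115] [474.04 --> 476.70] ...
SEGMENT_RE = re.compile(
    r"\[Segment (\d+)\]\s*\[(\d+(?:\.\d+)?) --> (\d+(?:\.\d+)?)\]"
)

# (segment index, start seconds, end seconds)
Timing = Tuple[int, float, float]


def parse_times(log: str) -> List[Timing]:
    """Collect (index, start, end) for each segment line, in delivery order."""
    timings = []
    for line in log.splitlines():
        m = SEGMENT_RE.search(line)
        if m:
            timings.append((int(m[1]), float(m[2]), float(m[3])))
    return timings


def out_of_order(timings: List[Timing]) -> List[Tuple[int, Timing, Timing]]:
    """Return (position, previous, current) wherever a segment starts
    before the segment that was delivered immediately ahead of it."""
    problems = []
    for pos in range(1, len(timings)):
        prev, cur = timings[pos - 1], timings[pos]
        if cur[1] < prev[1]:
            problems.append((pos, prev, cur))
    return problems


# Usage, assuming the attachment is saved locally as bad.txt:
# for pos, prev, cur in out_of_order(parse_times(open("bad.txt").read())):
#     print(f"line {pos}: segment {cur[0]} starts at {cur[1]} "
#           f"after segment {prev[0]} at {prev[1]}")
```

Running the same check on good.txt should report nothing if that transcript really is in chronological order, which makes the two outputs easy to compare.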