argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
http://argmaxinc.com/blog/whisperkit
MIT License

Decreasing Speed and Delayed Confirmation in Stream Transcription Over Time #198

Open gavin1818 opened 3 months ago

gavin1818 commented 3 months ago

I’ve been using WhisperKit for real-time stream transcription in a project, and I’ve noticed that transcription speed decreases noticeably over time, particularly after 20-30 minutes of continuous use. The transcript also remains unconfirmed for an extended period: the same text is repeated within the unconfirmed segment for a long time, so the latest transcript is not transferred in a timely manner. This causes a significant gap between the audio and the corresponding transcription.

I’m aware that this issue might be challenging to resolve quickly, but I’m curious about the potential causes. Could this be related to model-level issues, decoder-level issues, or something else?

I would appreciate any insights into which areas might be the most likely cause of the issue. If there are specific parts of the code or certain tools I should use to investigate these potential causes further, I’d be grateful for the guidance.

Thanks

atiorh commented 3 months ago

@gavin1818 Thanks for the report! Are you able to share the input file that reproduces this?

ZachNagengast commented 3 months ago

@gavin1818 Could you also provide a little info on the model, device, and OS you're running? There could be a number of different potential issues depending on those.

vojto commented 3 months ago

The realtime demo seems to run Whisper over the full recorded file, using seeking to transcribe only the most recent portion.

Could that be what slows things down? Would it make sense to cut the buffer off, accepting that we'd lose some of the context?

ZachNagengast commented 2 months ago

For realtime, the pipeline looks for a specific number of confirmations before it decides that a stretch of audio has been transcribed without a cutoff in the middle of a word. That is currently set to 2 segments, which does end up being slower; we need to fix the timestamp filter to produce smaller segments. You can configure it to require only 1 segment for confirmation, which might help. We also have a task to drop audio from the buffer after transcription completes, but that wouldn't impact performance.
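For readers unfamiliar with the confirmation mechanism, here is a rough sketch of the idea (hypothetical types, not WhisperKit's actual implementation): a hypothesis segment is promoted to "confirmed" only after it has survived the required number of consecutive decoding passes unchanged, so lowering the requirement from 2 to 1 confirms text sooner at some risk of confirming a segment that was cut off mid-word.

```swift
// Sketch of segment confirmation, assuming segments are compared by text.
// `requiredConfirmations` plays the role of WhisperKit's 2-segment default.
struct SegmentConfirmer {
    let requiredConfirmations: Int
    private var streak: [String: Int] = [:]
    var confirmed: [String] = []

    // Call once per decoding pass with the current hypothesis segments.
    mutating func update(hypothesis: [String]) {
        var newStreak: [String: Int] = [:]
        for text in hypothesis where !confirmed.contains(text) {
            let count = (streak[text] ?? 0) + 1
            if count >= requiredConfirmations {
                confirmed.append(text)  // survived enough passes unchanged
            } else {
                newStreak[text] = count // still waiting for more agreement
            }
        }
        streak = newStreak // segments that changed between passes start over
    }
}
```

With `requiredConfirmations: 1` a segment is confirmed the first time it appears, which is the lower-latency configuration mentioned above.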

ppcfan commented 1 month ago

> For realtime, the pipeline looks for a specific number of confirmations before it decides that a stretch of audio has been transcribed without a cutoff in the middle of a word. That is currently set to 2 segments, which does end up being slower; we need to fix the timestamp filter to produce smaller segments. You can configure it to require only 1 segment for confirmation, which might help. We also have a task to drop audio from the buffer after transcription completes, but that wouldn't impact performance.

@ZachNagengast Do you mean that the current word-level timestamps might be inaccurate? If they were accurate, then cutting the buffer at the end time of a word shouldn't split a word in the middle, right?

ZachNagengast commented 1 month ago

Word-level timestamps have to be accurate for realtime (eager) mode to work, but that isn't the whole story: cutting the buffer changes the start time of the audio. Since we use seeking (aka clipTimestamps) to decode the next segment of realtime audio as it comes in from the microphone, the seek value would also have to be reset whenever we drop audio from the buffer, so that clipTimestamps still points at the part of the buffer that has yet to be transcribed. None of that is especially complicated to build into the system, so we're keeping these things in mind for when eager mode leaves the experimental stage.
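The bookkeeping described above can be sketched as follows (a hypothetical helper for illustration, not WhisperKit API): dropping already-transcribed samples from the front of the audio buffer shifts every index, so the seek offset used for clipping must be reset to match the new buffer start.

```swift
// Minimal sketch of a streaming buffer with a seek offset, assuming
// seek is tracked in samples rather than seconds.
struct StreamBuffer {
    var samples: [Float] = []
    var seek: Int = 0  // index of the first untranscribed sample

    mutating func append(_ chunk: [Float]) {
        samples.append(contentsOf: chunk)
    }

    // Record how far transcription has progressed.
    mutating func markTranscribed(upTo sample: Int) {
        seek = sample
    }

    // Drop audio that has already been transcribed. Because the buffer's
    // origin moves, the seek value must be reset to keep pointing at the
    // yet-untranscribed audio; forgetting this is the bug described above.
    mutating func dropConfirmedAudio() {
        samples.removeFirst(seek)
        seek = 0
    }
}
```

If `dropConfirmedAudio()` removed the samples but left `seek` unchanged, the next decode would skip past audio that was never transcribed, which is why trimming the buffer and resetting the seek have to happen together.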

ppcfan commented 1 month ago

> Word-level timestamps have to be accurate for realtime (eager) mode to work, but that isn't the whole story: cutting the buffer changes the start time of the audio. Since we use seeking (aka clipTimestamps) to decode the next segment of realtime audio as it comes in from the microphone, the seek value would also have to be reset whenever we drop audio from the buffer, so that clipTimestamps still points at the part of the buffer that has yet to be transcribed. None of that is especially complicated to build into the system, so we're keeping these things in mind for when eager mode leaves the experimental stage.

Thank you for your reply