argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License
3.17k stars, 268 forks

Duration limit? #12

Closed: tien1504 closed this issue 7 months ago

tien1504 commented 7 months ago

Does it have a duration limit? I remember that Whisper limits the input to 30 seconds, but when I tested it on macOS, the app could handle much longer audio files. Do you have to chunk the audio files before transcription?

atiorh commented 7 months ago

@tien1504 Good question! The SegmentSeeking protocol and its extensions define the logic for shifting the 30-second audio window from 0-30 seconds to T to T+30 seconds, which allows inputs longer than 30 seconds. Chunking is a possible way to improve throughput when processing a single long file, but WhisperKit does not have an implementation of chunking. One consideration is that chunking does not necessarily produce aligned segments (segments that begin with speech), and we have observed that Whisper's performance degrades on unaligned segments.
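
To illustrate the window-shifting idea, here is a minimal Python sketch. It is not WhisperKit's Swift implementation and does not reproduce the SegmentSeeking API; the names `seek_windows` and `consumed_in_window` are hypothetical, and the callback merely stands in for decoding by reporting how many seconds of the current window were covered by complete segments.

```python
from typing import Callable, List, Tuple

WINDOW_SECONDS = 30.0  # Whisper's fixed input window length


def seek_windows(
    total_seconds: float,
    consumed_in_window: Callable[[float, float], float],
) -> List[Tuple[float, float]]:
    """Return the (start, end) windows visited while seeking through the audio.

    consumed_in_window(start, end) is a hypothetical stand-in for decoding:
    it returns how many seconds of that window were covered by complete segments.
    """
    windows: List[Tuple[float, float]] = []
    seek = 0.0  # this is T, the start of the current window
    while seek < total_seconds:
        end = min(seek + WINDOW_SECONDS, total_seconds)
        windows.append((seek, end))
        consumed = consumed_in_window(seek, end)
        # Guard against a zero or negative advance, which would loop forever.
        if consumed <= 0:
            consumed = end - seek
        # Shift the window start to where the decoded content ended.
        seek += min(consumed, end - seek)
    return windows
```

For example, `seek_windows(70.0, lambda s, e: min(25.0, e - s))` visits windows starting at 0, 25, and 50 seconds, because each window start follows the end of the previously decoded content. Fixed-size chunking, by contrast, would cut at 0, 30, and 60 seconds regardless of where speech begins.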

atiorh commented 7 months ago

We will talk about some other throughput-optimized configurations for processing multiple audio files soon. We also have earnings22 evaluations (a test dataset comprising 1-hour-long audio files) for several Whisper implementations, which demonstrate the accuracy of various multi-segment audio processing strategies, and we will be publishing them here.

ZachNagengast commented 7 months ago

Hope that answers your question, @tien1504! If not, you can still respond in this thread with any follow-ups.