@tien1504 Good question! The SegmentSeeking protocol and its extensions define the logic for shifting the audio window from 0-30 seconds to T-(T+30) seconds, which is what allows inputs longer than 30 seconds. Chunking is one possible way to improve throughput when processing a single long file, but WhisperKit does not have an implementation of chunking. One consideration is that chunking does not necessarily produce aligned segments (where speech starts at the beginning of the segment), and we have observed that Whisper's performance degrades on unaligned segments.
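To make the difference between seeking and chunking concrete, here is a minimal sketch of the general idea in Python. It is not WhisperKit's code or API: `seek_windows`, `fixed_chunks`, and the `last_speech_end` callback are hypothetical names, and the callback is only a stand-in for a decoder that reports where the last complete segment in a window ends.

```python
from typing import Callable, List, Tuple

WINDOW_SECONDS = 30.0  # Whisper's fixed input window length


def seek_windows(
    duration: float,
    last_speech_end: Callable[[float, float], float],
) -> List[Tuple[float, float]]:
    """Return the (start, end) windows visited while seeking through the audio.

    `last_speech_end(start, end)` is a hypothetical stand-in for the decoder:
    it returns the absolute time at which the last complete segment in the
    window ends. The next window starts there, so it stays aligned to speech.
    """
    windows = []
    seek = 0.0
    while seek < duration:
        end = min(seek + WINDOW_SECONDS, duration)
        windows.append((seek, end))
        next_seek = last_speech_end(seek, end)
        # Guard against a decoder that makes no progress:
        # fall back to the end of the current window.
        if next_seek <= seek:
            next_seek = end
        seek = min(next_seek, end)
    return windows


def fixed_chunks(duration: float) -> List[Tuple[float, float]]:
    """Naive chunking: cut every 30 s regardless of where speech starts."""
    chunks = []
    start = 0.0
    while start < duration:
        chunks.append((start, min(start + WINDOW_SECONDS, duration)))
        start += WINDOW_SECONDS
    return chunks
```

With a 70-second file and a stand-in decoder whose last complete segment ends 5 seconds before the end of every full window, seeking visits 0-30, 25-55, and 50-70, while fixed chunking cuts at 0-30, 30-60, and 60-70 and may split speech at the boundaries. Fixed chunks are independent of each other, which is why they can be processed in parallel, whereas each seek position depends on the previous window's output.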
We will share some other throughput-optimized configurations for processing multiple audio files soon. We also have earnings22 evaluations (a test dataset of 1-hour-long audio files) for several Whisper implementations, which demonstrate the accuracy of various multi-segment audio processing strategies, and we will be publishing them here.
Hope that answers your question, @tien1504! If not, you can still reply in this thread with any follow-ups.
Does it have a duration limit? I remember that Whisper limits the input to 30 seconds, but when I tested it on macOS, the app could handle much longer audio files. Do you have to chunk the audio files before transcription?