argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
http://argmaxinc.com/blog/whisperkit
MIT License
3.92k stars, 330 forks

Resample audio files in 10mb chunks #158

Closed: finnvoor closed this 5 months ago

finnvoor commented 5 months ago

closes #16

Resampling audio files in 10 MB chunks reduces peak memory usage and fixes some niche issues when transcribing very long audio files or files with a very high sample rate or channel count.

[Images: before / after]

10 MB is somewhat arbitrary, but I chose it to roughly match the peak memory usage of the rest of the pipeline.
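To get a feel for what a fixed byte budget means in practice, here is a rough back-of-the-envelope sketch. It assumes the budget is 10 MiB and that it applies to uncompressed 32-bit float samples. Neither assumption is confirmed in this thread, and `frames_per_chunk` and `chunk_duration_seconds` are hypothetical helper names, not WhisperKit API.

```python
# Hypothetical helpers: how much audio fits in a fixed byte budget.
# Assumption: "10mb" means 10 MiB of uncompressed float32 samples.
CHUNK_BYTES = 10 * 1024 * 1024


def frames_per_chunk(channels: int, bytes_per_sample: int = 4,
                     budget: int = CHUNK_BYTES) -> int:
    """Whole frames (one sample per channel) that fit in `budget` bytes."""
    frame_bytes = channels * bytes_per_sample
    return max(1, budget // frame_bytes)


def chunk_duration_seconds(sample_rate: int, channels: int,
                           bytes_per_sample: int = 4) -> float:
    """Seconds of audio covered by one chunk at the given format."""
    return frames_per_chunk(channels, bytes_per_sample) / sample_rate


if __name__ == "__main__":
    for rate, ch in [(16_000, 1), (44_100, 2), (192_000, 8)]:
        print(f"{rate} Hz, {ch} ch: {frames_per_chunk(ch)} frames, "
              f"{chunk_duration_seconds(rate, ch):.2f} s per chunk")
```

Under these assumptions a chunk covers fewer seconds as the sample rate and channel count go up, so the number of chunks grows for exactly the files that were problematic, while the per-chunk memory stays bounded.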

I expect this will make resampling very slightly slower, but since resampling is a small fraction of the total pipeline time, and given the memory savings, it seems like a reasonable tradeoff.

ZachNagengast commented 5 months ago

Amazing! Was just about to look at this. Do you think there's any impact on audio quality at the breakpoints? I wouldn't expect much, just curious.

finnvoor commented 5 months ago

Hmm, not really sure, but I doubt it would be enough to notice. I didn't test much, but I got the same transcript when a file was split into ~16 chunks.
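One way to reason about the breakpoint question is with a toy resampler. The sketch below is not how WhisperKit or the system audio converter works. It is a plain linear interpolator, and all function names are made up for illustration. It compares resampling each chunk independently with resampling that carries its position and leftover samples across chunk boundaries.

```python
import math


def resample_linear(samples, src_rate, dst_rate):
    """One-shot linear-interpolation resampling of a mono signal."""
    n_out = len(samples) * dst_rate // src_rate
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        j = int(pos)
        frac = pos - j
        out.append(samples[j] * (1 - frac) + samples[min(j + 1, last)] * frac)
    return out


def resample_naive_chunks(chunks, src_rate, dst_rate):
    """Resample every chunk on its own; phase resets at each boundary."""
    out = []
    for chunk in chunks:
        out.extend(resample_linear(chunk, src_rate, dst_rate))
    return out


def resample_stateful_chunks(chunks, src_rate, dst_rate):
    """Chunked resampling that keeps a global output index and a small
    window of unconsumed input, so boundaries are invisible."""
    out, window = [], []
    window_start = 0  # global index of window[0]
    total = 0         # input samples seen so far
    i = 0             # global output index
    for chunk in chunks:
        window.extend(chunk)
        total += len(chunk)
        while True:
            pos = i * src_rate / dst_rate
            j = int(pos)
            if j + 1 >= total:  # right-hand neighbour not available yet
                break
            frac = pos - j
            a, b = window[j - window_start], window[j + 1 - window_start]
            out.append(a * (1 - frac) + b * frac)
            i += 1
        keep_from = min(j, total)  # drop input that is no longer needed
        del window[:keep_from - window_start]
        window_start = keep_from
    n_out = total * dst_rate // src_rate
    while i < n_out:  # flush the tail, clamping at the final sample
        pos = i * src_rate / dst_rate
        j = int(pos)
        frac = pos - j
        a = window[j - window_start]
        b = window[min(j + 1, total - 1) - window_start]
        out.append(a * (1 - frac) + b * frac)
        i += 1
    return out


if __name__ == "__main__":
    signal = [math.sin(0.05 * k) for k in range(1000)]
    chunks = [signal[k:k + 100] for k in range(0, 1000, 100)]
    whole = resample_linear(signal, 48_000, 16_000)
    print(len(whole), len(resample_naive_chunks(chunks, 48_000, 16_000)))
    print(resample_stateful_chunks(chunks, 48_000, 16_000) == whole)
```

In this toy setup the stateful version reproduces the one-shot output exactly, while the independent version loses a fraction of a sample per chunk and drifts. Whether a real converter behaves like the first or the second depends on whether it keeps state between calls, which is worth checking in evals rather than assuming.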

atiorh commented 5 months ago

Thanks for the contrib @finnvoor! We will run full evals for 1.0.0 on all this behavior and address regressions (if any). This looks to be low risk but we might need to couple this with VAD to be double sure.

ZachNagengast commented 4 months ago

FYI, there appears to be an issue with this code that places audio in the wrong position in the outputBuffer. I am working on an approach that appends to the buffer every 10 MB instead of writing directly to it.
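The thread does not say what the positioning error actually was. As a purely hypothetical illustration of the two strategies, the sketch below contrasts writing converted chunks into a preallocated buffer at a tracked offset with simply appending them. The offset bug shown (advancing by input frames instead of output frames) is an invented example, and `decimate` is a stand-in for a real converter.

```python
def decimate(chunk, step):
    """Stand-in for a real converter: keep every `step`-th sample."""
    return chunk[::step]


def write_at_offset_buggy(chunks, step):
    """Preallocate the output and write each converted chunk at an offset.
    Hypothetical bug: the offset advances by INPUT frames, not OUTPUT frames."""
    total_in = sum(len(c) for c in chunks)
    out = [0.0] * (total_in // step)
    pos = 0
    for chunk in chunks:
        converted = decimate(chunk, step)
        for k, sample in enumerate(converted):
            if pos + k < len(out):  # writes past the end are dropped
                out[pos + k] = sample
        pos += len(chunk)           # wrong unit; should be len(converted)
    return out


def append_chunks(chunks, step):
    """Append each converted chunk; no offset bookkeeping to get wrong."""
    out = []
    for chunk in chunks:
        out.extend(decimate(chunk, step))
    return out


if __name__ == "__main__":
    samples = [float(k) for k in range(12)]
    chunks = [samples[k:k + 4] for k in range(0, 12, 4)]
    print(write_at_offset_buggy(chunks, 2))  # gaps and missing tail
    print(append_chunks(chunks, 2))          # contiguous output
```

Appending trades a little reallocation for not having to keep a write position in sync with the converter's actual output length, which is the kind of bookkeeping where a misplacement like this can creep in.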