argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon

https://takeargmax.com/blog/whisperkit

MIT License

3.17k stars, 267 forks

VAD audio chunking #135

Closed: jkrukowski closed this 4 months ago

jkrukowski commented 4 months ago

This PR introduces VAD-based audio chunking. Voice activity detection (VAD) is used to find the speech segments in the audio file, the audio is split into chunks based on those segments, and each chunk is padded with zeros to match the 30-second input length. The chunks are then processed as a batch, which results in a significant speedup.

Some benchmarks (on my MacBook Air M1):

Audio file 12:16 length

with VAD:

38.16s user 5.86s system 470% cpu 9.349 total

without VAD:

33.25s user 3.55s system 132% cpu 27.678 total

Audio file 40:26 length

with VAD:

126.54s user 18.41s system 500% cpu 28.952 total

without VAD:

96.55s user 10.47s system 133% cpu 1:20.08 total
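
For reference, the wall-clock speedup implied by these timings can be worked out with a few lines of Python. This is only arithmetic on the `total` values quoted above; the `runs` dictionary and the `speedup` helper are names made up for this sketch.

```python
# Wall-clock totals copied from the `time` output above, in seconds.
runs = {
    "12:16 audio": {"vad": 9.349, "no_vad": 27.678},
    "40:26 audio": {"vad": 28.952, "no_vad": 60 + 20.08},  # 1:20.08 total
}


def speedup(no_vad: float, vad: float) -> float:
    """Ratio of wall-clock time without VAD chunking to time with it."""
    return no_vad / vad


for name, t in runs.items():
    print(f"{name}: {speedup(t['no_vad'], t['vad']):.2f}x faster with VAD chunking")
```

Both files come out a little under 3x faster in wall-clock time. Total CPU time (user + system) is actually higher with VAD, so the gain comes from the batch keeping more cores busy (470-500% CPU versus roughly 130%).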

To use it in WhisperKitCLI, pass the --chunking-strategy flag:

swift run -c release whisperkit-cli transcribe --audio-path /path/to/audio.wav --chunking-strategy vad

atiorh commented 4 months ago

Great work @jkrukowski! Did you see the Cut and Merge strategy in https://github.com/m-bain/whisperX?

If we don't attempt to pack short segments into 30 seconds before padding, as Cut and Merge does, the worst-case performance might regress below baseline (e.g. when many ~1-3 s chunks each get padded to 30 s). Let me know if you think Cut and Merge is an extension we should leave as future work or bundle here :)

Edit: Cut and Merge will also mean some additional bookkeeping to adjust word-level timestamps post-inference.
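
To make the packing and the bookkeeping concrete, here is a minimal Python sketch of the "merge" half of the idea. It is not whisperX's implementation: `Chunk`, `merge_segments`, and `to_original_time` are hypothetical names, every VAD segment is assumed to be shorter than 30 seconds already (the "cut" step is left out), and this variant drops the silence between segments, which is exactly why timestamps need remapping afterwards.

```python
from dataclasses import dataclass, field

MAX_CHUNK_SECONDS = 30.0  # Whisper's fixed input window


@dataclass
class Chunk:
    # (start, end) of each speech segment on the original timeline, in seconds.
    segments: list = field(default_factory=list)

    @property
    def speech_duration(self) -> float:
        return sum(end - start for start, end in self.segments)


def merge_segments(segments, max_seconds=MAX_CHUNK_SECONDS):
    """Greedily pack consecutive speech segments into chunks of at most max_seconds."""
    chunks = []
    current = Chunk()
    for start, end in segments:
        duration = end - start
        # Start a new chunk when this segment would overflow the current one.
        if current.segments and current.speech_duration + duration > max_seconds:
            chunks.append(current)
            current = Chunk()
        current.segments.append((start, end))
    if current.segments:
        chunks.append(current)
    return chunks


def to_original_time(chunk, t_in_chunk):
    """Map a timestamp inside a packed chunk back to the original audio timeline."""
    elapsed = 0.0
    for start, end in chunk.segments:
        duration = end - start
        if t_in_chunk <= elapsed + duration:
            return start + (t_in_chunk - elapsed)
        elapsed += duration
    raise ValueError("timestamp falls in the zero padding, not in speech")


vad_segments = [(0.0, 2.0), (5.0, 8.0), (10.0, 36.0), (40.0, 41.5)]
packed = merge_segments(vad_segments)
```

With these four example segments, padding each one separately would mean four 30-second decoder windows, while packing needs two. A word decoded at 3.0 s inside the first packed chunk maps back to 6.0 s in the original audio.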

atiorh commented 4 months ago

@Abhinay1997 Do you mind rebasing on top of this PR so we can add a WER check (w/ and w/o VAD-based chunking) on your long audio test sample? 🙏
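
For context, a WER check of this kind boils down to a word-level edit distance against a reference transcript. The Python sketch below shows the standard computation; it is not the test from either PR, and `word_error_rate` plus its lowercase-and-split normalization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Running this once on the transcript produced with VAD-based chunking and once on the transcript produced without it, against the same reference, gives the two numbers to compare.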

Abhinay1997 commented 4 months ago

Hey @atiorh! No worries, I'll do that by tomorrow. I want to make sure there are no bugs or crash-prone code in my PR.

jkrukowski commented 4 months ago

I'd leave Cut and Merge as future work if possible. After talking to @ZachNagengast the other day I took a slightly different approach here: using VAD, I try to find the best cut-off point in the second half of each 30-second audio chunk. So there is no risk of having a bunch of small segments padded with zeros, because each chunk will contain at least 15 seconds of the original audio. Having said that, I think Cut and Merge is a better (but more complicated) approach.
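
A rough Python sketch of that cut-point search, for readers who want to see the shape of it. This is not the PR's Swift code: it uses a plain frame-energy measure as a stand-in for the VAD, and `frame_energy`, `find_cut_point`, `chunk_audio`, and the 100 ms frame size are all assumptions made for illustration.

```python
SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz mono audio
WINDOW_SECONDS = 30    # fixed model input length
FRAME_SECONDS = 0.1    # energy measured over 100 ms frames (assumed value)


def frame_energy(samples, start, length):
    """Mean squared amplitude of one frame; a crude stand-in for a VAD score."""
    frame = samples[start:start + length]
    return sum(x * x for x in frame) / max(len(frame), 1)


def find_cut_point(samples, window_start, sample_rate=SAMPLE_RATE):
    """Return the sample index of the quietest frame in the 2nd half of the window."""
    window = WINDOW_SECONDS * sample_rate
    frame = max(int(FRAME_SECONDS * sample_rate), 1)
    search_from = window_start + window // 2
    search_to = window_start + window
    best_index, best_energy = search_to, float("inf")
    for i in range(search_from, search_to - frame + 1, frame):
        energy = frame_energy(samples, i, frame)
        if energy < best_energy:
            best_index, best_energy = i + frame // 2, energy
    return best_index


def chunk_audio(samples, sample_rate=SAMPLE_RATE):
    """Split audio into (start_index, zero-padded 30 s chunk) pairs."""
    window = WINDOW_SECONDS * sample_rate
    chunks = []
    start = 0
    while start < len(samples):
        if len(samples) - start <= window:
            end = len(samples)  # the tail fits in one window
        else:
            end = find_cut_point(samples, start, sample_rate)
        chunk = list(samples[start:end])
        chunk += [0.0] * (window - len(chunk))  # pad with zeros to 30 s
        chunks.append((start, chunk))
        start = end
    return chunks
```

Because the search only looks at the second half of the window, every chunk except possibly the last carries between 15 and 30 seconds of real audio. The start index stored with each chunk is what lets timestamps be shifted back onto the original timeline after decoding.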

atiorh commented 4 months ago

Makes sense, this is great.

ZachNagengast commented 4 months ago

Here's a recording of the example app running chunking about 4x faster with minimal WER loss 🚀

https://github.com/argmaxinc/WhisperKit/assets/1981179/fe403ceb-b752-4396-ab7a-905eb3351c40