kalradivyanshu opened this issue 4 months ago
Yes, I agree that the ability to send multiple files at once would be great, and it is on the TODO list. Basically, we need some additional bookkeeping.
In your use case, if the audios are always < 30 sec, you can zero pad them to 30 sec and stitch them together. The VAD should be able to cut them at voiced positions. However, if the combined length of two consecutive audios (audio1 + audio2) is < 30 sec, they will be merged together, which can result in a single transcription for two different audios. You can avoid this by providing the vad_segments parameter manually.
For example:
vad_segments = [{'start': 0.0, 'end': 27.5, 'segments': [(0.0, 13.5), (13.7, 27.5)]}, {'start': 31.5, 'end': 44.5, 'segments': [(31.5, 44.5)]}, {'start': 44.5, 'end': 60.5, 'segments': [(44.5, 49.5), (49.7, 60.5)]}, ...]
In the above example, the second and third entries together are less than 30 sec, but they are split across two dictionaries, making sure each is processed separately in parallel.
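For illustration, here is a minimal sketch of passing such a manually built list to the batched pipeline. It assumes the pipeline version discussed in this thread accepts a vad_segments argument on transcribe, as described above (the exact signature may differ between releases), and "stitched.wav" is a hypothetical file containing the concatenated audios.

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# One dict per original audio (times in seconds), so the chunks
# are never merged into a single transcription.
vad_segments = [
    {"start": 0.0, "end": 27.5, "segments": [(0.0, 13.5), (13.7, 27.5)]},
    {"start": 31.5, "end": 44.5, "segments": [(31.5, 44.5)]},
    {"start": 44.5, "end": 60.5, "segments": [(44.5, 49.5), (49.7, 60.5)]},
]

segments, info = batched_model.transcribe(
    "stitched.wav",  # hypothetical file with the stitched audios
    vad_segments=vad_segments,
    batch_size=8,
)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```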
Oh great! Thank you for your reply!
@Jiltseb
However, if the combined length of two consecutive audios (audio1 + audio2) is < 30 sec, they will be merged together, which can result in a single transcription for 2 different audios. You can avoid this by providing the vad_segments parameter manually.
For example: vad_segments = [{'start': 0.0, 'end': 27.5, 'segments': [(0.0, 13.5), (13.7, 27.5)]}, {'start': 31.5, 'end': 44.5, 'segments': [(31.5, 44.5)]}, {'start': 44.5, 'end': 60.5, 'segments': [(44.5, 49.5), (49.7, 60.5)]}, ...] In the above example, the second and third entries together are less than 30 sec, but are split across two dictionaries, making sure each is processed separately in parallel.
In this example the second segment is 13 sec and the third is 16 sec. So if I provide the VAD segments, I am guessing VAD will not run? In that case I can't just combine the audio chunks; I have to send them through VAD, get the speech segments, and then send them in, right?
My point is that if the second segment is 13 sec but contains 10 sec of silence at the end, it can cause Whisper to hallucinate, and since I am manually sending VAD segments, VAD will be skipped in faster-whisper?
So my flow should be: combine the audio chunks -> run VAD -> build the vad_segments dict to pass to faster_whisper -> transcribe. Right? Thank you for all your help!
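As a rough sketch of that flow (not the exact faster-whisper API; it assumes decode_audio and faster_whisper.vad.get_speech_timestamps behave as in recent releases, with 16 kHz audio and sample-based timestamps):

```python
import numpy as np
from faster_whisper import decode_audio
from faster_whisper.vad import VadOptions, get_speech_timestamps

SAMPLING_RATE = 16000
files = ["a.wav", "b.wav", "c.wav"]  # hypothetical inputs, each < 30 sec

chunks = []
vad_segments = []
offset = 0.0  # running position (seconds) of each file in the stitched audio

for path in files:
    audio = decode_audio(path, sampling_rate=SAMPLING_RATE)
    duration = len(audio) / SAMPLING_RATE

    # Run VAD on the individual file and shift its speech regions by the offset.
    speech = get_speech_timestamps(audio, VadOptions())
    segs = [
        (offset + ts["start"] / SAMPLING_RATE, offset + ts["end"] / SAMPLING_RATE)
        for ts in speech
    ]
    if segs:
        vad_segments.append({"start": segs[0][0], "end": segs[-1][1], "segments": segs})

    chunks.append(audio)
    offset += duration

stitched_audio = np.concatenate(chunks)
# stitched_audio and vad_segments can then go to the batched transcribe call
# shown earlier.
```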
If you already provide vad_segments, VAD will not run internally in addition (there is no need to run it anyway); have a look at the code.
If you have big silences inside the individual audio files that can potentially cause hallucinations, that's another thing to look out for.
I was simply suggesting that you pad them to 30 sec each.
Let's say the audio lengths are:
audio 1: 27.5 sec long -> zero pad to 30 sec
audio 2: 13 sec long -> zero pad to 30 sec
audio 3: 16 sec long -> zero pad to 30 sec
In this case the vad segments will be:
vad_segments = [{'start': 0.0, 'end': 27.5, 'segments': [(0.0, 27.5)]}, {'start': 30.0, 'end': 43.0, 'segments': [(30.0, 43.0)]}, {'start': 60.0, 'end': 76.0, 'segments': [(60.0, 76.0)]}, ...]
It's a bit of a hacky implementation that doesn't fully utilize the GPU, but once we support multiple files as input, this should become easier for you.
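A minimal sketch of that padding approach, assuming 16 kHz mono audio and, as above, that the batched transcribe call accepts vad_segments; the file names are illustrative:

```python
import numpy as np
from faster_whisper import decode_audio

SAMPLING_RATE = 16000
CHUNK_SECONDS = 30.0
files = ["audio1.wav", "audio2.wav", "audio3.wav"]  # each <= 30 sec

padded_chunks = []
vad_segments = []

for i, path in enumerate(files):
    audio = decode_audio(path, sampling_rate=SAMPLING_RATE)
    duration = len(audio) / SAMPLING_RATE

    # Zero pad every file to exactly 30 sec so each one occupies its own window.
    pad = int(CHUNK_SECONDS * SAMPLING_RATE) - len(audio)
    padded_chunks.append(np.pad(audio, (0, max(pad, 0))))

    # One entry per file, so no two files can end up in the same transcription.
    start = i * CHUNK_SECONDS
    vad_segments.append(
        {"start": start, "end": start + duration, "segments": [(start, start + duration)]}
    )

stitched_audio = np.concatenate(padded_chunks)
# segments, info = batched_model.transcribe(
#     stitched_audio, vad_segments=vad_segments, batch_size=8
# )
```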
Hey @Jiltseb, thank you for the detailed reply! While playing around with the batched model, I saw that VAD segment detection seems to be buggy in the new batched pipeline, so I opened a new issue: #919.
Also, batched_model with batch_size = 1 seems to give much more consistent performance than model.transcribe. Why is that? model.transcribe sometimes spikes to 1 s to process 30 s of audio on my L40S, while batched_model with batch_size = 1 always takes around 270 ms. I am curious, are there other performance improvements in batched_model?
It looks like #919 is related to word_timestamps.
There are several reasons for it. batched_model does not have all the settings of the original one (for example, temperature fallback), skips some checks, and makes each segment independent of the next one in the batch or pipeline. Have a look at the original PR for additional details on the improvements. Batching removes the dependency on a bigger context, which sometimes leads to better results when the context (the output of the previous segment) is ambiguous.
Hey @Jiltseb, if I were to try to open a PR to add the ability to send multiple files, how would I go about it? Can you give me a rough guide?
Have a look at WhisperS2T: https://github.com/shashikg/WhisperS2T. They provide support for multiple files.
Since #856 got merged, I was wondering if we could support sending multiple files in one go into faster-whisper, something like:
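(The snippet from the original post is not preserved here; the call below is only a hypothetical illustration of the requested interface, not an existing faster-whisper API.)

```python
# Hypothetical, not an existing faster-whisper API: one call that takes
# several short files and returns one result per file.
results = batched_model.transcribe_batch(
    ["call_01.wav", "call_02.wav", "call_03.wav"],
    batch_size=8,
)
for segments, info in results:
    print(" ".join(segment.text for segment in segments))
```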
This would help use cases where you have a lot of small files. I have a use case where I want to transcribe multiple files of up to 30 sec of audio each (they will never be more than 30 sec), so I was wondering if I could stitch them together and pass them in as one file into BatchedInferencePipeline? In my limited tests this seems to work, but will the segments always be exactly 30 sec? Basically, can I be guaranteed that if I pad my audio to exactly 30 sec, each segment will correspond to one audio and no segment will contain transcription from two different audios? Thank you for all your work!
@Jiltseb