kalradivyanshu opened this issue 4 months ago
Yes, I agree that the ability to send multiple files at once would be great, and it is on the TODO list. Basically, we need some additional bookkeeping.
In your use case, if the audios are always < 30 sec, you can zero pad them to 30 sec and stitch them together. The VAD should be able to cut them at voiced positions. However, if the combined length of two consecutive audios (audio1 + audio2) is < 30 sec, they will be merged together, which can result in a single transcription for two different audios. You can avoid this by providing the vad_segments parameter manually.
For example:
vad_segments = [{'start': 0.0, 'end': 27.5, 'segments': [(0.0, 13.5), (13.7, 27.5)]}, {'start': 31.5, 'end': 44.5, 'segments': [(31.5, 44.5)]}, {'start': 44.5, 'end': 60.5, 'segments': [(44.5, 49.5), (49.7, 60.5)]}, ...]
In the above example, the second and third entries together are less than 30 sec, but they are split across two dictionaries, making sure each is processed separately in parallel.
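For illustration, here is a minimal sketch of passing such a manually built list to the batched pipeline. It assumes the pipeline version discussed in this thread accepts a vad_segments argument on transcribe, as described above (the exact signature may differ between releases), and "stitched.wav" is a hypothetical file containing the concatenated audios.

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# One dict per original audio (times in seconds), so the chunks
# are never merged into a single transcription.
vad_segments = [
    {"start": 0.0, "end": 27.5, "segments": [(0.0, 13.5), (13.7, 27.5)]},
    {"start": 31.5, "end": 44.5, "segments": [(31.5, 44.5)]},
    {"start": 44.5, "end": 60.5, "segments": [(44.5, 49.5), (49.7, 60.5)]},
]

segments, info = batched_model.transcribe(
    "stitched.wav",  # hypothetical file with the stitched audios
    vad_segments=vad_segments,
    batch_size=8,
)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```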
Oh great! Thank you for your reply!
@Jiltseb
However, if the combined length of two consecutive audios (audio1 + audio2) is < 30 sec, they will be merged together, which can result in a single transcription for 2 different audios. You can avoid this by providing the vad_segments parameter manually.
For example: vad_segments = [{'start': 0.0, 'end': 27.5, 'segments': [(0.0, 13.5), (13.7, 27.5)]}, {'start': 31.5, 'end': 44.5, 'segments': [(31.5, 44.5)]}, {'start': 44.5, 'end': 60.5, 'segments': [(44.5, 49.5), (49.7, 60.5)]}, ...] In the above example, the second and third entries together are less than 30 sec, but are split across two dictionaries, making sure each is processed separately in parallel.
In this example the second segment is 13 sec and the third is 16 sec. So if I provide the VAD segments, I am guessing VAD will not run? In that case I can't just combine the audio chunks; I have to send them through VAD, get the speech segments, and then send them in, right?
My point is that if the second segment is 13 sec but contains 10 sec of silence at the end, it can cause Whisper to hallucinate, and since I am manually sending VAD segments, VAD will be skipped in faster-whisper?
So my flow should be: combine the audio chunks -> run VAD -> build the vad_segments dict to pass to faster_whisper -> transcribe. Right? Thank you for all your help!
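As a rough sketch of that flow (not the exact faster-whisper API; it assumes decode_audio and faster_whisper.vad.get_speech_timestamps behave as in recent releases, with 16 kHz audio and sample-based timestamps):

```python
import numpy as np
from faster_whisper import decode_audio
from faster_whisper.vad import VadOptions, get_speech_timestamps

SAMPLING_RATE = 16000
files = ["a.wav", "b.wav", "c.wav"]  # hypothetical inputs, each < 30 sec

chunks = []
vad_segments = []
offset = 0.0  # running position (seconds) of each file in the stitched audio

for path in files:
    audio = decode_audio(path, sampling_rate=SAMPLING_RATE)
    duration = len(audio) / SAMPLING_RATE

    # Run VAD on the individual file and shift its speech regions by the offset.
    speech = get_speech_timestamps(audio, VadOptions())
    segs = [
        (offset + ts["start"] / SAMPLING_RATE, offset + ts["end"] / SAMPLING_RATE)
        for ts in speech
    ]
    if segs:
        vad_segments.append({"start": segs[0][0], "end": segs[-1][1], "segments": segs})

    chunks.append(audio)
    offset += duration

stitched_audio = np.concatenate(chunks)
# stitched_audio and vad_segments can then go to the batched transcribe call
# shown earlier.
```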
If you already provide vad_segments, VAD will not run internally in addition (there is no need to run it anyway); have a look at the code.
If you have big silences inside the individual audio files that can potentially cause hallucinations, that's another thing to look out for.
I was simply suggesting that you pad them to 30 sec each.
Let's say the audio lengths are:
audio 1: 27.5 sec long -> zero pad to 30 sec
audio 2: 13 sec long -> zero pad to 30 sec
audio 3: 16 sec long -> zero pad to 30 sec
In this case the vad segments will be:
vad_segments = [{'start': 0.0, 'end': 27.5, 'segments': [(0.0, 27.5)]}, {'start': 30.0, 'end': 43.0, 'segments': [(30.0, 43.0)]}, {'start': 60.0, 'end': 76.0, 'segments': [(60.0, 76.0)]}, ...]
It's a bit of a hacky implementation that doesn't fully utilize the GPU, but once we support multiple files as input, this should become easier for you.
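A minimal sketch of that padding approach, assuming 16 kHz mono audio and, as above, that the batched transcribe call accepts vad_segments; the file names are illustrative:

```python
import numpy as np
from faster_whisper import decode_audio

SAMPLING_RATE = 16000
CHUNK_SECONDS = 30.0
files = ["audio1.wav", "audio2.wav", "audio3.wav"]  # each <= 30 sec

padded_chunks = []
vad_segments = []

for i, path in enumerate(files):
    audio = decode_audio(path, sampling_rate=SAMPLING_RATE)
    duration = len(audio) / SAMPLING_RATE

    # Zero pad every file to exactly 30 sec so each one occupies its own window.
    pad = int(CHUNK_SECONDS * SAMPLING_RATE) - len(audio)
    padded_chunks.append(np.pad(audio, (0, max(pad, 0))))

    # One entry per file, so no two files can end up in the same transcription.
    start = i * CHUNK_SECONDS
    vad_segments.append(
        {"start": start, "end": start + duration, "segments": [(start, start + duration)]}
    )

stitched_audio = np.concatenate(padded_chunks)
# segments, info = batched_model.transcribe(
#     stitched_audio, vad_segments=vad_segments, batch_size=8
# )
```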
Hey @Jiltseb, thank you for the detailed reply! While playing around with the batched model, I saw that VAD segment detection seems to be buggy in the new batched pipeline, so I opened a new issue: #919.
Also, batched_model with batch_size = 1 seems to give much more consistent performance than model.transcribe. Why is that? model.transcribe sometimes spikes to 1 s to process 30 s of audio on my L40S, while batched_model with batch_size = 1 always takes around 270 ms. I am curious, are there other performance improvements in batched_model?
It looks like #919 is related to word_timestamps.
There are several reasons for it. batched_model does not have all the settings of the original one (for example, temperature fallback), skips some checks, and makes each segment independent of the next one in the batch or pipeline. Have a look at the original PR for additional details on the improvements. Batching removes the dependency on a bigger context, which sometimes leads to better results when the context (the output of the previous segment) is ambiguous.
Hey @Jiltseb, if I were to try to open a PR to add the ability to send multiple files, how would I go about it? Can you give me a rough guide?
Have a look at WhisperS2T: https://github.com/shashikg/WhisperS2T. They provide support for multiple files.
Since #856 got merged, I was wondering if we could support sending multiple files in one go into faster-whisper, something like:
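(The snippet from the original post is not preserved here; the call below is only a hypothetical illustration of the requested interface, not an existing faster-whisper API.)

```python
# Hypothetical, not an existing faster-whisper API: one call that takes
# several short files and returns one result per file.
results = batched_model.transcribe_batch(
    ["call_01.wav", "call_02.wav", "call_03.wav"],
    batch_size=8,
)
for segments, info in results:
    print(" ".join(segment.text for segment in segments))
```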
This would help use cases where you have a lot of small files. I have a use case where I want to transcribe multiple files of up to 30 sec of audio each (they will never be more than 30 sec), so I was wondering if I could stitch them together and pass them in as one file into BatchedInferencePipeline? In my limited tests this seems to work, but will the segments always be exactly 30 sec? Basically, can I be guaranteed that if I pad my audio to exactly 30 sec, each segment will correspond to one audio and no segment will contain transcription from two different audios? Thank you for all your work!
@Jiltseb