batch execution transcribe in faster-whisper

eschmidbauer commented 1 year ago

Hello, Is batch execution of faster-whisper's transcribe possible? We've seen in this thread that batch execution should increase the throughput. But it's not clear how to perform batch using faster-whisper if at all. Thanks!

guillaumekln commented 1 year ago

Hi,

The model implemented in CTranslate2 supports batch execution (with some caveats), but faster-whisper currently implements the same transcription logic as openai/whisper which only processes a single audio file.

We could add a batch mode in the future.

Note that there is already a way to increase throughput for CPU execution: increase num_workers and call transcribe from multiple Python threads.

eschmidbauer commented 1 year ago

thanks, i was able to figure it out with threading

Joepetey commented 1 year ago

Hey @eschmidbauer, could you elaborate on how you accomplished the threading?

eschmidbauer commented 1 year ago

here is an example, pass the faster whisper model to your Process thread

        for file in file_list:
            thread: process.Process = process.Process(
                model=model, file_path=file)
            thread.start()
            threads.append(thread)

        for thread in threads:
            thread.join()

guillaumekln commented 1 year ago

Let's keep this issue open. It could be interesting to have an actual batch execution, especially on GPU.

timnlupo commented 1 year ago

+1 here

junshipeng commented 1 year ago

mark

ezerhouni commented 1 year ago

FYI: There is a fork with batch inference for the open ai implementation: https://github.com/openai/whisper/discussions/662

guillaumekln commented 1 year ago

WhisperX pushed an experimental branch implementing batch execution with faster-whisper:

https://github.com/m-bain/whisperX/issues/159#issuecomment-1521619648

natecarlson commented 1 year ago

An implementation note - it would be great to be able to both segment large audio files (as WhisperX does), and have the option to pass in a bunch of independent audio files and run those as a batch.

chainyo commented 1 year ago

WhisperX pushed an experimental branch implementing batch execution with faster-whisper:

m-bain/whisperX#159 (comment)

@guillaumekln , The faster-whisper transcribe implementation is still faster than the batch request option proposed by whisperX. I re-created, with some simplification (I don't use the Binarizer), the entire batching pipeline, and it's like 2x-3x slower than using faster-whisper with num_workers=1, which is sad 🤗 The data exchange between CPU and GPU takes most of the inference time. It could come from my implementation, so I'm still investigating.

ref: https://github.com/Wordcab/wordcab-transcribe/blob/6-investigate-on-a-better-gpu-usage/wordcab_transcribe/services/transcribe_service.py

aleksandr-smechov commented 1 year ago

@guillaumekln Any thoughts on the above?

guillaumekln commented 1 year ago

The data exchange between CPU and GPU takes most of the inference time

How did you find that?

The code looks correct to me but you should try simplifying the usage:

Make sure to use the latest version of ctranslate2
Start with num_workers=1
You could remove asynchronous=True since you are waiting for the results just after.
If you are not reusing the encoder output, it's better to not run the encoder separately. In that case you can directly pass the input features to the method generate.

See also my comment in the whisperX issue about the current limitations of batch execution in CTranslate2: https://github.com/m-bain/whisperX/issues/159#issuecomment-1527789800

chainyo commented 1 year ago

The data exchange between CPU and GPU takes most of the inference time

How did you find that?

I timed each function execution with the time.time() utility in Python (not ideal for ML stuff), and there is a gap of 8 seconds between the inference time (2-3 sec) and getting back the transcription results in the other process of the API (11 sec). So I'm not sure it comes from the data moving from GPU to CPU, but I don't see any other part that could make the full process takes 8 seconds more than the inference time.

I'm playing with py-spy atm, I will try to profile the process and see exactly what's involved in this time gap.

Otherwise, thanks for the nits/modifications suggested.

chainyo commented 1 year ago

@guillaumekln Great news, I was wrong! After profiling the batching process, it appears that the problem doesn't come from the batch process implementation, but from the SileroVad model, which is not on the GPU and takes A LOT of extra time.

CleanShot 2023-05-31 at 15 07 24

Here is a profile of the transcription process using faster-whisper and the whisperX batch request-like implementation.

1 (red) = transcription with faster-whisper.transcribe() 2 (green) = original PyTorch SileroVad model (on CPU) 3 (orange) = the batch request like implementation

It looks like it's blazing fast for the CPU part (the GPU part is not in the profiler). Now, I need to deal with word_timestamps and the other parameters set aside for getting a fast implementation to benchmark.

Honestly, this could lead to an implementation directly in faster-whisper if it works. I will keep investigating the process.

sahamrit commented 1 year ago

@guillaumekln so i was trying to enable batching support. Here are some points and blockers I feel. Would love to hear your thoughts.

The transcribe.py needs to handle multiple audios as you mentioned. This would require some bookkeeping, taking care of different seeks etc, which seems doable. I had done a similar exercise with OAI whisper codebase so familiar with it.
The blocker seems to be with supporting variable length prompt token. Doing it in OAI codebase was simple enough by padding like - sot_prev, pad, prompt, sot, ... , and then modifying the mask to support batching + this mask, however changing mask in Ctranslate2 codebase seems to be a challenge as explained below.

OAI mask

0 -inf -inf ... 0 0 -inf ... 0 0 0 ...

Length mask in ctranslate2

1 2 3

So modifying OAI mask is simple enough for padded tokens by changing values of mask, however for Ctranslate2 masks gets created on fly inside softmax kernel by zeroing out out of range positions.

Hence modifying it for padding seems a breaking enough change.

Below is the softmax kernel implementation - https://github.com/OpenNMT/CTranslate2/blob/master/src/cpu/kernels.cc

    void softmax<TARGET_ISA>(const float* input,
                             const int32_t* lengths,
                             float* output,
                             dim_t batch_size,
                             dim_t depth,
                             bool log,
                             float epsilon) {
      using VecType = Vec<float, TARGET_ISA>;

      parallel_for(0, batch_size, 1, [&](dim_t begin, dim_t end) {
        for (dim_t i = begin; i < end; ++i) {
          const dim_t offset = i * depth;
          const float* x = input + offset;
          float* y = output + offset;

          dim_t size = depth;
          if (lengths) {
            size = lengths[i];

            // Directly set 0 in output for out of range positions.
            for (dim_t j = size; j < depth; ++j) {
              y[j] = 0;
            }

guillaumekln commented 1 year ago

Thank you for looking into that.

The mask is not the only change to make in the model. When inputs are padded in the left, each example has a different offset when applying the positional embeddings which can no longer be applied with a simple addition:

https://github.com/openai/whisper/blob/f572f2161ba831bae131364c3bffdead7af6d210/whisper/model.py#L204-L207

Instead there should be something that tracks the offset of each example and gather different positions of the positional embeddings. Something like position_ids in Transformers, see for example https://github.com/huggingface/transformers/pull/22382.

This change is a bit more complex, especially if we want to make it compatible with models using different position encoding techniques (rotary embeddings, relative positions, etc.).

sahamrit commented 1 year ago

@guillaumekln thanks for responding. Am aware of the position_ids. This can be created on fly via prefix sums on mask.

        mask_binary = torch.exp(mask)
        position_ids = mask_binary.long().cumsum(-1) - 1
        position_ids.masked_fill_(mask_binary == 0, 1)
        position_ids = position_ids[:, -1]

        x = self.token_embedding(x) + self.positional_embedding[position_ids[:,offset : offset + x.shape[-1]]]

However I completely agree that introducing these simple changes in CTranslate2 is a bit of effort because things are just so tangled, that the changes need to be at much lower level - ops/ kernel.cc etc. For instance the softmax implementation will surely need to be changed to support batching as I mentioned above. Let me know if there exist a easier way or am interpreting correctly

hobodrifterdavid commented 1 year ago

Could someone give a little summary of the batching feature? It's useable, but, initial prompts for the submitted batch must be the same? And, word_timestamps cannot be used? Is the timing of the subs otherwise good?

I'd like to process long audio files (tv programs, audiobooks, podcasts), currently breaking up to 6 min chunks, staggered with a 1 min overlap, running transcription for the chunks in parallel on faster-whisper instances (seperate python processes with faster-whisper wrapped with FastAPI, regular non-batched 'transcribe') on several gpus, then merging the transcriptions by finding the least offensive 'switch' point in the overlaping sections.. seems to work well. I'd like to try batch processing (to get more throughput by sending multiple chunks to each faster-whisper instance), but don't want to sacrifice the quality of the timings.

I don't have a need for word-level timings, this suggests it would be better to leave it off?: https://github.com/guillaumekln/faster-whisper/issues/337

EDIT:

Guillaume suggests a way in issue #100 "Multiple transcriptions can run in parallel when the model is using multiple workers or running on multiple GPUs" - ==> This sounds like it's running parallel independent transcriptions on one or more gpus rather than a true batch that increases throughput.

Found this in issue #133 https://github.com/RomanKlimov/faster-whisper-acceleration "This program dramatically accelerates the transcribing of single audio files using Faster-Whisper by splitting the file into smaller chunks at moments of silence, ensuring no loss in transcribing quality. By consuming and processing each audio chunk in parallel, this project achieves significant acceleration using only CPUs." ==> Interesting, but no mention of gpus.

Also the issue: "There is the small draw-back, that whisper feeds the transcription of the previous 30s window as prompt to the next window, to get a continuous transcription." ==> Good to know, can include an extra 30s overlap of the chunks.

https://github.com/m-bain/whisperX/blob/main/whisperx/asr.py "FasterWhisperModel provides batched inference for faster-whisper. Currently only works in non-timestamp mode and fixed prompt for all samples in batch." ==> Here it looks like WhisperX wraps faster-whisper to enable true batching but maybe using some lower-level operations, and no timestamps without the wav2vec model, need to take a closer look. I'm concerned wav2vec models might not give great timing with background music, gunfights etc.

If someone has any pointers for what I'm trying to do, I would appreciate it.

PS Something I figured out a few days ago: If you chopping up an audiofile to chunks with ffmpeg, put the -ss (start time) argument before the -i (file path), it's much faster, otherwise ffmpeg parses the whole file or something, and gets slower the further in your clip is in the file. You can feed in/out audiodata with pipes rather than files. let cmd = ffmpeg -y -ss ${seek_seconds.toString()} -i ${audioFilePath} -ac 1 -vn -codec:a pcm_s16le -ar 16000 -ac 1 -t ${duration_seconds.toString()} -f wav pipe:1

guillaumekln commented 1 year ago

Could someone give a little summary of the batching feature?

Currently there is no batching mechanism in faster-whisper, just like there is no batching mechanism in openai-whisper. In this issue we discuss the possible ways to integrate batching in faster-whisper.

The underlying implementation in CTranslate2 does support batching, but with the limitations discussed above. The main limitation is that it does not support left padding in the input tokens which is mostly required to keep the same transcription logic. This limitation could be addressed at some point. WhisperX chose to not pass the previous tokens and so worked around this limitation.

However, the internal methods used to compute the word timestamps already support batch inputs.

hobodrifterdavid commented 1 year ago

Ignore this, mostly nonsense reasoning based on misunderstanding, see next comment. 🥇 💯

I see. Say I had a 100min audioclip, if I split into 5 min chucks (+30s initial +60s for 'merging window') like this:~

[-0:30 - 6:00] # Prepend 30s of silence [4:30 - 11:00] [9:30 - 16:00] [14:30 - 21:00] [19:30 - 26:00]

~I can then discard any transcriptions that start in the initial 30s window, and I still have 60s of overlap between the chunks to find the best merge point. The merging code is working well. Processing an extra 30s of audio is not perfectly optimal, but if it allows a ~3x increase in throughput?~

The extra earlier 30s I suppose is degraded by not having the tokens from the previous 30s, and this affects the main transcription, but, that's perhaps very minor? There's also the vad.. I suppose it makes sense to do chunking after the vad has run..

Sorry to be the guy that doesn't read the code, I'll spend an hour now and see what I can make sense of.

EDIT: rereading the posts, I think this is a relevant comment: https://github.com/m-bain/whisperX/issues/159#issuecomment-1528096377

..transcribe without_timestamps=True, this is necessary otherwise Whisper might do multiple forward passes with a 30s sample (delaying the whole batch) and can also lead to repetition etc. ...

..Of course (i) can be quite limiting due to the need for timestamped transcripts, but in WhisperX timestamps are sourced from VAD & wav2vec2 alignment -- from my research findings Whisper timestamps were just too unreliable...

I'm not sure what to make of the first point. For the second, Faster-whisper timings seem alright to me, perhaps some fixes have improved things since April. I find the simpler pipeline of faster-whisper appealing, also from the issues on WhisperX the segmentation may not be great, and not all langauges have models for doing time alignment.

EDIT2: Another thought. If I understand correctly, the issue with the padding of the input tokens is relevant when you want to increase throughput when transcribing multiple audio files concurrently. However, when you break a single file into 20 chunks, and transcribe five chunks per gpu on, say, 4 gpus simultaneously, you won't have the input tokens from a preceding chunk anyway. But you can get the transcription to the user in ~30s instead of 10 mins.

hobodrifterdavid commented 1 year ago

Sorry, I realised I misunderstood. I was reading the whisperX code this evening, it's not just that the initial prompt that must be fixed, but the tokens/text of one 30s segment is not used when processing the following segment.

manbaaaa commented 1 year ago

Hi,

The model implemented in CTranslate2 supports batch execution (with some caveats), but faster-whisper currently implements the same transcription logic as openai/whisper which only processes a single audio file.

We could add a batch mode in the future.

Note that there is already a way to increase throughput for CPU execution: increase num_workers and call transcribe from multiple Python threads.

@guillaumekln CTranslate2 supports batch execution (with some caveats), but I haven't found relevant usage instructions. Could you provide a more specific tutorial on how to utilize batch execution with CTranslate2? Thank you.

guillaumekln commented 1 year ago

See the methods documentation in CTranslate2: https://opennmt.net/CTranslate2/python/ctranslate2.models.Whisper.html. Note that all methods take batch inputs.

This test case is a possible example on how to build batch inputs for CTranslate2:

https://github.com/OpenNMT/CTranslate2/blob/61d34502325bfa3c5ef8a11cd2e391d0efed1bf9/python/tests/test_transformers.py#L789

guillaumekln commented 1 year ago

I pushed an experimental branch in CTranslate2 to support variable-length text inputs for the Whisper model:

https://github.com/OpenNMT/CTranslate2/pull/1457

This could allow running the Whisper transcription in batch mode even with condition_on_previous_text=True.

It seems that multiple people in this thread tried to implement some form of batching in faster-whisper. It would be great if you can use the experimental branch and see how far you can go with your batch implementation (and share performance numbers!).

To install this CTranslate2 development build:

Go to this build page
Download the artifact "python-wheels"
Extract the archive
Install the wheel matching your system and Python version, for example:

pip install --force-reinstall ctranslate2-3.19.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

chainyo commented 1 year ago

Hi @guillaumekln, I'm checking this for batching support.

Suppose I understand well the ctranslate2 experimental branch allows batching over multiple audio files and then enables transcribing numerous audio files in parallel by extending the VRAM usage. Is it only that? What about batching over one file? (I think it's still an architecture problem?)

guillaumekln commented 1 year ago

Batching over one file is typically not possible with condition_on_previous_text=True (which is default value) because you need the previous transcription to process a 30-second window. In this case the audio needs to be processed sequentially.

Batching with condition_on_previous_text=False is already possible and does not require an experimental CTranslate2 branch. See WhisperX for example.

chainyo commented 1 year ago

Thanks for the WhisperX link. I already checked this implementation and reproduced it (by simplifying the part with HF transformers pipeline to a PyTorch pipeline). Still, this implementation is prone to problems (hallucinations, some words disappearing, wrong transcription...).

I will experiment with audio file batching closer to how the transcription pipeline works. Thanks a lot.

hobodrifterdavid commented 1 year ago

@guillaumekln Oh, I have been watching the ctranslate2 and faster-whisper commits for this update, but missed these messages. Great. I'll give it a try in the next few days.

hobodrifterdavid commented 1 year ago

Okay, I got a bit of time free, and a machine to play on.

@guillaumekln I followed the special build instructions, looks okay.

So, what is needed is something like WhisperModel.transcribe, but rather WhisperModel.transcribe_batch

I'll take a shot, are there any points/things to look out for?

hobodrifterdavid commented 1 year ago

humhum, it mostly looks pretty straightforward to add _batch versions of existing functions, putting List[] around the function argument types etc... but, there's some logic around temperatures, generate is rerun under certain criteria (if needs_fallback is True).

Nevermind, managed to code around it and keep the logic the same.. about 75% done..

Arche151 commented 1 year ago

~humhum, it mostly looks pretty straightforward to add _batch versions of existing functions, putting List[] around the function argument types etc... but, there's some logic around temperatures, generate is rerun under certain criteria (if needs_fallback is True).~

Nevermind, managed to code around it and keep the logic the same.. about 75% done..

Hey, sorry I am kind of a noob, but if I understand correctly you are now working on a pull request to implement batching into faster-whisper and the batching will work by utilizing more VRAM? :)

salahzoubi commented 1 year ago

@hobodrifterdavid any updates on this? Really looking forward to batch transcriptions...

hobodrifterdavid commented 12 months ago

@salahzoubi

I didn't get to finish it yet, but here is a modified transcribe.py with added _batch functions: https://gist.github.com/hobodrifterdavid/c437ead4d167b52ca6c0373b0a12529d

You can diff it against the one in the current repo to see the changes (https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/transcribe.py). There's not a lot of work to finish it, but I was hoping @guillaumekln would be able to comment on the approach.

Looks like there were a couple of minor changes to transcribe.py from the version I was editing.

I saw in the comments that Guillaume is moving on, but hopefully faster-whisper will get it's batch mode. :)

hobodrifterdavid commented 11 months ago

I took another look over the weekend. I wasn't satisfied with the approach of making duplicate batch functions.. these functions are less readable that the original.

Control passes:

transcribe() => generate_segments() => generate_with_fallback() => model.generate() transcribe() <= generate_segments() <= generate_with_fallback() <=

Instead of making batch versions of all these functions, it could work better to have a wrapper around transcribe, that passes in a special implemenation of generate_with_fallback (generate_with_fallback_cukoo), that uses async/await to cede control back to the wrapper mid-way through execution, so that a batch function can be called.

Something like this (it's a sketch, not working code):

    def transcribe_batch(
        self,
        audio_list_passed: List[Union[str, BinaryIO, np.ndarray]], # list
        language_list_passed: List[Union[str, None]], # list
        task_list_passed: List[Union[str, None]], # list
        initial_prompt_list_passed: List[Union[str, Iterable[int], None]], # list
        beam_size: int = 5,
        best_of: int = 5,
        patience: float = 1,
        length_penalty: float = 1,
        repetition_penalty: float = 1,
        temperature: Union[float, List[float], Tuple[float, ...]] = [
            0.0,
            0.2,
            0.4,
            0.6,
            0.8,
            1.0,
        ],
        compression_ratio_threshold: Optional[float] = 2.4,
        log_prob_threshold: Optional[float] = -1.0,
        no_speech_threshold: Optional[float] = 0.6,
        condition_on_previous_text: bool = True,
        prompt_reset_on_temperature: float = 0.5,
        prefix: Optional[str] = None,
        suppress_blank: bool = True,
        suppress_tokens: Optional[List[int]] = [-1],
        without_timestamps: bool = False,
        max_initial_timestamp: float = 1.0,
        word_timestamps: bool = False,
        prepend_punctuations: str = "\"'“¿([{-",
        append_punctuations: str = "\"'.。,，!！?？:：”)]}、",
        vad_filter: bool = False,
        vad_parameters: Optional[Union[dict, VadOptions]] = None,
        batch_size = 10,

    ) -> List[Tuple[Iterable[Segment], TranscriptionInfo]]:

        encoder_output_list: List[ctranslate2.StorageView],
        prompt_list: List[List[int]],
        tokenizer_list: List[Tokenizer],
        options_list: List[TranscriptionOptions],

        futures: List[asyncio.future] = []

        async def generate_with_fallback_cukoo(encoder_output: ctranslate2.StorageView, prompt: List[int], tokenizer: Tokenizer, options: TranscriptionOptions):

            encoder_output_list.append(encoder_output)
            prompt_list.append(prompt)
            tokenizer_list.append(tokenizer)
            options_list.append(options)

            future = asyncio.get_running_loop().create_future()
            futures.append(future)
            result = await future
            return result

        rv = []

        for i in range(len(audio_list_passed)):

            ##### EDIT: actually probably don't want to await transcribe here yet, await it after the batch function has executed..
            rv.append(await self.transcribe(
                audio=audio_list_passed[i],
                language=language_list_passed[i],
                task=task_list_passed[i],
                beam_size=beam_size,
                best_of=best_of,
                patience=patience,
                length_penalty=length_penalty,
                repetition_penalty=repetition_penalty,
                temperature=temperature,
                compression_ratio_threshold=compression_ratio_threshold,
                log_prob_threshold=log_prob_threshold,
                no_speech_threshold=no_speech_threshold,
                condition_on_previous_text=condition_on_previous_text,
                prompt_reset_on_temperature=prompt_reset_on_temperature,
                initial_prompt=initial_prompt_list_passed[i], # nope
                prefix=prefix,
                suppress_blank=suppress_blank,
                suppress_tokens=suppress_tokens,
                without_timestamps=without_timestamps,
                max_initial_timestamp=max_initial_timestamp,
                word_timestamps=word_timestamps,
                prepend_punctuations=prepend_punctuations,
                append_punctuations=append_punctuations,
                vad_filter=vad_filter,
                vad_parameters=vad_parameters,
               ########## pass in the special cukoo function: ##########
                generate_with_fallback=generate_with_fallback 
            ))

            if len(encoder_output_list) == batch_size:
                # Okays time to call the batch function..
                results = self.generate_with_fallback_batch(encoder_output_list, prompt_list, tokenizer_list, options_list)

                for j in range(len(results)):
                    futures[j].set_result(results[j])

                encoder_output_list= []
                prompt_list= []
                tokenizer_list= []
                options_list= []
                futures = []

        results = self.generate_with_fallback_batch(encoder_output_list, prompt_list, tokenizer_list, options_list)
        for j in range(len(results)):
            futures[j].set_result(results[j])

        # actually need to transform this a bit still..
        return rv

One thing.. I don't think it's possible to do this without making transcribe and generate_segments async. I'm not familar enough with Python async/await to say 100%. But you could make transcribe_async() and generate_segments_async(), and wrap them with simple non-async wrappers transcribe() and generate_segments().

If there are transcribe_async and transcribe etc., they should have the same arguments, it's a bit clumsy to maintain two sets of the long argument lists, proabably a bit better to pass in a class with options as an argument, but then the api changes.

BBC-Esq commented 11 months ago

Hello, I'm wondering if there's any progress on this issue? Wish I could help, but unfortunately I'm not an expert. However, I have reviewed the new repos like the relatively new https://github.com/Vaibhavs10/insanely-fast-whisper and WhisperX's implementation, but was wondering if anything similar is going to be implemented directly in faster-whisper like batching for a single file? I know there's been a several issues created this over a couple repositories...

Tongjilibo commented 9 months ago

mark

Jiltseb commented 5 months ago

Created this PR last week that integrate batching and additional improvements to Faster Whisper

https://github.com/SYSTRAN/faster-whisper/pull/856

SYSTRAN / faster-whisper

batch execution transcribe in faster-whisper #59