SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856

Closed Jiltseb closed 2 months ago

Jiltseb commented 4 months ago

Hello everyone,

This PR adds a major update to Faster Whisper, bringing both speed and quality improvements!

Speed improvements:

  1. Batched inference: audio is segmented with a VAD and multiple chunks are transcribed together.
  2. Faster feature extraction via the optional enable_ta_fe flag.

Using the batched version is straightforward:

from faster_whisper import WhisperModel, BatchedInferencePipeline

# load faster-whisper model in the usual way
model = WhisperModel("medium", device="cuda", compute_type="float16")

# apply batched pipeline
batched_model = BatchedInferencePipeline(model=model)

# predict using the batched_model
result = batched_model.transcribe("audio.mp3", batch_size=16)

for segment, info in result:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Quality Improvements

  1. Consistency across runs: By setting the model seed, consistency across runs is improved.
  2. Reducing hallucinations: Stricter checks in the inference pipeline reduce unstructured or repeated phrases.
  3. Reliable language detection: A new function detects language more reliably by considering highly confident and random segments, breaking ties to determine the major language.
  4. Code-switching support: Handles audio with multiple languages by detecting the language every 30 seconds and dynamically directing the data flow. Since the exact switching position within a segment is unknown, the detected switch point can be off by up to one 30-second segment.

Language detection Usage:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")

Benchmarking:

A. Open source benchmarking:

Open_asr_eval consists solely of short-form audio, with an average duration generally under 10 seconds. Hence, using a subset of the YouTube-Commons dataset, we've tested more complex use cases with long-form audio. The Whisper medium model is used (with batch size = 8 for the batched versions) in these experiments. The dataset card for youtube-commons-asr-eval is mobiuslabsgmbh/youtube-commons-asr-eval.

Speed (x real-time):

System GPU speed CPU speed
OpenAI Whisper 8.2x 4.5x
faster-whisper 20.1x 5.6x
HF Whisper (batched) 59.3x 8.4x
Batched Faster-Whisper 104x 14.6x

WER:

System WER
OpenAI Whisper 15.1
faster-whisper 14.6
HF Whisper (batched) 16.8
Batched Faster-Whisper 13.1

B. Internal dataset:

Since the transcriptions in the open-source dataset are unverified, they can contain various types of errors. Additional internal benchmarking ensures robustness across various scenarios. A smaller test set (84 minutes) with verified ground truth is used for verifying the transcription quality and speed. The test set contains 9 audios ranging from 3 minutes to 13 minutes and various audio types.

System WER Speed
OpenAI Whisper 6.8 9.1x
faster-whisper 6.1 17.4x
HF Whisper (batched) 8.2 42.8x
Batched Faster-Whisper 6.5 86.6x

Batched processing speeds up long-form audio without causing an increase in WER. Users can easily switch between sequential and batched Faster Whisper versions based on specific requirements.

Thank you in advance!

Acknowledgements

This is work done at Mobiuslabs GmbH. Contact Dr. Jilt Sebastian for any queries or requests.

BBC-Esq commented 4 months ago

I'm curious what the beam size settings were for the above tests/comparisons?

Jiltseb commented 4 months ago

I'm curious what the beam size settings were for the above tests/comparisons?

beam_size is kept at default (5 beams) for all the experiments.

BBC-Esq commented 4 months ago

Excellent. I'd love to switch back to faster-whisper. I departed when a faster option came out named WhisperS2T. Have you had a chance to bench/compare with that one, located here? I spent a few hours reviewing your code and WhisperS2T's; they're fairly similar. However, the other parts of faster-whisper add additional features.

https://github.com/shashikg/WhisperS2T

Also, do you have a tokens/second metric, or perhaps total processing time? I'm not familiar with the "x real-time" metric. I'll most likely try testing the code myself today, but am not an expert.

Jiltseb commented 4 months ago

Excellent. I'd love to switch back to faster-whisper. I departed when a faster option came out named WhisperS2T. Have you had a chance to bench/compare with that one located here? I spent a few hours reviewing yours and its code, fairly similar. However, the other parts of faster-whisper add additional features.

https://github.com/shashikg/WhisperS2T

Also, do you have a tokens/second metric or total processing time perhaps? I'm not familiar with the x"realtime" metric. I'll most likely try testing the code myself today but am not an expert.

It looks similar, as it also uses batching via VAD. Arguments such as beam_size, best_of, temperature and several other parameters are predefined in WhisperS2T to make it fast, though their effect on WER varies with audio type.

I do not have a benchmark against it, as the primary motive of this PR is to provide batching support for faster_whisper itself, along with additional quality improvements. I just ran it on the internal test set, keeping beam_size, best_of, and temperature the same as in WhisperS2T, though there are more such default values in WhisperS2T. You are welcome to try out both.

System GPU speed WER
Whisper_S2T 110x 7.7
Batched Faster-Whisper 107x 6.4

For your question on speed measurement: ASR systems have generally used RTF (real-time factor) in the past, mostly for short-form transcription. The current measure (x real-time) is 1/RTF, as it is easier to understand. It is the ratio of the total duration of the audio file to the processing time, so you can derive the processing time from it if needed. It shows how fast the processing is relative to the total audio duration. Tokens/sec is typically used for decoder-only models (such as LLMs), as it measures decoding speed.
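
To make the metric concrete, here is a minimal illustration (the numbers below are arbitrary, not from the benchmarks above):

def x_realtime(audio_duration_s: float, processing_time_s: float) -> float:
    """Speed as a multiple of real time: audio duration / processing time (i.e. 1 / RTF)."""
    return audio_duration_s / processing_time_s

# A 10-minute (600 s) file transcribed in 30 s runs at 20x real-time,
# so processing time can be recovered as audio duration / (x real-time).
print(x_realtime(600.0, 30.0))  # 20.0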

Jiltseb commented 4 months ago

@Purfview @trungkienbkhn

BBC-Esq commented 4 months ago

Thank you, audio duration relative to processing time makes sense as a "real-time" test; I just hadn't heard of it. Do you have the Python script I could try? I love benchmarking.

At any rate, it's my understanding that WhisperS2T segments an audio file with the VAD, but those "chunks"/"segments" are not necessarily what is sent to the Whisper model in a batch. It includes another step where it splits the segments if they're longer than the 30-second window that Whisper is capable of processing (WhisperS2T actually specifies a "max_len" of 29 seconds). Additionally, it will "stitch" consecutive segments together as long as, collectively, they don't exceed the 29-second window. I'm guessing it does this to maximize the size of each chunk within each batch. Might be worth checking whether this improves your approach as well? I believe this is in the audio.py script.

Also, what's your opinion on enable_ta_fe and the use of "Kaldi" compared to WhisperS2T's approach? It's my understanding that both the original Whisper and WhisperS2T rely on TorchSTFT, but that WhisperS2T uses a custom approach while the original Whisper uses TorchSTFT directly? Also, WhisperS2T uses a pre-computed filterbank stored in an asset file while batched Faster-Whisper computes it on the fly (like the original Whisper approach).

Just curious how these specifically compare. Perhaps using a pre-computed mel filterbank with your approach might improve things.

Overall, congratulations on the WER. That's arguably more important than the minimal difference in speed between the two IMHO. I look forward to possibly switching back to faster-whisper if/when it's merged! If you have the benchmarking script I'd like to test just for fun. Thanks!

trungkienbkhn commented 4 months ago

@Jiltseb , thank you for your interesting PR. I ran some benchmarks from here with an NVIDIA Tesla V100S GPU and the large-v3 model (compute_type='float16', batch_size=16). Below are the results:

1. Speed benchmark: Processing audio with duration 13:19.231. Detected language 'fr' with probability 1.00.

System Min execution time
Faster-Whisper 48.622s
Batched Faster-Whisper 14.776s

2. WER benchmark: Dataset: librispeech_asr Number of audio used for evaluation: 500

System WER
Faster-Whisper 3.097
Batched Faster-Whisper 1.773

3. Memory benchmark: GPU name: Tesla V100S-PCIE-32GB GPU device index: 0

System Maximum increase of RAM Maximum GPU memory usage Maximum GPU power usage
Faster-Whisper 1099 MiB 4644MiB / 32768MiB 245W / 250W
Batched Faster-Whisper 1768 MiB 9024MiB / 32768MiB 273W / 250W

And here is the comparison log between FW and batched FW:

In conclusion, I found that the speed has improved a lot (while the WER benchmark improves as well). I think it will be very feasible to apply batched FW to real-time ASR problems.

Jiltseb commented 4 months ago

@Jiltseb , thank you for your interesting PR. I ran some benchmarks from here with GPU NVIDIA Tesla V100S and large-v3 model (compute_type='float16', batch_size=16) Below are the results:

1. Speed benchmark: Processing audio with duration 13:19.231s Detected language 'fr' with probability 1.00

System Min execution time
Faster-Whisper 48.622s
Batched Faster-Whisper 14.776s

2. WER benchmark: Dataset: librispeech_asr Number of audio used for evaluation: 500

System WER
Faster-Whisper 3.097
Batched Faster-Whisper 1.773

3. Memory benchmark: GPU name: Tesla V100S-PCIE-32GB GPU device index: 0

System Maximum increase of RAM Maximum GPU memory usage Maximum GPU power usage
Faster-Whisper 1099 MiB 4644 MiB / 32768 MiB 245W / 250W
Batched Faster-Whisper 1768 MiB 9024 MiB / 32768 MiB 273W / 250W

And here is the comparison log between FW and batched FW:

  • Faster-whisper:
Processing audio with duration 13:19.231
Detected language 'fr' with probability 1.00
Processing segment at 00:00.000
[0.00s -> 3.00s]  eric vautier Relecteur 1er essai
[17.00s -> 18.00s]  Bonsoir.
[21.00s -> 24.00s]  Notre planète est recouverte à 70 % d'océans,
[25.00s -> 28.00s]  et pourtant, étrangement, on a choisi de l'appeler la Terre.
Processing segment at 00:28.000
[28.00s -> 32.00s]  Le poète Edward Williams a une vision bien plus objective
[32.00s -> 35.00s]  et moins anthropocentrique quand il dit que,
[35.00s -> 37.00s]  vu de l'espace, la planète est bleue.
[37.00s -> 40.00s]  Vu de l'espace, elle est le territoire non pas des hommes,
[40.00s -> 42.00s]  mais des baleines.
[42.00s -> 45.00s]  Et pourtant, on vient tous de l'océan.
[45.00s -> 48.00s]  C'est le berceau de la vie, même si on l'a oublié.
[48.00s -> 51.00s]  L'océan est partout, dans les glaciers,
[51.00s -> 53.00s]  dans les rivières, dans les nappes phréatiques,
[53.00s -> 55.00s]  dans les cellules des êtres vivants,
[55.00s -> 57.00s]  et dans nos veines.
Processing segment at 00:58.000
[58.00s -> 61.00s]  Étrangement, c'est John Fitzgerald Kennedy
[61.00s -> 64.00s]  qui l'a assez bien illustré dans cette citation.
[64.00s -> 66.00s]  « Il est un fait biologique intéressant
[66.00s -> 68.00s]  que chacun d'entre nous ait dans les veines
[68.00s -> 70.00s]  un pourcentage identique de sel dans le sang
[70.00s -> 72.00s]  à celui qui existe dans les océans.
[72.00s -> 74.00s]  Nous avons donc tous du sel dans notre sang,
[74.00s -> 76.00s]  dans notre sueur, dans nos larmes.
[76.00s -> 78.00s]  Nous sommes liés à l'océan.
[78.00s -> 80.00s]  Et quand nous retournons à la mer,
[80.00s -> 82.00s]  que ce soit pour naviguer ou pour la regarder,
[82.00s -> 84.00s]  nous retournons d'où nous venons. »
[86.00s -> 88.00s]  Et pourtant, cet océan, on le connaît
Processing segment at 01:28.000
[88.00s -> 90.00s]  très très mal.
[90.00s -> 92.00s]  Ça reste un monde assez étrange
[92.00s -> 94.00s]  et étranger
[94.00s -> 96.00s]  et qui fait peur parfois
...
  • Batched faster-whisper
Detected language: fr (1.00) in first 30s of audio...
[17.19s -> 41.46s]  Bonsoir. Notre planète est recouverte à 70% d'océans, et pourtant, étrangement, on a choisi de l'appeler la Terre. Le poète Edward Williams a une vision bien plus objective et moins anthropocentrique quand il dit que vu de l'espace, la planète est bleue. Vu de l'espace, elle est le territoire non pas des hommes, mais des baleines.
[42.86s -> 56.68s]  Et pourtant, on vient tous de l'océan, c'est le berceau de la vie, même si on l'a oublié. L'océan est partout, dans les glaciers, dans les rivières, dans les nappes phréatiques, dans les cellules des êtres vivants, et dans nos veines.
[58.72s -> 84.32s]  Étrangement, c'est John Fitzgerald Kennedy qui l'a assez bien illustré dans cette citation. Il est un fait biologique intéressant que chacun d'entre nous ait dans les veines un pourcentage identique de sel dans le sang à celui qui existe dans les océans. Nous avons donc tous du sel dans notre sang, dans notre sueur, dans nos larmes. Nous sommes liés à l'océan. Et quand nous retournons à la mer, que ce soit pour naviguer ou pour la regarder, nous retournons d'où nous venons.
[85.96s -> 96.62s]  Et pourtant, cet océan, on le connaît très très mal. Ça reste un monde assez étrange et étranger, et qui fait peur parfois.
....

=> It can be seen that the hallucinations of batched FW have decreased (hallucination from 0-17s has disappeared), but FW's timestamp log will be denser.

In conclusion, I found that the speed has improved a lot (while wer benchmark is still improving). I think it will be very feasible to apply batched FW to realtime problems for ASR.

Yes, unlike faster-whisper, there is no additional step that divides 30-second predictions into segments. This makes the timestamp logs less dense.

Yes, speed is improved owing to batching and faster feature extraction. We have a sweet spot with improved WER and speed in general.

BBC-Esq commented 4 months ago

@Jiltseb and @trungkienbkhn

One more improvement, which it might be prudent to implement after this PR once its groundwork is incorporated, is as follows:

WhisperS2T accepts a list of audio files, which speeds up processing of multiple files by a decent amount. For example, it'll take 5 audio files, use VAD to get the audio segments to transcribe (i.e. removing silent portions), stitch some back together (as I outlined above), and then send the chunks as part of a batch. And if it's processing multiple files, it'll include the segments/chunks from different audio files in the same batch. For example, let's say Audio 1 is a "relatively" short audio that is broken up into 20 segments after VAD and "stitching" and is sent to the all-powerful CTranslate2 generator...but let's assume the batch size is 40...the batch size of 40 will not be saturated. Let's further assume that Audio 2 would also be processed as 20 segments/chunks...WhisperS2T handles this "list" of audio files by combining the chunks from Audio 1 and Audio 2 into a single batch despite the chunks originating from different audio files. It then keeps track via metadata of which file each chunk/segment comes from as well as its timestamp information.

If I understand your PR correctly, it's geared towards processing one file at a time. You might look at accepting a "list" of audio files and handling it this way. I've tested it and it's faster than processing multiple files individually. Again, this might be something to incorporate after this PR's groundwork is in place, but I wanted to put it on your radar.
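
A rough sketch of that cross-file batching idea (the Chunk type and helper below are hypothetical illustrations, not WhisperS2T's actual code):

from dataclasses import dataclass
from typing import Dict, Iterator, List

import numpy as np

@dataclass
class Chunk:
    file_id: str         # which audio file this chunk came from
    start: float         # start time within that file, in seconds
    end: float           # end time within that file, in seconds
    samples: np.ndarray  # audio samples for this chunk

def batch_across_files(chunks_per_file: Dict[str, List[Chunk]], batch_size: int) -> Iterator[List[Chunk]]:
    """Pool VAD chunks from all files into fixed-size batches, keeping per-chunk metadata.

    A real implementation would also stitch consecutive segments up to the ~29-second
    window before batching, as described above.
    """
    pool = [chunk for chunks in chunks_per_file.values() for chunk in chunks]
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]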

HOWEVER, there is a BIG PROBLEM with how WhisperS2T handles this. If a single audio file fails, it causes the whole list of audio files to fail...I outlined this here:

https://github.com/shashikg/WhisperS2T/issues/50

The WhisperS2T owner said it would be an easy fix until he got caught up with work-related stuff and hasn't updated the repo in months so...just FYI.

Anyways, here are some benchmarks of WhisperS2T. You'll notice that as the batch size increases, VRAM increases linearly and tokens/s flatlines at a certain point. This is the point where, even though batches are being put in VRAM, the CUDA cores are saturated. I didn't include the tiny model because I would never use it for reliable transcriptions: image

[EDIT] Just FYI, the benchmark I did was in increments of 2, 4, 6, 8, etc.

Jiltseb commented 4 months ago

@trungkienbkhn Can you review the PR?

aleksandr-smechov commented 3 months ago

@Jiltseb Awesome work, will be incorporating this into wordcab-transcribe. Curious about the codename choice, Mobius :)

BBC-Esq commented 3 months ago

Hey @MahmoudAshraf97 , what's a good link in your repo code where you implement whisperx? I've been meaning to benchmark it alongside my other benchmarks of other backends.

MahmoudAshraf97 commented 3 months ago

Hi @BBC-Esq , first of all, I enjoy reading your thorough comments and analysis and the useful insights they contain. In my repo I use whisperx transcription directly without any modification (see here), or perhaps you mean the forced alignment part?

trungkienbkhn commented 3 months ago

@trungkienbkhn Can you review the PR?

@Jiltseb , sorry for the late reply. I'm a bit busy at the office at the moment. But I’ll review your PR and get back to you as soon as possible.

hobodrifterdavid commented 3 months ago

@Jiltseb

"Yes, unlike faster-whisper, there is no additional step that divides 30-second predictions into segments. This makes the timestamp logs less dense."

Hello. Is it possible to also split up transcripts into smaller pieces, to make nice subtitles? Do you pass the recognised text from the previous 30s chunk when recognising the following chunk?

BBC-Esq commented 3 months ago

Here are my most recent benchmarks in case you're interested. Since faster-whisper only has a batch size of 1 currently, it's at the far left, but hopefully this pull request will be incorporated soon and it can look forward to the speedups illustrated by the whisperX and whisperS2T repositories!

image

MahmoudAshraf97 commented 3 months ago

Here are my most recent benchmarks in case you're interested. Since faster-whisper only has a batch size of 1 currently it's at the far left, but hopefully soon this pull request will be incorporated and it can look forward to the speedups illustrated by the whisperX and whisperS2T repositories!

image

I see that WhisperS2T starts getting faster for batch sizes > 4. I honestly prefer that implementation over whisperx because it's cleaner IMO, i.e. it doesn't use the HF pipeline iterator, which I find hard to understand. I guess it's also important to note that these benchmarks test the whole transcription pipeline, which might not be indicative of which batching implementation is faster, because they also measure data preparation and other pre/post-processing steps.

BBC-Esq commented 3 months ago

Here are my most recent benchmarks in case you're interested. Since faster-whisper only has a batch size of 1 currently it's at the far left, but hopefully soon this pull request will be incorporated and it can look forward to the speedups illustrated by the whisperX and whisperS2T repositories! image

I see that WhisperS2T starts getting faster for batch sizes > 4, I honestly prefer that implementation over whisperx because it's cleaner IMO, i.e. it doesn't use the HF pipeline iterator which I find hard to understand. I guess it's also important to note that these benchmarks are testing the whole transcription pipeline, which might not be indicative of which batching implementation is faster because it also measures data preparation and other pre/post processing steps

Good point. I do try to control for those things as much as possible; for example, specifying no timestamps or no diarization for all backends being tested, but there can be slight differences...WhisperX has a lot of options compared to WhisperS2T that might add compute time, but I'll explain the benefits. I might say, "WhisperS2T is faster by 10%, but faster-whisper has 5 features that it doesn't...you decide if the 10% is worth not being able to use the 5 features."

Other things I try to control for are:

  • Re VRAM - taking a baseline reading before loading the model, polling max usage while processing, and then subtracting the baseline from the highest polled reading.
  • Making sure my GPU is on the same "power state" for all tests.
  • Processing the same audio file.
  • Using the same version of dependencies - i.e. there might have been even a small efficiency improvement in a new release version.

Some backends might not expose certain parameters, and I'll be upfront about that. HF's implementation is horrible in terms of VRAM for anything above a beam size of 1...that's a huge benefit of the ctranslate2 backend because, as you know, beam size affects the word error rate. On the flip side, Transformers is so widely used that the 1% improved WER using ctranslate2 might not be worth adding another dependency to a program that relies on Transformers...learning a new API, etc...To give another example, CTranslate2 is completely dominated in terms of the sheer number of chat model architectures that Transformers supports, so...again, let the user decide. I try to be up front that there's very rarely a winner.

I feel like I've worn out my welcome discussing benchmarks on this pull request thread, but if you'd like to continue discussing periodically and possibly offer some advice I'd be interested in discussing more on Discord or what not. It's my fault though because I posted my fancy benchmark graphs... ;-)

P.S., When I say "horrible" for HF's implementation of beam size...I mean, using a beam size of 5 for anything above the small.en Whisper model results in an OOM on my RTX 4090. Somehow, they've implemented that parameter completely different than ctranslate2. I can't even include HF in my graphs unless I re-test all backends with a beam size of 1 because of this.

Jiltseb commented 3 months ago

Here are my most recent benchmarks in case you're interested. Since faster-whisper only has a batch size of 1 currently it's at the far left, but hopefully soon this pull request will be incorporated and it can look forward to the speedups illustrated by the whisperX and whisperS2T repositories! image

I see that WhisperS2T starts getting faster for batch sizes > 4, I honestly prefer that implementation over whisperx because it's cleaner IMO, i.e. it doesn't use the HF pipeline iterator which I find hard to understand. I guess it's also important to note that these benchmarks are testing the whole transcription pipeline, which might not be indicative of which batching implementation is faster because it also measures data preparation and other pre/post processing steps

Good point. I do try to control for those things as much as possible; for example, specifying no timestamps or no diarization for all backends being tested but there can be slight differences...WhisperX has a lot of options compared to WhisperS2T that might add compute time but I'll explain the benefits. I might say..."WhisperS2T is faster by 10%, but faster-whisper has 5 features that it doesn't...you decide if the 10% is worth not being able to use the 5 features.

Other things I try to control for are:

  • Re VRAM - taking a baseline reading before loading the model, polling max usage while processing, and then subtracting the baseline from the highest polled reading.
  • Making sure my GPU is on the same "power state" for all tests.
  • Processing the same audio file.
  • Using the same version of dependencies - i.e. there might have been even a small efficiency improvement in a new release version.

Some backends might not expose certain parameters and I'll be upfront about that. HF's implementation is horrible in terms of VRAM anything above a beam size of 1...That's a huge benefit of the ctranslate2 backend because, as you know, it affects the word error rate. On the flip side, Transformers is so widely used that the 1% improved WER using ctranslate2 might not be worth adding another dependency to a program that relies on Transformers...learning a new API, etc...To give another example, Ctranslate2 is completely dominated in terms of the sheer number of chat model architectures that Transformers supports so...again, let the user decide. I try to be up front that there's very rarely a winner.

I feel like I've worn out my welcome discussing benchmarks on this pull request thread, but if you'd like to continue discussing periodically and possibly offer some advice I'd be interested in discussing more on Discord or what not. It's my fault though because I posted my fancy benchmark graphs... ;-)

P.S., When I say "horrible" for HF's implementation of beam size...I mean, using a beam size of 5 for anything above the small.en Whisper model results in an OOM on my RTX 4090. Somehow, they've implemented that parameter completely different than ctranslate2. I can't even include HF in my graphs unless I re-test all backends with a beam size of 1 because of this.

Thanks for the benchmark between different ctranslate2 Whisper implementations. Of course, it does not make sense to compare Faster Whisper with them without batching; you could have used the development version for testing that. Also, I assume you use the same parameter set (temperature, beam_size) across the different methods. Benchmarking is incomplete without measuring the quality of transcription, so I would suggest you use the WER metric, as there are some differences between these implementations.
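
For example, a WER number could be added to such a benchmark with a small helper (a sketch; jiwer is a third-party package, not something this PR depends on):

# Sketch: word error rate for one reference/hypothesis pair using the third-party jiwer package.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")  # 1 substitution / 9 reference words ≈ 0.111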

Since you are interested in VRAM usage: We recently released a blog post explaining how to improve tokens/sec decoding speed (up to 6x) for torch-based Whisper models via torch compilation and HQQ quantization. https://mobiusml.github.io/whisper-static-cache-blog/

Requested feature to implement HQQ quantization in ctranslate2 format: https://github.com/OpenNMT/CTranslate2/issues/1717

trungkienbkhn commented 3 months ago

@Jiltseb , I tested the new option enable_ta_fe=True. But it seems to be slower than normal FW (I don't use multi-batch). Total transcription times for comparison are as follows:

  • with enable_ta_fe=True: 157.502s
  • without: 141.217s

I used this audio for testing.

Below is my logic:

model = WhisperModel('large_v3', device="cuda")
segments, info = model.transcribe(audio_path, enable_ta_fe=True)
for segment in segments:
    print("[%.5fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Could you take a look?

Jiltseb commented 3 months ago

@Jiltseb , I tested new option enable_ta_fe=True. But it seems to be slower than normal FW (I don't use multi batch). Total transcription time for comparison as follows:

  • with enable_ta_fe=True: 157.502s
  • without: 141.217s

I used this audio for testing.

Below is my logic:

model = WhisperModel('large_v3', device="cuda")
segments, info = model.transcribe(audio_path, enable_ta_fe=True)
for segment in segments:
    print("[%.5fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Could you take a look ?

I used the same inference (typo for large-v3). Here is the result:

[Result screenshots: NVIDIA TITAN RTX GPU / CPU: float16 / CPU: int8]

The speed of faster-whisper with normal feature extraction on GPU should be more than 5.6x real-time (which is what a total time of 141.217 sec implies). With my tests, it comes to 14.58x, which is more realistic.

Could you please verify your implementation? Even without enable_ta_fe=True, it should be faster.

BBC-Esq commented 3 months ago

@Jiltseb I'm editing this message to correct myself because it's early and I haven't had my coffee, to be honest. To clarify: for chat models I turn sampling completely off, but for my Whisper benchmarks I try to control for and mimic these exact settings below, which are the defaults in the model.py script of the whisperS2T library:

FAST_ASR_OPTIONS = {
    "beam_size": 1,
    "best_of": 1, # Placeholder
    "patience": 1,
    "length_penalty": 1,
    "repetition_penalty": 1.01,
    "no_repeat_ngram_size": 0,
    "compression_ratio_threshold": 2.4, # Placeholder
    "log_prob_threshold": -1.0, # Placeholder
    "no_speech_threshold": 0.5, # Placeholder
    "prefix": None, # Placeholder
    "suppress_blank": True,
    "suppress_tokens": [-1],
    "without_timestamps": True,
    "max_initial_timestamp": 1.0,
    "word_timestamps": False, # Placeholder
    "sampling_temperature": 1.0,
    "return_scores": True,
    "return_no_speech_prob": True,
    "word_aligner_model": 'tiny',
}

Of course, I explicitly change the batch size and beam size though for my benches.
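
For reference, a rough faster-whisper equivalent of those WhisperS2T defaults might look like the sketch below (not a strict mapping: sampling_temperature corresponds to temperature, and return_scores, return_no_speech_prob, and word_aligner_model have no direct counterparts in faster-whisper's transcribe()):

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")

# Roughly mirrors the FAST_ASR_OPTIONS dict above using faster-whisper's transcribe() parameters.
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=1,
    best_of=1,
    patience=1,
    length_penalty=1,
    repetition_penalty=1.01,
    no_repeat_ngram_size=0,
    compression_ratio_threshold=2.4,
    log_prob_threshold=-1.0,
    no_speech_threshold=0.5,
    prefix=None,
    suppress_blank=True,
    suppress_tokens=[-1],
    without_timestamps=True,
    max_initial_timestamp=1.0,
    word_timestamps=False,
    temperature=1.0,
)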

BTW, I'm somewhat familiar with torch.compile...but isn't it only implemented for Linux currently, not windows?

Jiltseb commented 3 months ago

@Jiltseb Yep, same sampling turned off, trying to do a "greedy" search basically with 5 beams. obviously no temperature since it's greedy.

BTW, I'm somewhat familiar with torch.compile...but isn't it only implemented for Linux currently, not windows?

What do you mean by "greedy" search with 5 beams? Greedy decoding selects the most probable token at each step. In your experiments, please keep beam_size =1 and a single sampling temperature across all methods.

Torch.compile comes with torch so, as long as torch is working properly on windows, it should support. https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html

What I am saying is there could be a potential future method with faster-whisper with reduced VRAM usage using HQQ.

BBC-Esq commented 3 months ago

@Jiltseb Yep, same sampling turned off, trying to do a "greedy" search basically with 5 beams. obviously no temperature since it's greedy. BTW, I'm somewhat familiar with torch.compile...but isn't it only implemented for Linux currently, not windows?

What do you mean by "greedy" search with 5 beams? Greedy decoding selects the most probable token at each step. In your experiments, please keep beam_size =1 and a single sampling temperature across all methods.

Torch.compile comes with torch so, as long as torch is working properly on windows, it should support. https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html

What I am saying is there could be a potential future method with faster-whisper with reduced VRAM usage using HQQ.

I revised my comment, but you probably got an e-mail with my old post before I had coffee in me...the parameters I control for are the ones above, pulled from the WhisperS2T library defaults (except for beam size and batch size). Hope that clarifies things.

BBC-Esq commented 3 months ago

@Jiltseb I retested WhisperX at a batch size of 40 for the small.en and base.en models, this time using a beam size of 1, and it resulted in approximately 10% more tokens/second and minimal change in VRAM used (within margin of error). By contrast, I just benched HF's implementation so you can see how drastically beam size affects it...

HF, 1 beam, batch 40, base.en model = 1712.12 MB VRAM and 839.88 tokens/s

HF, 5 beams (same other parameters) = 9829.62 MB VRAM and 156.31 tokens/s

Since VRAM goes up roughly 5x and tok/s drops roughly 5x when using 5 beams, HF's implementation is not sharing resources somehow, which is significantly different from ctranslate2's implementation. Using 5 beams with anything above the small.en model results in OOM on my 4090...

BTW, I'd love to start testing WER but just have to find the time...Also, it's just a matter of time before I re-test every backend with every permutation of batch size, beam size 1, 2, 3, and so on...

EDIT - I stand corrected...with HF, even the small.en model maxes out VRAM, although it doesn't trigger the OOM error: image

trungkienbkhn commented 3 months ago

Could you please verify your implementation as even without enable_ta_fe=True, it needs to be faster?

@Jiltseb , yes, I confirm it's faster. Thanks. I have another question: is it possible to support word_timestamps=True for multi-batch mode? I see that it is statically set to False.

Jiltseb commented 3 months ago

Could you please verify your implementation as even without enable_ta_fe=True, it needs to be faster?

@Jiltseb , Yes I confirm it's faster. Tks. I have another question, is it possible to support word_timestamps=True for multi-batch mode? I see that it is statically set to False.

Yes, this should be possible using the alignment function in ctranslate2 models; I am working on it already. Alignment can be slightly worse than in the un-batched version, since the matching is performed over a longer sequence.

Purfview commented 3 months ago

I just took a quick look at the PR; I think I see lots of stuff not actually in use by anything. And it's using an external VAD when FW already has an internal VAD.

Imho, with so many features crammed into one PR, this looks more suitable for a separate fork, and it's hard to decipher what could actually be incorporated without breaking compatibility. Maybe it would be wise to break this up into separate PRs - one feature per PR.

Jiltseb commented 3 months ago

Hi @trungkienbkhn , can you check the data quota of the repo? I guess my fork is tied to this repo's quota. Adding the VAD model (18 MB) as a Git LFS file throws an over-the-data-quota error.

trungkienbkhn commented 3 months ago

Hi @trungkienbkhn , Can you check the data quota of the repo? I guess my fork is tied to this repo's quota. VAD model (18 MB) as a Git LFS throws over the data quota error.

There is no special config for this repo. However github has a general limit for adding files:

GitHub limits the maximum file size (or sizes) you can add to your repository to 50 MB.

Could you show the exact error that you encountered?

Jiltseb commented 3 months ago

Hi @trungkienbkhn , Can you check the data quota of the repo? I guess my fork is tied to this repo's quota. VAD model (18 MB) as a Git LFS throws over the data quota error.

There is no special config for this repo. However github has a general limit for adding files:

GitHub limits the maximum file size (or sizes) you can add to your repository to 50 MB.

Could you show the exact error that you encountered?

batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.  
error: failed to push some refs to 'https://github.com/mobiusml/faster-whisper.git'
trungkienbkhn commented 3 months ago

batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to push some refs to 'https://github.com/mobiusml/faster-whisper.git'

It seems that this error comes from your account limit, not from FW repo. Following this link, you can check your git LFS usage and upgrade it if needed.

Every account using Git Large File Storage receives 1 GiB of free storage and 1 GiB a month of free bandwidth. If the bandwidth and storage quotas are not enough, you can choose to purchase an additional quota for Git LFS. Unused bandwidth doesn't roll over month-to-month.

Jiltseb commented 3 months ago

batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access. error: failed to push some refs to 'https://github.com/mobiusml/faster-whisper.git'

It seems that this error comes from your account limit, not from FW repo. Following this link, you can check your git LFS usage and upgrade it if needed.

Every account using Git Large File Storage receives 1 GiB of free storage and 1 GiB a month of free bandwidth. If the bandwidth and storage quotas are not enough, you can choose to purchase an additional quota for Git LFS. Unused bandwidth doesn't roll over month-to-month.

SYSTRAN limits are causing this? See here

Bandwidth and storage usage only count against the repository owner's account. In forks, bandwidth and storage usage count against the root of the repository network. Anyone with write access to a repository can push files to Git LFS without affecting their personal bandwidth and storage quotas or purchasing data packs. Forking and pulling a repository counts against the parent repository's bandwidth usage.

Usage from my account and my organization is within the limits, and enough quota is available. I think the project needs to set up Git LFS.

As a workaround, could you download the model file and upload it as a Git LFS file, or keep the model in your HF repo? I will make the other changes and commit.

ooobo commented 3 months ago

I just quick looked at the PR, I think I see lots of stuff not actually in use by anything. And it's using external VAD when FW has internal VAD already.

Imho, this PR with many features crammed into one PR looks more suitable for a separate fork, and it's hard to decipher what actually could be incorporated without breaking compatibility. Maybe it would be wise to break this PR to separate PRs - one feature per PR.

As someone who is using faster-whisper for short-form transcription (no batching at all), I'm a bit concerned about the size of the PR and how it could break compatibility with my existing code going forward - it would be great if this could be separate PRs.

Jiltseb commented 3 months ago

Added the changes below to the previous version:

  1. Get word-level timestamps with the optional word_timestamps=True (usage sketch at the end of this comment).
  2. Get log probability and no-speech probability scores for all the segments.
  3. Added a local VAD model file and removed the dependency on an external URL.
  4. Removed redundant info and dependencies; fixed minor typos.

Note that in this PR, nothing is changed for the unbatched version, other than some optional parameters to select if needed.
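
Usage sketch for item 1 above (word-level timestamps in the batched pipeline). This assumes segment.words carries the same Word objects (start, end, word) as the sequential API, and that batched_model is constructed and iterated as in the PR description example:

result = batched_model.transcribe("audio.mp3", batch_size=16, word_timestamps=True)

for segment, info in result:
    # Assumption: each segment exposes .words like the sequential faster-whisper API.
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))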

hargunmujral commented 3 months ago

Would be great to get this PR merged soon, batching support would have a lot of impact 🙏

hargunmujral commented 3 months ago

Also, being able to specify a custom VAD path would be helpful. Plus, when I tried to pass in use_vad_model as False, I got a runtime error. Can you verify / show how to run without using a VAD model?

Jobus0 commented 3 months ago

When switching from the latest release (v1.0.2) to this without changing any code (so not using the batch pipeline), I'm seeing a consistent ~25% reduction in inference speed on my 5-second clip with CUDA.

model = WhisperModel(
            "distil-large-v3",
            device="cuda",
            compute_type="float16")

segments, info = model.transcribe(
                file_path,
                beam_size=5)

Anyone else seeing a performance degradation with this when transcribing short clips without batching?

Jiltseb commented 3 months ago

Also being able to specify a custom VAD path would be helpful. Plus, when I tried to pass in use_vad_model as false, I got a runtime error. Can you verify / show how to run without using a vad model?

Each VAD implementation can have some differences. For this reason, you can either use the internal VAD (default) or provide vad_segments (a list of dicts from the output of your external VAD model) in the following example format: [{'start': 2.99, 'end': 25.80, 'segments': [(2.99, 14.50), (15.1, 21.32), (22.00, 25.80)]}, {...}]
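
A minimal usage sketch of that option, assuming use_vad_model is a pipeline-construction argument (as the "while loading the model" error message quoted below suggests) and that transcribe() accepts the vad_segments list in the format described above:

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
# Assumption: disable the internal VAD at pipeline construction time.
batched_model = BatchedInferencePipeline(model=model, use_vad_model=False)

# External VAD output in the format described above.
external_vad = [
    {"start": 2.99, "end": 25.80, "segments": [(2.99, 14.50), (15.1, 21.32), (22.00, 25.80)]},
]
result = batched_model.transcribe("audio.mp3", batch_size=16, vad_segments=external_vad)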

hargunmujral commented 3 months ago

When switching from the latest release (v1.0.2) to this without changing any code (so not using the batch pipeline), I'm seeing a consistent ~25% reduction to inference speed on my 5 seconds clip with CUDA.

model = WhisperModel(
            "distil-large-v3",
            device="cuda",
            compute_type="float16")

segments, info = model.transcribe(
                file_path,
                beam_size=5)

Anyone else seeing a performance degradation with this when transcribing short clips without batching?

I believe this is because of the use of the VAD model by default. Would it not make sense for this feature to be deactivated by default? Also, I have still not been able to get it working with use_vad_model set to False. It seems that there isn't an option to skip it whatsoever:

if not vad_segments:
    if self.use_vad_model:
        vad_segments = self.vad_model(
            {
                "waveform": torch.from_numpy(audio).unsqueeze(0).float(),
                "sample_rate": 16000,
            }
        )
        vad_segments = merge_chunks(
            vad_segments,
            self.chunk_size,
            onset=self.vad_onset,
            offset=self.vad_offset,
        )
    else:
        raise RuntimeError(
            "No vad segments found. Set 'use_vad_model' to True while loading the model"
        )
Jiltseb commented 3 months ago

When switching from the latest release (v1.0.2) to this without changing any code (so not using the batch pipeline), I'm seeing a consistent ~25% reduction to inference speed on my 5 seconds clip with CUDA.

model = WhisperModel(
            "distil-large-v3",
            device="cuda",
            compute_type="float16")

segments, info = model.transcribe(
                file_path,
                beam_size=5)

Anyone else seeing a performance degradation with this when transcribing short clips without batching?

I believe this is because of the use of the VAD model by default. Would it not make sense for this feature to be default deactivated? Also i have still not been able to get it working with use_vad_model set to False. It seems that there isn't an option to skip it whatsoever:

if not vad_segments:
    if self.use_vad_model:
        vad_segments = self.vad_model(
            {
                "waveform": torch.from_numpy(audio).unsqueeze(0).float(),
                "sample_rate": 16000,
            }
        )
        vad_segments = merge_chunks(
            vad_segments,
            self.chunk_size,
            onset=self.vad_onset,
            offset=self.vad_offset,
        )
    else:
        raise RuntimeError(
            "No vad segments found. Set 'use_vad_model' to True while loading the model"
        )

How can I reproduce the speed difference you get? I have tried both versions and can confirm the speed is similar on the benchmarking dataset. There is no need for batching for a 5-second audio clip anyway (you can combine several clips with silence in between if you want to run them at once with batching; see the sketch below). Deactivating the VAD by default would mean that you have to provide VAD segments yourself for it to segment the audio. If you set use_vad_model to False, this means that you will provide external VAD segments instead. What is your intention in setting use_vad_model to False?
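
For completeness, a sketch of the "combine short clips with silence in between" idea (decode_audio is faster-whisper's own loader; the file names are placeholders):

import numpy as np
from faster_whisper import decode_audio

clips = ["clip1.wav", "clip2.wav", "clip3.wav"]   # placeholder file names
silence = np.zeros(16000, dtype=np.float32)       # 1 second of silence at 16 kHz

pieces = []
for path in clips:
    pieces.append(decode_audio(path, sampling_rate=16000))
    pieces.append(silence)

combined = np.concatenate(pieces)  # a NumPy array can be passed to transcribe() in place of a path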

MahmoudAshraf97 commented 3 months ago

When switching from the latest release (v1.0.2) to this without changing any code (so not using the batch pipeline), I'm seeing a consistent ~25% reduction to inference speed on my 5 seconds clip with CUDA.

model = WhisperModel(
            "distil-large-v3",
            device="cuda",
            compute_type="float16")

segments, info = model.transcribe(
                file_path,
                beam_size=5)

Anyone else seeing a performance degradation with this when transcribing short clips without batching?

I believe this is because of the use of the VAD model by default. Would it not make sense for this feature to be default deactivated? Also i have still not been able to get it working with use_vad_model set to False. It seems that there isn't an option to skip it whatsoever:

if not vad_segments:
    if self.use_vad_model:
        vad_segments = self.vad_model(
            {
                "waveform": torch.from_numpy(audio).unsqueeze(0).float(),
                "sample_rate": 16000,
            }
        )
        vad_segments = merge_chunks(
            vad_segments,
            self.chunk_size,
            onset=self.vad_onset,
            offset=self.vad_offset,
        )
    else:
        raise RuntimeError(
            "No vad segments found. Set 'use_vad_model' to True while loading the model"
        )

How can I reproduce the speed difference you get? I have tried both versions and can confirm the speed is similar to the benchmarking dataset. There is no need for batching for the 5-second audio clip anyway (You can combine all of them with silence in between if you want to run it at once with batching) Making VAD deactivated would mean that you have to provide VAD segments for it to segment the audio. If you set use_vad_model to False, this means that you will provide external vad segments instead. What is your intention while setting use_vad_model to False?

I think if VAD is not used and VAD timestamps aren't provided, it should default to regular 30s chunking without any bells and whistles.

Jiltseb commented 3 months ago

When switching from the latest release (v1.0.2) to this without changing any code (so not using the batch pipeline), I'm seeing a consistent ~25% reduction to inference speed on my 5 seconds clip with CUDA.

model = WhisperModel(
            "distil-large-v3",
            device="cuda",
            compute_type="float16")

segments, info = model.transcribe(
                file_path,
                beam_size=5)

Anyone else seeing a performance degradation with this when transcribing short clips without batching?

I believe this is because of the use of the VAD model by default. Would it not make sense for this feature to be default deactivated? Also i have still not been able to get it working with use_vad_model set to False. It seems that there isn't an option to skip it whatsoever:

if not vad_segments:
    if self.use_vad_model:
        vad_segments = self.vad_model(
            {
                "waveform": torch.from_numpy(audio).unsqueeze(0).float(),
                "sample_rate": 16000,
            }
        )
        vad_segments = merge_chunks(
            vad_segments,
            self.chunk_size,
            onset=self.vad_onset,
            offset=self.vad_offset,
        )
    else:
        raise RuntimeError(
            "No vad segments found. Set 'use_vad_model' to True while loading the model"
        )

How can I reproduce the speed difference you get? I have tried both versions and can confirm the speed is similar to the benchmarking dataset. There is no need for batching for the 5-second audio clip anyway (You can combine all of them with silence in between if you want to run it at once with batching) Making VAD deactivated would mean that you have to provide VAD segments for it to segment the audio. If you set use_vad_model to False, this means that you will provide external vad segments instead. What is your intention while setting use_vad_model to False?

I think if vad is not used and vad timestamps aren't provided, it should default to regular 30s chunking without any bells and whistles

Do you mean uniform chunking? This can abruptly cut in the middle of a word as well, causing issues in transcription. It is possible to implement some LCS-based solution (such as in transformers) around the boundary, but it will affect the WER. I can easily add a case to bypass the VAD model if the audio duration is less than 30 sec, though.
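
For illustration, the plain uniform-chunking fallback being discussed could look roughly like this (a sketch only, producing pseudo-VAD segments in the format described earlier; the fixed boundaries are exactly what can cut through words):

def uniform_chunks(duration_s: float, chunk_s: float = 30.0):
    """Split [0, duration_s] into fixed-length chunks formatted like vad_segments."""
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append({"start": start, "end": end, "segments": [(start, end)]})
        start = end
    return chunks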

Jobus0 commented 3 months ago

When switching from the latest release (v1.0.2) to this without changing any code (so not using the batch pipeline), I'm seeing a consistent ~25% reduction to inference speed on my 5 seconds clip with CUDA.

How can I reproduce the speed difference you get? I have tried both versions and can confirm the speed is similar to the benchmarking dataset.

I just now set up two fresh projects. In the first, I ran pip install faster-whisper. In the second, I ran pip install git+https://github.com/mobiusml/faster-whisper.git. Other than those dependencies (and sub-dependencies), they are identical.

I then ran this simple non-batched script on both, first with a 5-second clip, and then with a 10-minute clip:

import faster_whisper
import time

model = faster_whisper.WhisperModel(
            "distil-large-v3",
            device="cuda",
            compute_type="float16")

# warm up
segments, info = model.transcribe("benchmark.wav", beam_size=5)

total_start_time = time.time()

repeats = 10
for i in range(repeats):
    start_time = time.time()
    segments, info = model.transcribe("benchmark.wav", beam_size=5)
    print(f"Elapsed time: {time.time() - start_time:.4f}")

print()
print(f"Total elapsed time: {time.time() - total_start_time:.4f}")
print(f"Average elapsed time: {(time.time() - total_start_time)/repeats:.4f}")

Results for 5 seconds clip

repository clip length average elapsed time relative %
SYSTRAN (original) 5 sec 0.1837 sec 100%
mobiusml (fork) 5 sec 0.2924 sec 159%

Note: “relative %” compares inference times, with SYSTRAN as the baseline. In this case, the fork takes 59% more time, which could be translated to it being ~38% slower.

Results for 10 minutes clip

repository clip length average elapsed time relative %
SYSTRAN (original) 10 min 0.8062 sec 100%
mobiusml (fork) 10 min 0.9063 sec 112%

I've rerun the script many times and get consistent results.

Screenshot of the process. Left side is the original repo, right side is the fork.

benchmark

OS: Windows 11 GPU: RTX 4070 Python: 3.12

Jiltseb commented 3 months ago

When switching from the latest release (v1.0.2) to this without changing any code (so not using the batch pipeline), I'm seeing a consistent ~25% reduction to inference speed on my 5 seconds clip with CUDA.

How can I reproduce the speed difference you get? I have tried both versions and can confirm the speed is similar to the benchmarking dataset.

I just now set up two new fresh projects. On the first, I ran pip install faster-whisper. On the second, I ran pip install git+https://github.com/mobiusml/faster-whisper.git. Other than that those dependencies (and sub-dependencies), they are identical.

I then ran this simple non-batched script on both, first with a 5 seconds clip, and then with a 10 minutes clip:

import faster_whisper
import time

model = faster_whisper.WhisperModel(
            "distil-large-v3",
            device="cuda",
            compute_type="float16")

# warm up
segments, info = model.transcribe("benchmark.wav", beam_size=5)

total_start_time = time.time()

repeats = 10
for i in range(repeats):
    start_time = time.time()
    segments, info = model.transcribe("benchmark.wav", beam_size=5)
    print(f"Elapsed time: {time.time() - start_time:.4f}")

print()
print(f"Total elapsed time: {time.time() - total_start_time:.4f}")
print(f"Average elapsed time: {(time.time() - total_start_time)/repeats:.4f}")

Results for 5 seconds clip

repository clip length average elapsed time relative %
SYSTRAN (original) 5 sec 0.1837 sec 100%
mobiusml (fork) 5 sec 0.2924 sec 159%

Note: “relative %” compares inference times, with SYSTRAN as the baseline. In this case, the fork takes 59% more time, which could be translated to it being ~38% slower.

Results for 10 minutes clip

repository clip length average elapsed time relative %
SYSTRAN (original) 10 min 0.8062 sec 100%
mobiusml (fork) 10 min 0.9063 sec 112%

I've reran the script many times and get consistent results.

Screenshot of the process. Left side is the original repo, right side is the fork.

benchmark

OS: Windows 11 GPU: RTX 4070 Python: 3.12

I see. Can you please compare with and report results for the dev branch of faster-whisper? pip install git+https://github.com/SYSTRAN/faster-whisper.git

Jobus0 commented 3 months ago

I see, can you please compare and report the dev branch of faster whisper? pip install git+https://github.com/SYSTRAN/faster-whisper.git?

Tested repos:

SYSTRAN (1.0.2): pip install faster-whisper
SYSTRAN (master): pip install git+https://github.com/SYSTRAN/faster-whisper.git
mobiusml (master): pip install git+https://github.com/mobiusml/faster-whisper.git

Results for 5 seconds clip

repository clip length average elapsed time relative % re-run variance %
SYSTRAN (1.0.2) 5 sec 0.1737 sec 100.0% +/- 1.8%
SYSTRAN (master) 5 sec 0.1733 sec 99.7% +/- 1.8%
mobiusml (master) 5 sec 0.2773 sec 159.6% +/- 1.6%

Note: "re-run variance %" is the variance of the results from re-running the script 5 times, and explains why SYSTRAN (master) is slightly faster (99.7%), and also shows that the +59.6% difference for mobiusml (master) is not random.

Jiltseb commented 3 months ago

I see, can you please compare and report the dev branch of faster whisper? pip install git+https://github.com/SYSTRAN/faster-whisper.git?

Tested repos:

SYSTRAN (1.0.2): pip install faster-whisper
SYSTRAN (master): pip install git+https://github.com/SYSTRAN/faster-whisper.git
mobiusml (master): pip install git+https://github.com/mobiusml/faster-whisper.git

Results for 5 seconds clip

repository clip length average elapsed time relative % re-run variance %
SYSTRAN (1.0.2) 5 sec 0.1737 sec 100.0% +/- 1.8%
SYSTRAN (master) 5 sec 0.1733 sec 99.7% +/- 1.8%
mobiusml (master) 5 sec 0.2773 sec 159.6% +/- 1.6%

Note: "re-run variance %" is the variance of the results from re-running the script 5 times, and explains why SYSTRAN (master) is slightly faster (99.7%), and also shows that the +59.6% difference for mobiusml (master) is not random.

Thanks for pointing this out and confirming the issue.

I could reproduce the issue and found that this is because the garbage collector tries to clear all objects when loading the audio. Setting the resampler to None and then deleting it made sure that the object is properly removed. We can avoid the manual gc.collect() call, as it was causing the delay.

After removing the manual garbage collector, this version works fine with similar run time.

@trungkienbkhn The gc.collect() solution to the memory leak problem seems to be consistent with setting the resampler to None.

trungkienbkhn commented 3 months ago

I see, can you please compare and report the dev branch of faster whisper? pip install git+https://github.com/SYSTRAN/faster-whisper.git?

Tested repos:

SYSTRAN (1.0.2): pip install faster-whisper
SYSTRAN (master): pip install git+https://github.com/SYSTRAN/faster-whisper.git
mobiusml (master): pip install git+https://github.com/mobiusml/faster-whisper.git

Results for 5 seconds clip

repository clip length average elapsed time relative % re-run variance %
SYSTRAN (1.0.2) 5 sec 0.1737 sec 100.0% +/- 1.8%
SYSTRAN (master) 5 sec 0.1733 sec 99.7% +/- 1.8%
mobiusml (master) 5 sec 0.2773 sec 159.6% +/- 1.6%

Note: "re-run variance %" is the variance of the results from re-running the script 5 times, and explains why SYSTRAN (master) is slightly faster (99.7%), and also shows that the +59.6% difference for mobiusml (master) is not random.

Thanks for pointing this out and confirming the issue.

I could reproduce the error and found that this is because the garbage collector tries to clear all objects when loading the audio. Setting the resampler to None and then deleting it made sure that the object is properly removed. We can avoid the manual gc.collect() line as it was causing the delay.

After removing the manual garbage collector, this version works fine with similar run time.

@trungkienbkhn The solution to memory leak problem with gc.collect() seems to be consistent with setting resampler to None.

Hello. I confirm that after removing gc.collect(), mobiusml (master) works fine with a similar runtime to SYSTRAN (original). However, it seems that replacing gc.collect() with resampler = None doesn't solve the memory leak problem. I tried this example again:

Baseline
5350.244352
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872

Decode audio once
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872
5366.55872

After changing with resampler=None
5366.464512
5387.788288
5397.89312
5410.004992
5424.2304
5387.927552
5397.909504
5410.021376
5424.2304
5439.737856
5388.161024
5398.142976
5410.254848
5424.463872
5439.91808
5387.620352
5398.765568
5412.974592
5426.147328
5439.32416
5452.496896
5465.669632
5478.846464
5492.0192
5506.564096
5517.398016
5530.570752
5543.747584
5556.92032
5481.639936
5481.762816
5481.885696
5482.008576
5482.131456
5482.254336
5482.377216
5482.500096
5491.99872
5504.380928
5517.312
Jiltseb commented 3 months ago

I got better values when replacing the garbage collector with setting the resampler to None. With the same settings as "Decode audio once":

1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.349184
1264.304128
1264.304128
1264.304128
1264.304128
1264.304128
1264.304128

If you run decode_audio multiple times (Baseline):

Baseline
1278.783488
1279.455232
1279.496192
1281.31072
1281.31072
1281.425408
1281.437696
1281.437696
1281.437696
1281.437696
1281.437696
1281.441792
1281.441792
1281.441792
1281.441792
1281.441792
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888
1281.445888

I tried the same with SYSTRAN master. With multiple decode_audio calls:

Baseline
762.056704
763.51488
763.51488
763.650048
763.65824
764.60032
764.604416
764.624896
764.694528
764.694528

@trungkienbkhn Can you check SYSTRAN master as well from your end?

trungkienbkhn commented 3 months ago

I switched to a GPU H100 with the large-v3 model and checked out the mobius/master branch; below is my result for 100 runs:

After changing
2306.183168
2314.395648
2322.096128
2306.6624
2314.36288
2322.333696
2330.034176
2338.004992
2306.711552
2314.412032
2322.382848
2330.083328
2338.054144
2306.748416
2314.448896
2322.419712
2330.120192
2338.091008
2345.734144
2346.840064
2346.840064
2346.840064
2346.840064
2346.840064
2347.106304
2361.294848
2369.265664
2376.966144
2384.93696
2392.63744
2400.608256
2408.308736
2416.279552
2424.180736
2431.881216
2439.852032
2447.552512
2455.252992
2347.106304
2347.106304
2347.106304
2347.106304
2347.106304
2361.561088
2369.261568
2377.232384
2384.932864
2392.90368
2400.60416
2408.574976
2416.275456
2424.17664
2431.87712
2439.847936
2447.548416
2455.519232
2455.67488
2455.67488
2455.67488
2455.67488
2455.67488
2455.94112
2455.94112
2455.94112
2455.94112
2455.94112
2456.211456
2456.211456
2456.211456
2456.211456
2456.481792
2456.481792
2456.481792
2456.481792
2463.100928
2462.26944
2462.26944
2462.26944
2462.26944
2462.53568
2462.53568
2462.53568
2462.53568
2462.806016
2462.806016
2462.806016
2462.806016
2462.806016
2463.076352
2463.076352
2463.076352
2463.076352
2463.346688
2463.514624
2463.514624
2463.514624
2463.514624
2463.514624
2463.780864
2463.780864

Decode audio once: 2371.44064

My code logic:

import psutil
import gc
import sys
import faster_whisper

model = faster_whisper.WhisperModel("large-v3", device="cuda")
audio_path = "tests/data/jfk.flac"
process = psutil.Process()

def monitor_memory(audio, n=100):
    for _ in range(n):
        segments, _ = model.transcribe(audio)
        text = "".join(segment.text for segment in segments)
        print(process.memory_info().rss / 1000000)

    print("")
    gc.collect()

print("After changing")
monitor_memory(audio_path)
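
For reference, process.memory_info().rss is reported in bytes, so the division by 1,000,000 prints the readings above in MB.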