Vaibhavs10 / insanely-fast-whisper

Apache License 2.0
7.22k stars 514 forks

please correct and/or update the readme comparing other whisper implementations #82

Closed BBC-Esq closed 9 months ago

BBC-Esq commented 9 months ago

Just FYI, I've messaged the folks over at faster-whisper and will do the same for whisper.cpp as well as Jax. I'm contemplating which approach to use in my code so...

Along those lines, I'm putting in this "issue" to ask for revisions to the readme regarding comparing other whisper implementations. I'm hoping for a more apples-to-apples comparison.

For purposes of this issue, when I say "insanely-fast-whisper test" I'm referring to the test stating that it took 5 min 2 sec. And when I refer to "faster-whisper test" I'm referring to the test stating it took 9 min 23 sec. These two seem the most comparable.

Here are my requests regarding this issue:

Overall, I'm impressed and may move to using it... However, on my personal test using an RTX 4090, I specified a batch size of 1 with insanely-fast-whisper using BetterTransformer, and the Sam Altman audio took approximately 10 minutes (using fp16). I also tested faster-whisper using a beam size of 5 (since I still have to research how to pass a beam size of 1), and it took almost exactly the same time. I fail to see how you're getting roughly half the time using insanely-fast-whisper... I haven't had a chance to test the Flash Attention 2 variety yet...

It's important to me, and I assume to others, to have true comparisons before spending hours upon hours revising code. I hope my suggestion comes across alright. I'm truly interested in this technology and love the work that everyone is doing... Thanks!

Vaibhavs10 commented 9 months ago

Hi @BBC-Esq ,

Thanks for your message. This repo was made entirely out of volunteer time outside of my work. Whilst benchmarking sounds like a fun project, I don't have nearly enough time at the moment.

Please feel free to run the benchmarks yourself and open a PR.

I'm going to answer your questions though:

The insanely-fast-whisper test should use large-v2 since the faster-whisper test does.

large-v2 and large-v3 are exactly the same architecturally, so this has no impact on the run-time.

The faster-whisper test specifies a beam size of 1. Was a beam size of 1 used for the insanely-faster-whisper test? If so, please specify. If it didn't, it's misleading and not a true comparison.

It is specified; please look at the benchmarks more closely. We use beam_size=1.

Please state for the faster-whisper test that a batch size of 1 was used, since it doesn't support batching; otherwise people will think that the underlying algorithms are responsible for the speed increase when it's truly the batching functionality, which is "slightly" misleading.

It is the algorithm itself; batching and the stitching of the batches together is the reason why it is faster. Single-file batching is quite complex, and the implementation itself is the key part of making it faster, alongside BetterTransformer and Flash Attention 2.

Even if using batching increases VRAM/RAM usage in a "sub-linear" fashion, please make it clear that there will be an increase in VRAM usage and how much...again, that's the trade off with using insanely-fast-whisper over faster-whisper, and it's due to the batching...

Feel free to run and report the numbers! Should be quite easy to do so with the batch_size params.
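
A minimal sketch of how one might record peak VRAM per batch size, assuming the same Transformers pipeline used elsewhere in this thread (the audio path is a placeholder):

import torch
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition",
                model="openai/whisper-large-v2",
                torch_dtype=torch.float16,
                device="cuda:0")

for batch_size in (1, 8, 16, 24):
    torch.cuda.reset_peak_memory_stats()          # clear the previous peak
    pipe("sam_altman_lex_podcast_367.flac",       # placeholder path
         chunk_length_s=30,
         batch_size=batch_size,
         return_timestamps=False)
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"batch_size={batch_size}: peak VRAM ~{peak_gb:.1f} GB")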

Please run distil-whisper using large-v2 with the same clarifications above so it's a clearer apples-to-apples comparison.

I'm not sure I understand what you mean by this. Distil-whisper is a new model which is compatible with the Transformers implementation.

Please include a comparison of whisper.cpp, especially since they recently made a significant improvement.

Happy for someone else to run this!

Please include a comparison with WhisperX since it uses faster-whisper with its own batching here.

Same as above.

Include a test of insanely-fast-whisper using a beam size of 5 and a batch size of 1, which mimics the default settings of faster-whisper.

I have already answered above, we use beam_size = 1.

BBC-Esq commented 9 months ago

If I do the benchmarking would I then do a pull request to the readme.md, essentially? I'd probably be willing to spend the time.

With that being said, even though it's your hobby, you did decide to put up the numbers that are there so...I'd ask for more apples-to-apples comparison or not comparing to other implementations altogether. Thanks.

Vaibhavs10 commented 9 months ago

With that being said, even though it's your hobby, you did decide to put up the numbers that are there so...I'd ask for more apples-to-apples comparison or not comparing to other implementations altogether. Thanks.

In my opinion the numbers compare Transformers with faster-whisper 1:1 (for the reasons mentioned above). I really am unaware of the whisper.cpp syntax, and for that reason I'm happy for someone else to run that.

You can find the code I used to benchmark here: https://github.com/Vaibhavs10/insanely-fast-whisper/tree/main/notebooks

If I do the benchmarking would I then do a pull request to the readme.md, essentially? I'd probably be willing to spend the time.

Yes! 🤗

BBC-Esq commented 9 months ago

In order to do a pull request, what kind of proof would you need? I don't know how to use Google Colab so others can check my work, only how to test on my own PC. I could provide the example code snippets displaying the parameters used, for example?

Vaibhavs10 commented 9 months ago

A script/ snippet should do!

BBC-Esq commented 9 months ago

I'm running the tests as we speak, but I noticed that you didn't respond to my question about making it clearer that the batching functionality produces the speed-up but also increases VRAM/RAM usage. Is that something you will do, or if not, something I could do in a pull request as well?

Vaibhavs10 commented 9 months ago

Isn't it already clear with the optimisation column?

image

BBC-Esq commented 9 months ago

Not really.

I'll run my tests with beam=1 and batch=1 for both whisper implementations. Then I'll run insanely-fast-whisper using batch=1 through 24 (your recommended setting). BTW, I tested my RTX 4090 with a batch size of 35 and it used approximately 20 GB of VRAM, so there's still some wiggle room even... You'll have a graph of the VRAM/RAM usage based on the batch size chosen. That's what I'm referring to that will help people. VRAM is crucial when people are trying to assess which technologies/models to use, as you know... sorry.

Vaibhavs10 commented 9 months ago

Okay, sure! Maybe we can put your benchmarks in a separate file (maybe a benchmarks.md). I wouldn't want to clog up the README, and I think it is worth giving them a dedicated area in this case.

Happy to look at the code and optimise it further with you.

BTW, make sure to measure the generation time instead of the model initialisation/loading time (since that is not something we measure).
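
A rough sketch of how to time only the transcription call, keeping the model load outside the timed region (placeholder path, same pipeline setup as elsewhere in this thread):

import time
import torch
from transformers import pipeline

# Model load happens here and is deliberately left out of the timing.
pipe = pipeline("automatic-speech-recognition",
                model="openai/whisper-large-v2",
                torch_dtype=torch.float16,
                device="cuda:0")

start = time.perf_counter()
pipe("sam_altman_lex_podcast_367.flac",   # placeholder path
     chunk_length_s=30,
     batch_size=24,
     return_timestamps=False)
print(f"generation time: {time.perf_counter() - start:.1f} s")  # average over a few runs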

If you open a PR with the snippets then I can advise you on what range of parameters to use; off the top of my head (a rough sketch for items 3 and 4 follows the list):

  1. torch.dtype (fp32, fp16)
  2. batch_size
  3. flash attention 2 on compatible GPU
  4. bettertransformers (for older GPU)
  5. chunk_length
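
For items 3 and 4, a rough sketch of how the two attention options can be switched, assuming a recent transformers/optimum install (the exact argument name for Flash Attention 2 has changed across transformers versions):

import torch
from transformers import pipeline

# Option A (older GPUs): BetterTransformer, as used elsewhere in this thread.
pipe_bt = pipeline("automatic-speech-recognition",
                   model="openai/whisper-large-v2",
                   torch_dtype=torch.float16,
                   device="cuda:0")
pipe_bt.model = pipe_bt.model.to_bettertransformer()

# Option B (Ampere or newer, flash-attn installed): Flash Attention 2.
pipe_fa2 = pipeline("automatic-speech-recognition",
                    model="openai/whisper-large-v2",
                    torch_dtype=torch.float16,
                    device="cuda:0",
                    model_kwargs={"attn_implementation": "flash_attention_2"})
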
BBC-Esq commented 9 months ago

I'm using HWiNFO to run my tests and the graphs are created by GenericLogViewer. BTW, I've tested both implementations several times now and noticed with insanely-fast-whisper that CUDA usage is never 100%. Here's a pic:

image

However, when I use faster-whisper, it's always 100%. Seems like there's an optimization issue there... so I'll also include CUDA usage in a message between you and me, not for the updated readme...

Vaibhavs10 commented 9 months ago

(the below should apply to both the faster-whisper and Transformers whisper implementations)

Vaibhavs10 commented 9 months ago

Hmm.. weird! Can you ping me the snippet you are using?

BBC-Esq commented 9 months ago

Yep, makes sense (regarding the separate folder and settings). I just forked the repo. Each test for a particular batch size takes time... multiply that by the 24+ tests I have to do, so it could be a while. Also, it's never good practice to do just one test, but it'll have to do since my time is not unlimited. I'll just put a small caveat about that...

BBC-Esq commented 9 months ago

Hmm.. weird! Can you ping me the snippet you are using?

Better yet, here's the entire script:


import torch
from transformers import pipeline

model_name = "openai/whisper-large-v2"

pipe = pipeline("automatic-speech-recognition",
                model=model_name,
                torch_dtype=torch.float16,
                device="cuda:0")

pipe.model = pipe.model.to_bettertransformer()

# Process the audio file
outputs = pipe(r"[PERSONAL INFO REDACTED]\sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               batch_size=1,
               return_timestamps=False)

Vaibhavs10 commented 9 months ago

Ah yeah! I averaged the scores over 3 runs (you should do that too).

Re: the code snippet, it makes sense that the usage is low, since you're not using the pipeline the way it is supposed to be used, which is with a higher batch_size, so it cannot fully utilise the GPU.

Gotta head out now! Looking forward to the PR. Have a great evening/ day! 🤗

BBC-Esq commented 9 months ago

With a batch size of 1, here is the sample graph for insanely-fast-whisper showing a total CUDA usage average of 51.7%. Average VRAM usage is 34.5% of an RTX 4090 with 24 GB of VRAM, and transcription time is approximately 8 min 40 sec.

image

NOTE: I had to redo this test multiple times because I made a mistake before calculating the numbers, but this updated graph is now accurate.

BBC-Esq commented 9 months ago

Here's the same graph using faster-whisper with a beam size of 1 and, obviously, a batch size of 1 since that's all it supports. Looking at when the model is loaded into memory, you'll notice faster-whisper took approximately 7 min 35 sec (83.9% CUDA and 29.1% VRAM) whereas insanely-fast-whisper took approximately 7 min 50 sec (per the prior graph above)... My critique, therefore, still stands to a certain extent... and I sort of have a new one...

You show faster-whisper performing at 9 min 23 sec on an A100 80 GB GPU (a $6,000+ card), which was created in 2020 and has 6,912 CUDA cores and 80 GB of HBM2e memory. My RTX 4090 has 16,384 CUDA cores, 24 GB of GDDR6X memory, and costs $1,599... TEST ON RELEVANT GPUS. Nobody uses an A100 except large corporations, and nobody uses a T4 except people using Google Colab, and who knows if it's even the same GPU; they can differ with the silicon lottery.

image

Also, I'll work on installing (trying again) Flash Attention 2... and run those tests. But if my results vary wildly, I'd expect you to take down the claims on the main readme.md and replace them with a link to my pull request containing a more thorough analysis. It's not worth my time, with all due respect, to do this extensive testing if the claims are going to remain with only a small link to an ancillary readme.md that a user may or may not find and click on.

Let me know. Regards.

BBC-Esq commented 9 months ago

I want to add that, to your credit, it's no small feat to get the batching of a single file implemented... the folks over at faster-whisper/ctranslate2 have been struggling with that for some time. But it's important to verify claims of being the fastest before posting such claims and lightning bolt icons and such in a readme...

Regards.

BBC-Esq commented 9 months ago

And here's the graph for insanely-fast-whisper using a batch size of 5: approximately 2 min 20 sec, VRAM usage average of 36%, and CUDA average usage of 68.2%:

image

We have a winner! Batching makes a huge difference. However, I noticed that the Altman audio is absolutely pristine, the highest quality I've heard. If you have low-quality audio... people speaking over one another, etc... being able to specify the number of "beams" could possibly be a huge benefit. Insanely-fast-whisper doesn't have that parameter, correct, because of the nature of batching a single file?

BBC-Esq commented 9 months ago

Testing insanely-fast-whisper with a batch size of 2. Approximately 4 min 45 sec runtime, average CUDA usage of 64.5%, and average VRAM usage of 30.8%.

image

BBC-Esq commented 9 months ago

Testing insanely-fast-whisper with a batch size of 3. Approximately 3 min 30 sec, VRAM average 32%, CUDA usage 67.3%:

image

We're starting to notice diminishing returns of using a much larger batch size...

BBC-Esq commented 9 months ago

Testing insanely-fast-whisper with a batch size of 25. Approximately 1 min 3 sec, approximately 52% memory usage, and 68% CUDA usage. Average CUDA usage stays roughly the same (not near 100%) despite the much larger batch size. However, much more VRAM is being used and, to boot, the transcription time is much lower. CUDA is spiking, showing a mismatch between CUDA compute power and the rate at which VRAM is feeding it information:

image

BBC-Esq commented 9 months ago

Batch size of 35, and we start to see a bottleneck (between VRAM and CUDA compute power) as indicated by spikes of low and high usage... Transcription time is only about 50 sec, VRAM average at 70%, and CUDA at 75.6% (but with spikes):

image

BBC-Esq commented 9 months ago

Batch size of 45. 48-second transcription. VRAM at 83.3%. CUDA at 73.2% and spiking more as the mismatch between CUDA power and VRAM's ability to feed it information becomes more pronounced:

image

BBC-Esq commented 9 months ago

insanely-fast-whisper BEATS faster-whisper at float32, beam=1 and batch=1. This is noteworthy, especially because it lost so handily at float16...and I retested this.

faster-whisper completed in approximately 9 min 10 sec using 40.6% VRAM and 87% CUDA usage. insanely-fast-whisper completed in approximately 8 min 20 sec, using 2.7% more VRAM but 14% LESS CUDA usage. This is significant in and of itself.

The first image is for faster-whisper and the second for insanely-fast-whisper:

image image

BBC-Esq commented 9 months ago

FINALLY, for insanely-fast-whisper I used a batch size of 24 at float16... wait for it... the transcription took approximately 1 minute! This is the test most comparable to the one where you state that you got 5 min 2 sec!

image

This is likely due to the larger number of CUDA cores than on the A100 (from 2020) that you tested, notwithstanding that that card has 80 GB of HBM memory. Therefore, you NEED TO TEST ON RELEVANT GPUS.

I'd respectfully request that you re-test and provide comparisons on a GPU relevant to 99.9% of the people who will be using this library... not the 0.1% corporate people who are paying $20,000+ on eBay for an H100 GPU. Also, please compare float16 and float32 with all permutations, including a batch size of 1 for insanely-fast-whisper when comparing it to faster-whisper. This ensures an apples-to-apples comparison. This is your responsibility since you posted the claim, even if it's a hobby and not part of your formal job duties at Hugging Face like you stated.

Also, please include columns in your table specifying that faster-whisper essentially runs at a batch size of 1. It's not clear, and it needs to be specified that this is an inherent limitation of faster-whisper. It also needs to be clear that there's an important need to specify the beam size (e.g., with lower quality audio, not the perfect audio file of the Altman podcast), and this needs to be recognized.

I'll commit to re-testing on my RTX 4090 (including all testing for Flash Attention 2) if you'll make this small commitment and also make my test results clear on the README if they conflict with yours.

Lastly, regards.

BBC-Esq commented 9 months ago

https://www.width.ai/post/what-is-beam-search

BBC-Esq commented 9 months ago

Also, I'd really appreciate it if you could explain how it performs on CPU-only, and whether it supports things like this:

https://opennmt.net/CTranslate2/hardware_support.html?highlight=mkl

I do appreciate your honesty, however, regarding its much slower performance on MPS than when using CUDA...and how batch size can't really be utilized with MPS for whatever reason.

BBC-Esq commented 9 months ago

All tests were done in float16 unless otherwise noted. HWiNFO was used to monitor metrics and GenericLogViewer to create averages. VRAM usage includes a baked-in overhead of 2.9 GB used by the computer monitor and whatever else. Constants were maintained: no other programs running during the tests, not even opening a new browser tab:

image

NOTE: The table had to be updated due to a human error, but this data is now accurate.

iSuslov commented 9 months ago

Wow, interesting journey here. I think everybody will benefit from your work @BBC-Esq, thank you! A noob question from me. Does batch size affect the quality of transcriptions?

BBC-Esq commented 9 months ago

You're welcome. The more I thought about it, the more I became kind of pissed about this developer's attitude. I know his type. He needs to change the readme to have actual thorough testing that's both relevant and not so egotistical... but his type of personality usually doesn't, but we'll see I suppose. His tests are NOT apples-to-apples and he needs to be professionally courteous to others who work hard on their various projects. Give a tip of the hat to a fellow gentleman in a friendly competition; don't dump on the other guy with false and misleading tests comparing another technology.

Also, he lied to me, claiming that this isn't part of his job... it's just a hobby... thus he can't take the 5 minutes it'd take to revise the readme... hence he wants me to spend an entire day of my life testing for him, to aggrandize him... and he MIGHT accept a pull request with my data... who knows... guess I can only pray.

BS.

He's employed by Hugging Face. He works with Sanchit, the guy who did the Jax implementation of Whisper (no qualms with that); his code is part of the Transformers library, owned by his employer... and so on.

Anyways, I didn't mean to get triggered again so...moving on!

TO ANSWER YOUR QUESTION. As I understand it, and keep in mind this is not my profession, it's not so much that a larger batch size LOWERS the quality as it is not having a beam size parameter, which INCREASES the quality. However, batching does contribute, because when multiple batches are processed simultaneously they can't look at the beginning/end of one another, which is necessary to predict the next token/word... thus there's a loss of quality there as well.

Also, in the Sam Altman audio this developer tests with, the audio is pristine, it's perfect...the people are using $1,000 mics 2 inches from their mouths...in a soundproofed studio...and the audio is curated afterwards...I'm exaggerating a little, but you get the point.

You need beam search, as I understand it, for things like poor audio quality... crackling customer service calls... someone recording from their cellphone at a distance... people talking over one another... stuff like that. Here are a few links I found, coincidentally one from Hugging Face too, which is this developer's employer.

https://machinelearningmastery.com/beam-search-decoder-natural-language-processing/

https://huggingface.co/docs/transformers/generation_strategies

HOW MUCH it matters is unknown to me, however, and I'm not an expert. You'd have to test with a bunch of different audio samples to even start to get an idea, but to claim you're comparing apples to apples and then not even discuss beam size (or give credit where credit is due) is another way this developer is being disingenuous. It's like saying "MY CAR IS THE FASTEST ONE OUT THERE EVERYBODY! LOOK HOW SLOW THE OTHER CARS ARE! LOOK AT ME! LOOK AT ME AND MY EGO!" and then failing to mention that your car, while fast, doesn't have brakes like the other cars... which is an important feature.

Also, let's presume the large-v2 model's quality is so high that for 95% of use-cases NOT having a beam size parameter doesn't significantly harm the output. Let's say 2% of common use-cases are negatively affected. If you switch to the medium.en or small models... that might jump to 20%, 50%, and so on, because the baseline quality of the model is much lower.

SPEED IS NOT THE ONLY CONSIDERATION. Don't falsely claim to compare apples-to-apples and omit relevant features of another person's work just to pump your own work. We're all supposedly striving to improve this space and get some personal aggrandizement along the way, some pats on the back, but not to feed one's ego by misrepresenting things... That's my main grief.

By the way, here's a direct comparison just FYI: float16, beam=1, batch=1. Don't pay attention to the average, however; I didn't chop off the ends of the graph. With GenericLogViewer you have to chop off the ends of a graph so it won't use them as part of the average shown... but this is just to give you an idea:

faster-whisper in red, insanely-fast-whisper in green:

image

BBC-Esq commented 9 months ago

By the way, I did actually manually compare the transcriptions from both whisper implementations for quality. I extracted the .json output from insanely-fast-whisper and pasted it into Microsoft Word. I then copied and pasted in the transcription from faster-whisper. Within Word, I then did a redline comparison.

They were the same quality overall. Both had roughly the same number of errors. What was interesting, however, is that they both had their own unique errors regarding certain words. lol.
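
As an automated alternative to the manual Word redline (not what I did above, just a sketch), the two transcripts could also be compared with the jiwer library, treating one of them as the reference purely to get a relative difference score (filenames are placeholders):

# pip install jiwer
import jiwer

with open("faster_whisper_transcript.txt", encoding="utf-8") as f:
    reference = f.read()
with open("insanely_fast_whisper_transcript.txt", encoding="utf-8") as f:
    hypothesis = f.read()

# Word error rate between the two outputs; close to 0.0 means nearly identical text.
print(f"word-level difference: {jiwer.wer(reference, hypothesis):.3%}")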

Anyways, this is with the perfect, unrealistic audio that this developer uses. As you get lower in audio quality and/or use a lower quality whisper model, I'd bet that you start seeing the advantage of having a beam size parameter. Again, this would take a fair amount of testing that I don't have time for, but it's a worthwhile thing to mention... or you can do what this guy does and put up cool pictures of bullet trains... because who doesn't like trains! And lightning bolts! And call your project "insane"! Dude! It's radical dude!

Um, no... be respectful, professional, and a little humble in touting your own work, and recognize others. AND be accurate in your testing.

BBC-Esq commented 9 months ago

Wow, interesting journey here. I think everybody will benefit from your work @BBC-Esq, thank you! A noob question from me. Does batch size affect the quality of transcriptions?

Another cool link you might find interesting, from this developer's employer no less:

https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching

sanchit-gandhi commented 9 months ago

Hey @BBC-Esq! Thanks for the super comprehensive benchmarks! I love the in-depth analysis you've done with the GPU logging to work out what the best set-up is for you. Would you be open to sharing the recipe that you followed, such that other people in the community can re-run your benchmark methodology and determine what the best option is for them? This would be a super valuable contribution IMO, since anyone could tune the hyper-parameters to maximise throughput in their deployment setting.

As open-source developers, our goal is to give the community the relevant tools they need for their use-cases. We then try our best to publish benchmark results that users can interpret to determine the best set-up for them and their deployment configuration. However, it's extremely difficult for us to cover every single permutation.

For example, we've found in the past that switching PyTorch versions on the same hardware can give different results for VRAM and latency. This means the search-space for all possible permutations is huge (all CUDA versions x all pytorch versions x all hardware options)! For the indicative performance results that we provide, we try to give a fair comparison across implementations, providing all the details of our set-up and any caveats with the comparison. This then forms a starting point for users to decide what options to experiment with and some knowledge about what is likely to work well. It's always hard to get 1-to-1 the same results because of how many set-up permutations there are. E.g. if I were to use your best configuration on the same RTX 4090 hardware with the same CUDA version but a different PyTorch version, I'd likely get different results to you. This doesn't invalidate your benchmark. Your benchmark provides the best configuration for your specific set-up. But it's hard to make the benchmark generalise to all configurations. Therefore, we encourage users to experiment for their precise deployment configuration to work out what the best option is for them.

I understand your frustration in not having benchmark details for your specific set-up, but ask you to understand the difficulty in us providing numbers for every possible case. Probably what we could do more of is providing the tools for users to run this hyper-parameter selection themselves, to optimise their throughput. This is why I think your method would be a valuable contribution to this repository and would love to help you add it!

sanchit-gandhi commented 9 months ago

Activating timestamps will improve the transcription time further, since the model will hallucinate less. Also, specifying the language and task seems to help with WER performance:

import torch
from transformers import pipeline

model_name = "openai/whisper-large-v2"

pipe = pipeline("automatic-speech-recognition",
                model=model_name,
                torch_dtype=torch.float16,
                device="cuda:0")

pipe.model = pipe.model.to_bettertransformer()

# Process the audio file
outputs = pipe(r"[PERSONAL INFO REDACTED]\sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               batch_size=45,
               return_timestamps=True,
               generate_kwargs={"language": "en", "task": "transcribe"},
)

For beam search, you can do:

import torch
from transformers import pipeline

model_name = "openai/whisper-large-v2"

pipe = pipeline("automatic-speech-recognition",
                model=model_name,
                torch_dtype=torch.float16,
                device="cuda:0")

pipe.model = pipe.model.to_bettertransformer()

# Process the audio file
outputs = pipe(r"[PERSONAL INFO REDACTED]\sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               batch_size=45,
               return_timestamps=True,
               generate_kwargs={"language": "en", "task": "transcribe", "num_beams": 5},
)

BBC-Esq commented 9 months ago

You and I had been speaking over at the Jax "Issue" place...thanks for popping over here to address this issue.

I will provide my benchmarking approach, but first:

(1) I understand that it's difficult to test all permutations and I don't expect perfection. However, when a person specifically compares it to another technology and, in my personal opinion, does such an excellent job of not even trying to do sufficient testing, it deserves my critique.

(2) To state it differently, this developer's testing was so poor, his attitude so poor, and his blustering so annoying, that it deserved my full attention. Your testing is not perfect, neither is mine... I'm talking about a "range of reasonableness" here that he failed to achieve... and more importantly, he didn't give a (&*^ that he failed to achieve it when I tried to point it out.

(3) Therefore, I understand testing isn't perfect and it's difficult/impossible to test all permutations, but you need to understand that I understand that...so you can accurately understand my point.

With that being said, my testing rubric was as follows:

(1) Here's the script for all tests regarding insanely-fast-whisper:

import torch
from transformers import pipeline

model_name = "openai/whisper-large-v2"

pipe = pipeline("automatic-speech-recognition",
                model=model_name,
                torch_dtype=torch.float16,
                device="cuda:0")

pipe.model = pipe.model.to_bettertransformer()

# Process the audio file
outputs = pipe(r"C:\PATH\Scripts\ChromaDB-Plugin-for-LM-Studio\v2_7 - working\TEST2\sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               batch_size=1,
               return_timestamps=False)

For testing faster-whisper I used a program I created, located at my repo. Specifically, I used the functionality of my program that stems from the transcribe file script located here. However, I had to modify line 60 to include the beam size parameter, since I hadn't been using it (and hence the default beam size of 5 was being used).
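
For reference, a minimal standalone faster-whisper call with beam_size=1 looks roughly like this (a sketch with a placeholder path, not my exact script):

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# beam_size=1 to match the insanely-fast-whisper settings above.
segments, info = model.transcribe("sam_altman_lex_podcast_367.flac",  # placeholder path
                                  beam_size=1)
text = " ".join(segment.text for segment in segments)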

HWiNFO was used to monitor the metrics, which are saved to a CSV file. GenericLogViewer was used to display the graphs.

You have to chop off the ends of the graphs to show only the period after the model is actually loaded into memory, excluding the time between when you start monitoring and when the model is loaded... This gives a correct average.

That's pretty much it.

By the way, I appreciate your comments. However, I'm unwilling to be goaded into "my testing is insufficient" in the face of the testing of insanely-fast-whisper being so pathetic. It's NOW his turn to revise his testing... give actual accurate numbers... give true apples-to-apples comparisons. I've spent 6 hours so far testing... and it's borderline disrespectful to muddy the issue as if it's just "oh, everybody's testing is insufficient... BBC, you could improve your testing... let's look at that, let's focus on that."

No thanks, sir. Make some corrections on insanely-fast-whisper first, and then we'll talk; otherwise, I'll conclude that this egotist is more concerned with being right than with what the truth actually is.

The two things are not equivalent... his testing is pathetic... mine, perhaps flawed, but at least a genuine attempt... from a non-technical person no less. He should be held to a HIGHER STANDARD than me, who hardly knew what Python was 5 months ago.

Hope I'm clear now.

Regards.

BBC-Esq commented 9 months ago

Activating timestamps will improve the transcription time further, since the model will hallucinate less. Also, specifying the language and task seems to help with WER performance:

import torch
from transformers import pipeline

model_name = "openai/whisper-large-v2"

pipe = pipeline("automatic-speech-recognition",
                model=model_name,
                torch_dtype=torch.float16,
                device="cuda:0")

pipe.model = pipe.model.to_bettertransformer()

# Process the audio file
outputs = pipe(r"[PERSONAL INFO REDACTED]\sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               batch_size=45,
               return_timestamps=True,
               generate_kwargs={"language": "en", "task": "transcribe"},
)

For beam search, you can do:

import torch
from transformers import pipeline

model_name = "openai/whisper-large-v2"

pipe = pipeline("automatic-speech-recognition",
                model=model_name,
                torch_dtype=torch.float16,
                device="cuda:0")

pipe.model = pipe.model.to_bettertransformer()

# Process the audio file
outputs = pipe(r"[PERSONAL INFO REDACTED]\sam_altman_lex_podcast_367.flac",
               chunk_length_s=30,
               batch_size=45,
               return_timestamps=True,
               generate_kwargs={"language": "en", "task": "transcribe", "num_beams": 5},
)

Interesting...so you CAN do beams, that's cool. I'll check that out! Thanks.

BBC-Esq commented 9 months ago

I might have forgotten to mention... I used CUDA 11.8 and Python 3.11.6 in my testing, @sanchit-gandhi.

Vaibhavs10 commented 9 months ago

Ayy @sanchit-gandhi - Thanks a ton for taking care of the questions! 🤗

Hey @BBC-Esq - Brilliant work on the analysis. Would you happen to have a markdown version of these benchmarks you showcased here? https://github.com/Vaibhavs10/insanely-fast-whisper/issues/82#issuecomment-1833024914

If you cannot, then I'm happy to convert it myself -> I will add them along with your script there as well. I think the only thing missing is a script to benchmark FA2.

Your comments here and in the Whisper-JAX repo indicate that I should update the README to demystify the magic a bit more and explain what goes on under the hood. I'll update it over the weekend.

Thanks again for your contributions and for spending time stress-testing the implementation.

P.S. I'm sorry if my comments came across as egotistic to you and more so sorry for the terse responses earlier.

(I'm keeping this issue open till I get around to updating the benchmarks)

BBC-Esq commented 9 months ago

Thanks for the compliment and for not escalating the matter. I was a little heated.

With that being said, before I start contributing I'd like to know a few things. First, did you actually develop any of the technology, like the batching? Because what you said earlier was misleading in that regard, not to mention your GitHub page is misleading. As I understand it, you only helped develop the command line functionality, none of the underlying technology like Transformers, Optimum BetterTransformer, the Transformers pipeline, or what have you? In fact, it appears from the credit you give to others on the GitHub page that you didn't even build the CLI entirely either.

So basically, you work at Hugging Face, your employer builds these great technologies, then in your "off time" you make a website promoting them as "insanely" and "blazingly" fast to bring repute to Hugging Face. Informally, implicitly, it's expected that employees there have repos with Hugging Face technologies and pump them up, which is apparent from you and your organization's ability to get 2.5k stars for "your" repo.

Thus, before I contribute I want to know what Hugging Face's mission statement is. I'm not interested in contributing my time to a greedy, selfish, for-profit corporation. If it's a non-profit and, despite reasonable flaws, has an altruistic mission that I agree with overall, I'd be happy to contribute. But just so we're clear, you did not develop any of the underlying technology used to actually speed up Whisper transcriptions; that was whoever worked on the underlying technologies. I did notice in some of your code that you explicitly attributed WhisperX for the diarization and such... so I'm not trying to be unfair here. But since we got off on the wrong foot, I want to clear the air a little. Once that's done I'd be happy to contribute, if you'll commit to changing the so-called test results, which you already stated you would.

BBC-Esq commented 9 months ago

I believe I found the answer online. Apparently, Hugging Face, your employer, is a French-American company based in New York, and it's for-profit:

SEE HERE

SEE ALSO HERE

I'm not interested in providing free labor for Hugging Face since it's a for-profit company. I sometimes even hesitate for non-profits. I'm not interested in helping you do your job!

Your repository, while it feigns to be your own pet project, in reality has the purpose of promoting the things that Hugging Face develops. While there's nothing inherently wrong with that, your portraying it as completely separate and having nothing to do with your work at Hugging Face is, simply put, false. The same applies to @sanchit-gandhi's repository for Jax, although at least he doesn't claim that it has nothing to do with his work at Hugging Face...

In short, if either you or @sanchit-gandhi worked at Microsoft/Apple, your repositories exclusively showcasing Hugging Face technologies WOULD NOT EXIST... So let's be honest about the genesis of them, regardless of whether it's explicitly required under the job description Hugging Face hired you with...

If you want to pretend that it's an altruistic thing you're doing that has nothing to do with Huggingface's for-profit motives, perhaps to induce free labor from others, that's fine, but I will not be contributing to your code base nor that of Huggingface's. That is, unless Huggingface wants to cross-promote and throw some shoutouts to my project as well!

With that being said, I am willing to help you correct your testing procedures to the extent I'm able, provided you were genuine in your willingness to do so. I believe it's important not to "call out" other technologies like ctranslate2/faster-whisper without robust tests. I would ask that you include my two scripts like you proposed and create the markdown yourself, like you proposed...

If you'll do that, what I suggest is... I can extract the logic from the script used to test faster-whisper into a standalone script, that way people don't have to use my program to use it... and then submit a pull request so people can run it more easily. But I'd expect the readme to be changed with the updated testing I did before that happens.

Eventually, I'd also like to submit a pull request with a script I'd create that allows a user to test insanely-fast-whisper all at once. They'd run the script, for example, and it would run multiple tests and save the results to a .csv. I had to do each test manually... which was very tedious.

Let me know, dude!

P.S. I'd also like to create a similar script for faster-whisper as well. Basically, both scripts would test all the beam sizes, batch sizes, etc., and save the results from each test; a rough sketch of the idea follows.
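
A rough sketch of what the insanely-fast-whisper side of such an all-in-one script might look like, with placeholder paths, timing each batch size and writing the results to a CSV:

import csv
import time
import torch
from transformers import pipeline

AUDIO = "sam_altman_lex_podcast_367.flac"   # placeholder path

pipe = pipeline("automatic-speech-recognition",
                model="openai/whisper-large-v2",
                torch_dtype=torch.float16,
                device="cuda:0")

with open("benchmark_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["batch_size", "seconds", "peak_vram_gb"])
    for batch_size in (1, 2, 4, 8, 16, 24):
        torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        pipe(AUDIO, chunk_length_s=30, batch_size=batch_size, return_timestamps=False)
        writer.writerow([batch_size,
                         round(time.perf_counter() - start, 1),
                         round(torch.cuda.max_memory_allocated() / 1e9, 1)])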

Vaibhavs10 commented 9 months ago

Hi @BBC-Esq - I don't know what I have done to warrant this kind of response from you. You are free to fork the project (or use any other implementation like faster-whisper, whisperX and so on) and do as you please.

I'm closing this issue since your responses have greatly affected me.

P.S. Thanks for absolutely ruining my evening with your comments!

Cheers!

BBC-Esq commented 9 months ago

I apologize that it affected you to that extent. Will you be planning to amend the tests so I can submit some additional testing like I suggested? I would appreciate that, as a man of your word. And I would like to contribute as long as it's done ethically and professionally.

flexchar commented 8 months ago

Hi @BBC-Esq, I hope this message finds you well.

First of all, I wanted to thank you for taking the initiative to compare benchmarks across a variety of projects. I am an independent indie developer in the AI space, and I sometimes find it challenging to compare apples to apples.

However, I felt that your tone of voice, accusations, and overall approach are just NOT acceptable. Open source is rarely well paid, if paid at all. It is done out of one's heart and a desire to give back to the community. Whether the creators of this project, the Jax one, or any other work at HuggingFace is the most irrelevant fact, just as it doesn't matter by whom I am employed. Please do think this through in your future journeys.

Furthermore, I would like to shed light on A100/H100/TPUs. You're absolutely right that the main consumers of such hardware are the big corporations. However, NOT all. I, as a self-funded developer, have used the A100, H100, and TPU v3 (Google did not grant access to v4, unfortunately). I was very excited and very grateful to have had a chance to transcribe several thousand hours of content from my favorite creators to create a searchable database for my personal use. The total cost was less than 50 USD.

TPU v3 costs around 2 USD/hour on Google Cloud, and over the span of 4 hours I transcribed several hundred hours of content. Cards such as the A100 are easily available on RunPod/TensorDock/Modal Labs, which also cost anywhere from 1.5-5 USD/hour. Same story. The H100, however, is pricey, less available, and gives highly diminishing returns on Whisper.

The other day I was helping out someone on the Discord group who, for his own personal use, just wanted to spin up an A100 to serve Mistral in the ocean of existing APIs and providers. It doesn't matter how much a card costs; even an RTX 4090 is more than my salary, and even if it were a third of that, I'd not consider buying one at all. Many of the practical cases that implementations like this help with are developers hacking around, learning, and trying to glue together some fun stuff. The benefit of renting one for a few hours is all we need, not a box of H100s running under the table.

To the contrary, in my experience, a corporation like the ones you refer to will not even bother finding out which framework is the fastest. They will just pay a heck of a lot of money directly to OpenAI or Azure and use their API endpoint, which is not fast at all. 🤣

Finally, back to the creators of these libraries: it takes a lot of time to build something like this. Sometimes it's a dream and vision that happens at 5 AM, and sometimes it's an idea sparked by an on-going project at their work. However, it is always provided as-is. You, me, he, she, it paid zero for it. Therefore WE HAVE NO RIGHT to demand anything. Yet the authors have displayed an extremely professional approach to communicating and attempting to collaborate.

As an idea, why not make an awesome-whisper-benchmarks repository that we would all love to bookmark? :)

BBC-Esq commented 8 months ago

I appreciate your message and I agree my tone could have been better. But I think that if you review the messages between myself and this developer, you'll find that he never followed through in even attempting to get more accurate comparisons. There's a new "issue" where he discusses possible ways to benchmark it with @sanchit-gandhi, who's an excellent programmer BTW, but no follow-through. I think what triggered me was the ostentatious claims he makes without having competent benchmarks, then saying he will revise them, and then trying to get me to revise them for him with the "hope" that he'll accept my pull request.

Hugging Face is a large company; this developer is not a humble solo-practicing programmer. He's presumably paid a fair amount to do his job, and it's his job to have competent benchmarks and, more importantly, not to delegate the benchmarks to the general public to do for him.

I should have communicated that in a more friendly manner, but the facade that he's a humble programmer who's not paid for his work, and that he's not responsible for shoddy benchmarks or for following through with what he says, is just not accurate.

I am a humble solo-practicing programmer/hobbyist; he is not. And neither is @sanchit-gandhi. They all work at Hugging Face, which I like a lot, but it's a huge for-profit company that pays these guys...

flexchar commented 8 months ago

Yeah Blair, the trap is that you believe Hugging Face is involved here.

"P.P.S. This project originally started as a way to showcase benchmarks for Transformers, but has since evolved into a lightweight CLI for people to use. This is purely community driven. We add whatever community seems to have a strong demand for!" [from README.md]

I double-checked the README, and this is a prototype project provided by a bright person for the community as-is. Therefore any claims may be made, and they're still allowed to make them. That's the First Amendment, I believe.

If HuggingFace or another company were out there paying us developers to do side projects for the community without taking any credit, then I'd wish to know!!

Hope you make an independent repository with all the benchmarks; it'd be super cool to see them in one place. Happy New Year!

BBC-Esq commented 8 months ago

Oh yeah, I love the First Amendment. I used to speak on panels at cons about the First Amendment. It totally allows him to put up shoddy, poorly-implemented, lazy, and, most importantly, inaccurate tests for the public to consume, just like he has. I suppose the First Amendment also covers disgruntled comments complaining about those tests as well.

What I'm referring to is not merely being satisfied with what the bare minimum of the First Amendment protects...but rather trying to do a little more than the bare minimum.

Purfview commented 8 months ago

ROFL, this thread made my day: the insanely demanding customer of the insane... 😆 OP was acting angry when I refused to add his nonsense to my repo.

Btw, there is a delete button. Here I would have just issued a banana prize.

talipturkmen commented 2 months ago

This is hilarious, I wish I had seen this thread before 😄. @BBC-Esq is right though. Many of the boosts over other libraries claimed by the author are not fair comparisons and you will never achieve them. In my tests the performance is really on par with whisper-x/faster-whisper. I appreciate the author's contribution to open source, but all this misleading/shallow advertising is annoying.

flexchar commented 2 months ago

I had forgotten about this thread (got a notification), but I happen to have recently run some transcripts and written down my own observations. Perhaps someone will find them useful.

image

(Worth mentioning: chunks of 60 sec were wrong and resulted in data loss; 25 sec is the correct setting for actual output.)

Benchmarks are notoriously hard because of the variety of variables. While OP is right, it is extremely rude to blame free, as-is software. Instead, it would be wonderful if we could build an open-source collection of scripts with which we could evaluate various frameworks.

I ran my own transcription on Modal.com infrastructure. The script is from https://huggingface.co/distil-whisper/distil-large-v3, with Flash Attention 2.
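
For context, a sketch approximating that model card's setup with Flash Attention 2 (assumes the flash-attn package is installed and an Ampere-or-newer GPU; the audio path is a placeholder, not the exact Modal script):

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",   # requires flash-attn
)
model.to("cuda:0")

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
    device="cuda:0",
)

# 25-second chunks, per the observation above about 60-second chunks losing data.
result = pipe("audio_file.mp3",                # placeholder path
              chunk_length_s=25,
              batch_size=24,
              return_timestamps=True)
print(result["text"])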