please correct and/or update the readme comparing other whisper implementations

BBC-Esq commented 11 months ago

Just FYI, I've messaged the folks over at faster-whisper and will do the same for whisper.cpp as well as Jax. I'm contemplating which approach to use in my code so...

Along those lines, I'm putting in this "issue" to ask for revisions to the readme regarding comparing other whisper implementations. I'm hoping for a more apples-to-apples comparison.

For purposes of this issue, when I say "insanely-fast-whisper test" I'm referring to the test stating that it took 5 min 2 sec. And when I refer to "faster-whisper test" I'm referring to the test stating it took 9 min 23 sec. These two seem the most comparable.

Here are my requests regarding this issue:

The insanely-fast-whisper test should use large-v2 since the faster-whisper test does.
The faster-whisper test specifies a beam size of 1. Was a beam size of 1 used for the insanely-fast-whisper test? If so, please specify. If not, the comparison is misleading and not a true one.
Please state a batch size of 1 for the faster-whisper test, since it doesn't support batching. Otherwise people will think that the underlying algorithms are responsible for the speed increase when it's truly the batching functionality, which is "slightly" misleading.
Even if using batching increases VRAM/RAM usage in a "sub-linear" fashion, please make it clear that there will be an increase in VRAM usage and how much...again, that's the trade off with using insanely-fast-whisper over faster-whisper, and it's due to the batching...
Please run the distil-whisper test using large-v2 with the same clarifications above so it's a clearer apples-to-apples comparison.
Please include a comparison of whisper.cpp, especially since they recently made a significant improvement.
Please include a comparison with WhisperX since it uses faster-whisper with its own batching.
Include a test of insanely-fast-whisper using a beam size of 5 and a batch size of 1, which mimics the default settings of faster-whisper.
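
To make the requests above concrete, here is a minimal sketch of how the settings for two benchmark runs could be recorded and checked for mismatches before their timings are compared. The `RunSettings` class, the `mismatches` helper, and the example values are all hypothetical and for illustration only; they are not taken from the README or from either library.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunSettings:
    """Settings that should match, or at least be disclosed, across runs."""
    implementation: str
    model: str
    beam_size: int
    batch_size: int
    dtype: str


def mismatches(a: RunSettings, b: RunSettings) -> dict:
    """Return every setting that differs, ignoring the implementation name."""
    da, db = asdict(a), asdict(b)
    return {
        key: (da[key], db[key])
        for key in da
        if key != "implementation" and da[key] != db[key]
    }


# Hypothetical values for illustration only; not the README's actual settings.
run_a = RunSettings("insanely-fast-whisper", "large-v3",
                    beam_size=1, batch_size=24, dtype="fp16")
run_b = RunSettings("faster-whisper", "large-v2",
                    beam_size=1, batch_size=1, dtype="fp16")

# Any key printed here is a setting the benchmark table would need to
# equalize or state explicitly next to the timing.
print(mismatches(run_a, run_b))
```

An empty result would mean the two runs differ only in the implementation, which is the apples-to-apples case requested here.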

Overall, I'm impressed and may move to using it...However, on my personal test using an RTX 4090, I specified a batch size of 1 using insanely-fast-whisper with bettertransformer, and the Sam Altman audio took approximately 10 minutes (using fp16). I also tested faster-whisper using a beam size of 5 (since I have to research how to include a beam size of 1 parameter), and it took almost the exact same time. I fail to see how you're getting roughly half the time using insanely-fast-whisper...I haven't had a chance to test the Flash Attention 2 variety yet...
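
For anyone wanting to reproduce this kind of test, the following is a rough sketch of how beam size and batch size can be set explicitly in both libraries. It assumes `faster-whisper`, `transformers`, and `torch` are installed and a CUDA GPU is available. The function names, the `time_call` helper, and the `audio.mp3` path are hypothetical, and this is not necessarily how the README numbers were produced. Note that each function also loads the model, so the measured time includes model loading unless that is separated out.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def run_faster_whisper(audio_path, beam_size=1):
    # Imported lazily so this file loads without the library installed.
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v2", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, beam_size=beam_size)
    # segments is a generator: transcription only runs as it is consumed.
    return " ".join(segment.text.strip() for segment in segments)


def run_transformers(audio_path, num_beams=5, batch_size=1):
    # Imported lazily so this file loads without the libraries installed.
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v2",
        torch_dtype=torch.float16,
        device="cuda:0",
    )
    out = pipe(
        audio_path,
        chunk_length_s=30,
        batch_size=batch_size,                   # 1 disables batching
        generate_kwargs={"num_beams": num_beams},  # beam size for decoding
    )
    return out["text"]


# Example usage (hypothetical file path; requires a GPU and the libraries):
# text, seconds = time_call(run_faster_whisper, "audio.mp3", beam_size=1)
# text, seconds = time_call(run_transformers, "audio.mp3", num_beams=1, batch_size=1)
```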

It's important to me, and I'm assuming others, before spending hours upon hours revising code, to have true comparisons. Hope my suggestion comes across alright. I'm truly interested in this technology and love the work that everyone is doing...Thanks!

Purfview commented 4 months ago

From the benchmarks on the main page it's obvious where it's faster, why it's faster and where it wouldn't be faster.

Blame yourself if you fall victim to your imagination.

talipturkmen commented 4 months ago

Hahaha chill man! Why is everybody so aggressive in this thread? 🤣 I just pointed out that the claims of OP are right.

Vaibhavs10 / insanely-fast-whisper

please correct and/or update the readme comparing other whisper implementations #82