SubtitleEdit / subtitleedit

the subtitle editor :)
http://www.nikse.dk/SubtitleEdit/Help
GNU General Public License v3.0

faster-whisper integration? #6816

Closed ras0k closed 1 year ago

ras0k commented 1 year ago

I did not read the whole thread about Whisper GPU, but could we avoid a lot of the VRAM and speed problems by switching to faster-whisper?

Purfview commented 1 year ago

How does faster-whisper's speed [on GPU] compare to whisper-ConstMe?

Purfview commented 1 year ago

I asked about whisper-ConstMe, not "openai/whisper". Btw, I find the large model's timestamps way less accurate than medium's, while the transcription is no better.

whisper-ConstMe can use any model.

ras0k commented 1 year ago

Full benchmarks:

faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models.

This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

Benchmark: For reference, here's the time and memory usage required to transcribe 13 minutes of audio using different implementations:

Tested versions: openai/whisper@6dea21fd, whisper.cpp@3b010f9, faster-whisper@cce6b53e

Large-v2 model on GPU

| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
| --- | --- | --- | --- | --- | --- |
| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
| faster-whisper | int8 | 5 | 59s | 3091MB | 3117MB |

Executed with CUDA 11.7.1 on an NVIDIA Tesla V100S.

Small model on CPU

| Implementation | Precision | Beam size | Time | Max. memory |
| --- | --- | --- | --- | --- |
| openai/whisper | fp32 | 5 | 10m31s | 3101MB |
| whisper.cpp | fp32 | 5 | 17m42s | 1581MB |
| whisper.cpp | fp16 | 5 | 12m39s | 873MB |
| faster-whisper | fp32 | 5 | 2m44s | 1675MB |
| faster-whisper | int8 | 5 | 2m04s | 995MB |

Executed with 8 threads on an Intel(R) Xeon(R) Gold 6226R.
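For context, here is a minimal sketch of the faster-whisper Python API (as shown in its README at the time); the audio path is a placeholder:

```python
from faster_whisper import WhisperModel

# int8 quantization is what brings large-v2 down to ~3 GB of GPU memory
# (see the table above); fp16 is the faster, larger option.
model = WhisperModel("large-v2", device="cuda", compute_type="int8")

segments, info = model.transcribe("audio.wav", beam_size=5)
print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```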

Purfview commented 1 year ago

How much faster is ConstMe compared to openai/whisper?

If it ran for me, I wouldn't ask. Why are you posting these pointless posts?

ras0k commented 1 year ago

Why are you posting these pointless posts?

Because this would help the software? Why would we not integrate faster-whisper? Why are you so adversarial to contributions on an open-source project?

Purfview commented 1 year ago

Because this would help the software? Why would we not integrate faster-whisper? Why are you so adversarial to contributions on an open-source project?

I asked a question; you answered with some irrelevant posts. I'm adversarial to nonsense...

ras0k commented 1 year ago

I asked a question; you answered with some irrelevant posts. I'm adversarial to nonsense...

I think const-me/whisper is a Windows port of the whisper.cpp implementation, which in turn is a C++ port of OpenAI's Whisper automatic speech recognition (ASR) model.

Purfview commented 1 year ago

I think const-me/whisper is a Windows port of the whisper.cpp implementation, which in turn is a C++ port of OpenAI's Whisper automatic speech recognition (ASR) model.

Are you a GPT-3 bot tuned on 4chan?

ras0k commented 1 year ago

How does faster-whisper's speed [on GPU] compare to whisper-ConstMe?

at least 5x faster on CPU and 10x faster if you use GPU

Purfview commented 1 year ago

at least 5x faster on CPU and 10x faster if you use GPU

So, a few minutes ago you didn't know what whisper-ConstMe is, and now you are posting "benchmarks" out of your ass...

How was my post irrelevant? It's a port of whisper.cpp, and the benchmark tests whisper.cpp.

If you are not a bot, then clearly one with some mental deficiency.

Purfview commented 1 year ago

The benchmarks are from the repo, and I am autistic.

I see... Take ten deep breaths and no need to type more posts. I'm unsubscribing from this thread.

rsmith02ct commented 1 year ago

I'm not sure why this post devolved into insults instead of mutual understanding.

whisper-ConstMe is a GPU-enabled implementation of Whisper. Does faster-whisper provide any benefits in terms of speed, accuracy, or GPU RAM usage compared to it? ConstMe is already integrated into SubtitleEdit, which is why the question is relevant.

ras0k commented 1 year ago

I'm not sure why this post devolved into insults instead of mutual understanding.

whisper-ConstMe is a GPU-enabled implementation of Whisper. Does faster-whisper provide any benefits in terms of speed, accuracy, or GPU RAM usage compared to it? ConstMe is already integrated into SubtitleEdit, which is why the question is relevant.

Yes, it runs about 5x faster with the optimizations they provide; that is what the benchmarks I posted show, and they are on the faster-whisper GitHub. You can also use whisper-ctranslate2 directly.

ras0k commented 1 year ago

or GPU RAM usage compared to it?

We also save a lot of VRAM, which means we can run large-v2 on 4 GB GPUs.

ras0k commented 1 year ago

Btw, I find the large model's timestamps way less accurate than medium's, while the transcription is no better.

Maybe for English medium is fine, but for multilingual use large-v2 is a lot more useful than having to download a specific model for each language.

rsmith02ct commented 1 year ago

Const-Me also has huge speed boosts over CPU-only implementations. I'll assume ConstMe and Faster Whisper are comparable unless someone reports data to the contrary.

ras0k commented 1 year ago

Const-Me also has huge speed boosts over CPU-only implementations. I'll assume ConstMe and Faster Whisper are comparable unless someone reports data to the contrary.

Can you provide a benchmark that shows this?

ras0k commented 1 year ago

I am talking about 5x speed on GPU vs. GPU, not CPU vs. GPU.

rsmith02ct commented 1 year ago

Const-Me is whisper.cpp which is a CPU-only implementation, no? whisper.cpp is in the benchmark

Const-me is a GPU implementation of CPP that is much faster.

David M's experience here: https://www.youtube.com/watch?v=RRF5AS6JVtI&list=PLG8jlFKr-RtdO_r3YAp9cncEEqJRkIltB&index=83

Using CPP on my GTX 1050/Intel i7-7700HQ laptop, a 2:35 file took 13 seconds (tiny.en model). ConstMe took about 5 seconds.

With the base model and the same 2:35 file:
- Const-me: 8 seconds
- CPP: 22 seconds
- OpenAI (Python, no GPU): 52 seconds

Larger models may show more difference, but my GPU only has 4 GB of RAM.

I don't see a need to test Faster Whisper unless it gets embedded with SubtitleEdit. Feel free to test it and report results.

ras0k commented 1 year ago

I understand and respect your desire not to ponder this, but for me a tiny-model benchmark is useless. I am only talking about comparing whisper (in GPU mode) and faster-whisper (also GPU mode) on large-v2, because I believe there will be a use case for a lot of users. I will do my best to provide the benchmarks you asked for soon.

ras0k commented 1 year ago

Larger models may show more difference, but my GPU only has 4 GB of RAM.

You can already try large-v2 on faster-whisper with your GPU; is that not incentive enough to want it?

rsmith02ct commented 1 year ago

You are the one who wants this implementation of Whisper to be included yet you haven't provided any test data to show how it is better than the current options.

I care about workflow, not absolute speed. If it's not in SubtitleEdit or integrated into my NLE I don't have any reason to use it as it will slow me down.

How much VRAM do you need for the large-v2 model in Faster Whisper? That may limit its interest to users.


ras0k commented 1 year ago

How much VRAM do you need for the large-v2 model in Faster Whisper? That may limit its interest to users.

3.09 GB

https://huggingface.co/guillaumekln/faster-whisper-large-v2
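For illustration, fetching that converted model could look like this with the huggingface_hub library (assuming `snapshot_download` is available; the target directory is a placeholder):

```python
from huggingface_hub import snapshot_download

# Downloads the CTranslate2-converted large-v2 model (~3 GB on disk).
# local_dir is a placeholder; point it wherever your tool expects models.
snapshot_download(
    repo_id="guillaumekln/faster-whisper-large-v2",
    local_dir="models/faster-whisper-large-v2",
)
```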

ras0k commented 1 year ago

you haven't provided any test data to show how it is better than the current options.

faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models.

This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

I posted the full benchmarks up there in my first reply, but I will also try SubtitleEdit and post my results soon.

rsmith02ct commented 1 year ago

Is that the needed VRAM or the model size? I can't run the medium model (1.5 GB) on my 4 GB GPU FWIW.


ras0k commented 1 year ago

Is that the needed VRAM or the model size? I can't run the medium model (1.5 GB) on my 4 GB GPU FWIW.

Oh sorry, yes: model size. I am not sure about VRAM use; I will try right now. But if you take the time to read the benchmarks I posted, they say 4.8 GB or 3.1 GB depending on fp16 or int8.

Large-v2 model on GPU

| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
| --- | --- | --- | --- | --- | --- |
| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
| faster-whisper | int8 | 5 | 59s | 3091MB | 3117MB |

Executed with CUDA 11.7.1 on an NVIDIA Tesla V100S.

ras0k commented 1 year ago

Is that the needed VRAM or the model size? I can't run the medium model (1.5 GB) on my 4 GB GPU FWIW.

Just FYI, right now I am testing the large model on ConstMe and it uses about 4.2 GB, so medium should run on your 4 GB GPU; medium shows about 2.3 GB usage max.

ras0k commented 1 year ago

My GPU is a 2060 with 6 GB.

ras0k commented 1 year ago

For English, medium.en is fine, but for French not even large works, so I really need large-v2 for its multilingual capabilities.

ras0k commented 1 year ago

Do you want me to compare the speed of ConstMe vs. Faster-Whisper on large, just for benchmarking purposes?

rsmith02ct commented 1 year ago

Please do compare them, that would be interesting.

I restarted SubtitleEdit and tried Const-me again with medium, and this time it did work on my 4 GB GPU. Last time it just "completed" but didn't produce a subtitle file. I haven't had time to troubleshoot, and I mainly do transcriptions on my more powerful desktop machine anyway.


Purfview commented 1 year ago

I don't see a need to test Faster Whisper unless it gets embedded with SubtitleEdit.

@rsmith02ct there is no problem with embedding it in SubtitleEdit; it's the same as OpenAI, you just need to download the models manually, as described there: Faster-Whisper.

ras0k commented 1 year ago

I don't see a need to test Faster Whisper unless it gets embedded with SubtitleEdit.

@rsmith02ct there is no problem with embedding it in SubtitleEdit; it's the same as OpenAI, you just need to download the models manually, as described there: Faster-Whisper.

Did you actually try it? From that method I gather we can use the CPU, but I'm not sure if it's using the GPU.

Purfview commented 1 year ago

Did you actually try it? From that method I gather we can use the CPU, but I'm not sure if it's using the GPU.

I can't test anything with a GPU. Faster-Whisper should work the same as OpenAI: if CUDA is detected it should use the GPU, otherwise the CPU. Check your GPU usage, and check whether OpenAI runs on the GPU.
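A small sketch of that fallback logic, assuming the faster-whisper and CTranslate2 Python APIs (the model name is illustrative):

```python
import ctranslate2
from faster_whisper import WhisperModel

# CTranslate2 (the engine under faster-whisper) reports visible CUDA devices;
# fall back to CPU when none are found.
device = "cuda" if ctranslate2.get_cuda_device_count() > 0 else "cpu"
print("Using device:", device)

model = WhisperModel("large-v2", device=device)
```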

niksedk commented 1 year ago

OK, I've added a test with Whisper CTranslate2: https://github.com/jordimas/whisper-ctranslate2

Included in latest beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.12/SubtitleEditBeta.zip


Note: Models will be downloaded the first time a model is used!

ras0k commented 1 year ago

OK, I've added a test with Whisper CTranslate2: https://github.com/jordimas/whisper-ctranslate2

Included in latest beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.12/SubtitleEditBeta.zip


Note: Models will be downloaded the first time a model is used!

🐐

ras0k commented 1 year ago

So I located the whisper-ctranslate2.exe that I had from my previous tests, but I'm not sure it's picking up the right model; I am looking for https://huggingface.co/guillaumekln/faster-whisper-large-v2, but all I see is large, and it doesn't seem to be downloading anything.

rsmith02ct commented 1 year ago

For a compiled version, is Whisper-Faster-v2023.03.31-b77 what I should be using? [Edit: changed the name of the exe to what SE is looking for, but it doesn't seem to work. I need a compiled binary from somewhere.]

rsmith02ct commented 1 year ago

With the base model and the same 2:35 file:
- Const-me: 8 seconds
- CPP: 22 seconds
- FasterWhisper (CPU): 33 seconds [note: this is an older build, not from https://github.com/jordimas/whisper-ctranslate2, as I can't compile it]
- OpenAI (Python, no GPU): 52 seconds

rsmith02ct commented 1 year ago

OK, I've added a test with Whisper CTranslate2: https://github.com/jordimas/whisper-ctranslate2

Included in latest beta: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.6.12/SubtitleEditBeta.zip


Note: Models will be downloaded the first time a model is used!

Could the download and transcribe steps be separate? Right now it says "transcribing audio to text", but in reality it's attempting to download a new model. 15 minutes later it's still downloading but still says transcribing, which is confusing.

Purfview commented 1 year ago

For a compiled version is Whisper-Faster-v2023.03.31-b77 what I should be using?

You can use it with either the "OpenAI" or the "CTranslate2" option. For "CTranslate2", just rename whisper.exe to whisper-ctranslate2.exe. And for "CTranslate2" you don't need to have real or fake OpenAI models.

Notes: We are talking about standalone builds: whisper-standalone-win. By default, Whisper-Faster-v2023.03.31-b77 looks for models in the same folder, and they should be in folders named like this: _faster-whisper_medium. It won't auto-download models; get them from https://huggingface.co/guillaumekln.
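For illustration, the layout described above would look roughly like this (file list per the Hugging Face repos; treat the exact files as an assumption):

```
Whisper-Faster\
  whisper-ctranslate2.exe        <- renamed from whisper.exe
  _faster-whisper_medium\
    model.bin
    config.json
    tokenizer.json
    vocabulary.txt
```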

rsmith02ct commented 1 year ago

Thanks so much! This is the standalone build I downloaded: https://github.com/Purfview/whisper-standalone-win/releases/tag/v2023.03.31-b77-faster

SubtitleEdit thinks the models are here: C:\Users\Roger\.cache\whisper, but actually they should be where you say, e.g. in _faster-whisper_tiny.en, and not named like the .py ones but just model.bin plus the other files. Otherwise it gives an error.

and after all that...

2:35 file; base model:
- Const-me: 8 seconds
- Whisper-faster: 16 seconds
- CPP: 22 seconds
- OpenAI (Python): 52 seconds

Will test on my desktop, where CUDA support should work, unlike on this laptop.

Purfview commented 1 year ago

Const-me: 8 seconds; Whisper-faster: 16 seconds

So on CPU it's even faster than Const-me, if you didn't subtract the ~12s startup delay.

rsmith02ct commented 1 year ago

I just had a chance to set it up on my desktop as well. It's an Intel i5-13600K with an RTX 2080 Super GPU, and CUDA works with SubtitleEdit.

Same file (2:35) with base models:

- Const-me: 2.5s
- CTranslate2: 4s
- CPP: 8s
- OpenAI: 11s

For this sample Ctranslate2 had higher quality output recognizing difficult proper names even with the base model.

I also tried the large model:
- CTranslate2: 17s
- Const-me: 19s
- CPP: 1m47s

CPP seems to use more cores than it used to; all 14 were in use at times.


rsmith02ct commented 1 year ago

I tried it on a Japanese file but got an error with Faster Whisper.

From the log:

Date: 04/13/2023 12:56:27 SE: 3.6.12.60 - Microsoft Windows NT 10.0.22621.0 - 64-bit
Message: Calling whisper (CTranslate2) with: C:\Users\rsmit\Dropbox\transfer settings\Whisper-Faster\Whisper-Faster\whisper-ctranslate2.exe --language ja --model "large" "D:\Temp\fe4f2032-7730-4ce3-95b2-e5e59434827b.wav"

Traceback (most recent call last):
File "D:\whisper-fast\__main__.py", line 406, in <module>
File "D:\whisper-fast\__main__.py", line 399, in cli
File "encodings\cp1252.py", line 19, in encode
UnicodeEncodeError: 'charmap' codec can't encode characters in position 26-45: character maps to <undefined>
[21096] Failed to execute script '__main__' due to unhandled exception!

Calling whisper CTranslate2 done in 00:00:09.0137460 Loading result from STDOUT

OpenAI Whisper also doesn't work right with Japanese (Const-me and CPP run without error, though Const-me's timings are quite messed up, leaving CPP as the best option for Japanese).

Date: 04/13/2023 13:01:15 SE: 3.6.12.60 - Microsoft Windows NT 10.0.22621.0 - 64-bit
Message: Calling whisper (OpenAI) with: C:\Users\rsmit\Dropbox\transfer settings\Whisper-OpenAI\whisper.exe --language ja --model "medium" "D:\Temp\42463c7d-fbb9-415a-a1c7-f1f7b3aed181.wav"

Traceback (most recent call last):
File "D:\whisper\__main__.py", line 4, in <module>
File "whisper\transcribe.py", line 314, in cli
File "whisper\transcribe.py", line 209, in transcribe
File "whisper\transcribe.py", line 170, in add_segment
File "encodings\cp1252.py", line 19, in encode
UnicodeEncodeError: 'charmap' codec can't encode characters in position 26-45: character maps to <undefined>
[19592] Failed to execute script '__main__' due to unhandled exception!

Calling whisper OpenAI done in 00:00:09.0315877 Loading result from STDOUT


Purfview commented 1 year ago

I tried it on a Japanese file but got an error with Faster Whisper. UnicodeEncodeError: 'charmap' codec can't encode characters in position

Thanks for the report. I've used a custom "quick-hack" command-line interface for it, as faster-whisper doesn't come with a CLI. Maybe today I'll compile it with a proper CLI. Could you share a short sample of that Japanese audio for tests?
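For reference, the 'charmap' failure is the classic Windows cp1252 console issue: the CLI prints Japanese text to STDOUT, and cp1252 cannot encode it. A minimal sketch of the usual fix inside a Python CLI (assuming Python 3.7+; where exactly it would go in the quick-hack CLI is hypothetical):

```python
import sys

# On Windows the console often defaults to cp1252, which cannot encode
# Japanese. Reconfiguring stdout/stderr to UTF-8 (Python 3.7+) avoids the
# UnicodeEncodeError shown in the logs above.
if sys.stdout.encoding and sys.stdout.encoding.lower() != "utf-8":
    sys.stdout.reconfigure(encoding="utf-8")
    sys.stderr.reconfigure(encoding="utf-8")
```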

guillaumekln commented 1 year ago

Hello,

For this sample Ctranslate2 had higher quality output recognizing difficult proper names even with the base model.

I suppose each implementation is running with its default parameters. Note that they have a different performance/quality trade-off by default. For example whisper-ctranslate2 (faster-whisper) is using a beam size of 5 by default (higher quality, slower) while whisper.cpp and Const-me/Whisper are using a beam size of 1 by default (lower quality, faster).

This should be considered when comparing the transcription time. Ideally they should all use the same parameters.
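To make such comparisons like-for-like, the beam size can be pinned explicitly; a minimal sketch with the faster-whisper API (model and audio names are placeholders):

```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cuda")

# beam_size=1 is greedy decoding, the whisper.cpp / Const-me default;
# faster-whisper's own default is beam_size=5 (slower, higher quality).
segments, _ = model.transcribe("audio.wav", beam_size=1)
```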

rsmith02ct commented 1 year ago

Hi Purfview, for Japanese I think you can just use any audio and tell it that it's Japanese. If you want to use the file I have been testing on (to also assess quality), I can provide it, as it's an old public video: link. Some quality issues are non-existent prefecture names (the worst are nonsensical; second-worst is 三重県, which is not what the speaker is saying, correct is 宮城県), and "kaki" should be 牡蠣 (oyster), not the fruit 柿.

Very interesting on performance/quality. Is that something SubtitleEdit could let us set with a slider?

ras0k commented 1 year ago

Hello,

For this sample Ctranslate2 had higher quality output recognizing difficult proper names even with the base model.

I suppose each implementation is running with its default parameters. Note that they have a different performance/quality trade-off by default. For example whisper-ctranslate2 (faster-whisper) is using a beam size of 5 by default (higher quality, slower) while whisper.cpp and Const-me/Whisper are using a beam size of 1 by default (lower quality, faster).

This should be considered when comparing the transcription time. Ideally they should all use the same parameters.

the man, the myth, the legend.

Purfview commented 1 year ago

@rsmith02ct @niksedk

Same error with OpenAI; it's SubtitleEdit's issue. I think, instead of "Loading result from STDOUT", it should load the .srt file, same as with CPP.