SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

major slowdown with batching commit - cpu only #917

Open ooobo opened 1 month ago

ooobo commented 1 month ago

Hey there, looking for advice on how to debug. Trialling the latest commit (#856) with my existing script, I see a roughly 8x slowdown compared to the previous commit. I'm not using the new BatchedInferencePipeline, just the original transcribe() generator with the same parameters, same audio, and same CPU (CPU only, no GPU). I load the model once, then transcribe 30s chunks passed as a numpy ndarray, as audio comes in from a livestream.

Any pointers on how to further pinpoint the slowdown to help debug?
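
For context, a minimal sketch of the workflow described above (assumed from the description, not the actual script): one WhisperModel instance is loaded once and reused to transcribe 30s float32 chunks arriving from the livestream as numpy arrays.

```python
import numpy as np
from faster_whisper import WhisperModel

# model is loaded once and reused for every chunk
model = WhisperModel("small", device="cpu", compute_type="int8_float32")


def transcribe_chunk(chunk: np.ndarray) -> str:
    # transcribe() accepts a 16 kHz mono float32 numpy array directly
    segments, _info = model.transcribe(chunk, beam_size=5)
    # segments is a lazy generator; iterating it runs the actual decoding
    return " ".join(segment.text for segment in segments)
```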

ooobo commented 1 month ago

Cribbing jobus0's test script, I ran some tests.

| repository | compute_type | clip length | average elapsed time | relative % |
|---|---|---|---|---|
| without #856 | int8_float32 | 30s | 2.5980s | 100% |
| without #856 | int8 | 30s | 2.6320s | 101% |
| with #856 | int8_float32 | 30s | 3.5081s | 135% |
| with #856 | int8 | 30s | 3.5254s | 136% |

So not as great a slowdown as I first found, but still considerable. I only have int8 and float32 available on this server, but I see the same sort of slowdown with other compute_types on a MacBook.

I wonder if this is CPU related? Can anyone else try using CPU only and let me know?

test script:


```python
import faster_whisper
import time

model = faster_whisper.WhisperModel(
    "small",
    device="cpu",
    compute_type="int8_float32")

# warm up
segments, info = model.transcribe("audio.wav", beam_size=5)

total_start_time = time.time()

repeats = 10
for i in range(repeats):
    start_time = time.time()
    # note: segments is a lazy generator and is not consumed here,
    # so the full decoding loop is not included in the timing
    segments, info = model.transcribe("audio.wav", beam_size=5)
    print(f"Elapsed time: {time.time() - start_time:.4f}")

print()
print(f"Total elapsed time: {time.time() - total_start_time:.4f}")
print(f"Average elapsed time: {(time.time() - total_start_time)/repeats:.4f}")
```
ooobo commented 1 month ago

Expanding the test to also consume the segment generator and print each segment, the slowdown is much greater.

| repository | compute_type | clip length | average elapsed time | relative % |
|---|---|---|---|---|
| without #856 | int8_float32 | 30s | 11.2693s | 100% |
| with #856 | int8_float32 | 30s | 70.4650s | 625% |

test script:

```python
import faster_whisper
import time

model = faster_whisper.WhisperModel(
    "small",
    device="cpu",
    compute_type="int8_float32")

# warm up
segments, info = model.transcribe("audio.wav", beam_size=5)

total_start_time = time.time()

repeats = 10
for i in range(repeats):
    start_time = time.time()
    segments, info = model.transcribe("audio.wav", beam_size=5)
    # consuming the generator forces the actual decoding
    for segment in segments:
        print(segment.text)
    print(f"Elapsed time: {time.time() - start_time:.4f}")

print()
print(f"Total elapsed time: {time.time() - total_start_time:.4f}")
print(f"Average elapsed time: {(time.time() - total_start_time)/repeats:.4f}")
```
x86Gr commented 1 month ago

Have you tried with 10 minutes of audio and with the medium and large models?

benniekiss commented 1 month ago

The CPU slowdown may be caused by the switch to using torch instead of onnxruntime. I may be wrong in that assumption, but I know there's a related pyannote issue regarding slowdowns after switching to torch which might be worth looking into.

ooobo commented 1 month ago

> The CPU slowdown may be caused by the switch to using torch instead of onnxruntime. I may be wrong in that assumption, but I know there's a related pyannote issue regarding slowdowns after switching to torch which might be worth looking into.

Mmm, I did wonder that - torch seems to be the main change for the non-batched transcribe() function, right? I remember having issues with pyannote running slowly on CPU with torch tensors, but I can't remember if we got to a solution there.

> Have you tried with 10 minutes of audio and with the medium and large models?

Good idea. I haven't run 10-minute versions of medium/large, as the new commit version looked likely to take hours. It's a pretty consistent 625+% slower to print segments.

| repository | compute_type | model | clip length | average elapsed time | relative % |
|---|---|---|---|---|---|
| without #856 | int8_float32 | small | 10min | 194.8097s | 100% |
| with #856 | int8_float32 | small | 10min | 1218.948s | 626% |
| without #856 | int8_float32 | medium | 30s | 29.0115s | 100% |
| with #856 | int8_float32 | medium | 30s | 183.4392s | 632% |
| without #856 | int8_float32 | large | 30s | 47.8324s | 100% |
| with #856 | int8_float32 | large | 30s | 329.3989s | 688% |
Jiltseb commented 1 month ago

I see that the problem occurs if you only have a few CPU cores (0-3). If you have many more (16 or more), then it is indeed faster with the new commit. I tested with the file tests/data/jfk.flac from the repository. My suggestion would be to roll back to a previous release for the time being.

One possible direction would be to modify setup.py with extras for the batched version.

x86Gr commented 1 month ago

> I see that the problem occurs if you only have a very few CPU cores (0-8). If you have more (16 or more), then it is indeed faster with the new commit.

Cores or threads?

ooobo commented 1 month ago

> I see that the problem occurs if you only have a few CPU cores (0-3). If you have many more (16 or more), then it is indeed faster with the new commit. I tested with the file tests/data/jfk.flac from the repository. My suggestion would be to roll back to a previous release for the time being.
>
> One possible direction would be to modify setup.py with extras for the batched version.

Thanks for taking a look - this would make sense. I tested on a MacBook with 8 cores, an ARM server with 4 cores, and an Intel server with 4 cores, and all were slower with the new commit. I'm targeting CPU-only hardware with low resources, so maybe that's a less common case.

Do you think it's torch replacing numpy for the feature extraction and audio processing? While a bit clunky, I note that transformers' Whisper makes torch optional and falls back to numpy in its feature extractor if torch is missing - could a similar thing work here?
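
As an illustration only, a rough sketch of that optional-torch pattern (hypothetical code, not transformers' or faster-whisper's actual feature extractor; the function name and parameters are made up): use torch for the STFT when it is installed, otherwise fall back to plain numpy.

```python
import numpy as np

try:
    import torch
    _HAS_TORCH = True
except ImportError:
    _HAS_TORCH = False


def stft_magnitude(audio: np.ndarray, n_fft: int = 400, hop: int = 160) -> np.ndarray:
    """Magnitude spectrogram, via torch if available, otherwise numpy."""
    if _HAS_TORCH:
        window = torch.hann_window(n_fft)
        spec = torch.stft(
            torch.from_numpy(audio.astype(np.float32)),
            n_fft,
            hop,
            window=window,
            return_complex=True,
        )
        return spec.abs().numpy()
    # numpy fallback: frame the signal and take a real FFT per frame
    # (simplified - no centre padding, so not numerically identical to torch.stft)
    window = np.hanning(n_fft)
    frames = [
        audio[i : i + n_fft] * window
        for i in range(0, len(audio) - n_fft + 1, hop)
    ]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1)).T
```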

aligokalppeker commented 1 month ago

> I see that the problem occurs if you only have a few CPU cores (0-3). If you have many more (16 or more), then it is indeed faster with the new commit. I tested with the file tests/data/jfk.flac from the repository. My suggestion would be to roll back to a previous release for the time being.
>
> One possible direction would be to modify setup.py with extras for the batched version.

This is really a shitty answer and your PR really messed up whisper. It is a really valid scenario to run whisper model instances on 3-4 cores; your bullshit PR makes it slower instead of faster.

Jiltseb commented 1 month ago

> I see that the problem occurs if you only have a few CPU cores (0-3). If you have many more (16 or more), then it is indeed faster with the new commit. I tested with the file tests/data/jfk.flac from the repository. My suggestion would be to roll back to a previous release for the time being. One possible direction would be to modify setup.py with extras for the batched version.
>
> This is really a shitty answer and your PR really messed up whisper. It is a really valid scenario to run whisper model instances on 3-4 cores; your bullshit PR makes it slower instead of faster.

I never claimed it was invalid to have 3-4 cores. The faster-whisper PR conducted several evaluations that confirmed a significant speed-up in general. Batching was a highly requested feature for the faster-whisper project. You are encouraged to open a PR with alternative solutions. Targeting specific contributors is disrespectful and against our community guidelines.

aligokalppeker commented 1 month ago

> I see that the problem occurs if you only have a few CPU cores (0-3). If you have many more (16 or more), then it is indeed faster with the new commit. I tested with the file tests/data/jfk.flac from the repository. My suggestion would be to roll back to a previous release for the time being. One possible direction would be to modify setup.py with extras for the batched version.
>
> This is really a shitty answer and your PR really messed up whisper. It is a really valid scenario to run whisper model instances on 3-4 cores; your bullshit PR makes it slower instead of faster.
>
> I never claimed it was invalid to have 3-4 cores. The faster-whisper PR conducted several evaluations that confirmed a significant speed-up in general. Batching was a highly requested feature for the faster-whisper project. You are encouraged to open a PR with alternative solutions. Targeting specific contributors is disrespectful and against our community guidelines.

How can you claim this is the feature the faster-whisper community requested? And is it viable to solve the batching problem like this? Batching can currently be handled by alternative solutions.

What are the general scenarios you mention that speed up faster-whisper? Are you sure you are covering all the scenarios in use? How can you speak for them from your limited context?

You need to revert the commit so that any alternative solution can be built on top of it. The PR and the things done in it are bullshit.

Taking faster-whisper's flexibility and agility and turning it into something bloated is not what the faster-whisper community should want.

Let's revert the code and make every effort to keep this PR separate and agile. Please note that making an effort does not make you the owner of the repo.

x86Gr commented 1 month ago

This toxic behaviour is not getting you anywhere. If you have few cores and that PR slowed you down, discuss it with the developers involved. Calm down and realize you share this planet with other people who are not offending you at full throttle. You're free to switch to adult behaviour and discuss politely, or to open a fork.

aligokalppeker commented 1 month ago

> This toxic behaviour is not getting you anywhere. If you have few cores and that PR slowed you down, discuss it with the developers involved. Calm down and realize you share this planet with other people who are not offending you at full throttle. You're free to switch to adult behaviour and discuss politely, or to open a fork.

No, this is not toxic behaviour; this is a strong response to avoid the destruction of the project, as the issue is not only the performance of faster-whisper on low-core systems. Faster-whisper is not just a batch-oriented, offline project, and this PR and these developers are transforming it into one without knowing or respecting the project's origins and community usage.

x86Gr commented 1 month ago

> This toxic behaviour is not getting you anywhere. If you have few cores and that PR slowed you down, discuss it with the developers involved. Calm down and realize you share this planet with other people who are not offending you at full throttle. You're free to switch to adult behaviour and discuss politely, or to open a fork.
>
> No, this is not toxic behaviour; this is a strong response to avoid the destruction of the project, as the issue is not only the performance of faster-whisper on low-core systems. Faster-whisper is not just a batch-oriented, offline project, and this PR and these developers are transforming it into one without knowing or respecting the project's origins and community usage.

Any PR, like any release, can have any number of bugs. The authors of the PR are looking into it; meanwhile, you can use 1.0.2 or whatever version works well for your system. Bugs and regressions happen every day, everywhere on GitHub, and the wise response is to be helpful to the authors, not to offend them. You want faster-whisper to work well on 1-4 cores? Provide feedback, test things, be helpful.

aligokalppeker commented 1 month ago

> This toxic behaviour is not getting you anywhere. If you have few cores and that PR slowed you down, discuss it with the developers involved. Calm down and realize you share this planet with other people who are not offending you at full throttle. You're free to switch to adult behaviour and discuss politely, or to open a fork.
>
> No, this is not toxic behaviour; this is a strong response to avoid the destruction of the project, as the issue is not only the performance of faster-whisper on low-core systems. Faster-whisper is not just a batch-oriented, offline project, and this PR and these developers are transforming it into one without knowing or respecting the project's origins and community usage.
>
> Any PR, like any release, can have any number of bugs. The authors of the PR are looking into it; meanwhile, you can use 1.0.2 or whatever version works well for your system. Bugs and regressions happen every day, everywhere on GitHub, and the wise response is to be helpful to the authors, not to offend them. You want faster-whisper to work well on 1-4 cores? Provide feedback, test things, be helpful.

You still do not get it, and you are thinking too simply by reducing this to testing and bug fixing. That is understandable, as you have not seen my overall analysis of the PR:

https://github.com/SYSTRAN/faster-whisper/issues/937

Given the design and implementation of the PR, no bug fix will correct these issues.

ozancaglayan commented 1 month ago

The slowness may be caused by setting the Whisper CPU threads to 16 instead of the default of 0, which would use only the available number of CPU cores.
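
Assuming that is the cause, a minimal workaround sketch is to pass cpu_threads explicitly when constructing the model, which hands the choice back to CTranslate2 instead of the new hard-coded 16:

```python
from faster_whisper import WhisperModel

# cpu_threads=0 lets CTranslate2 pick its own thread count (or honour OMP_NUM_THREADS)
# instead of the new default of 16
model = WhisperModel("small", device="cpu", compute_type="int8_float32", cpu_threads=0)
```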

ozancaglayan commented 1 month ago

The best thing to do is to revert this change of setting cpu_threads to 16, as there shouldn't be such assumptions in a codebase used by a large user base:

https://github.com/SYSTRAN/faster-whisper/commit/eb8390233c160a8232abf88f9b949eb5cbc48df8#r144691402

x86Gr commented 1 month ago

Using Zen 4 processors, I didn't notice a major speed difference between using the number of cores and the number of threads.

ozancaglayan commented 1 month ago

Okay, I see that CTranslate2 has an unorthodox way of setting the threads. Looking at their codebase, they seem to set a default value of 4 (or fewer, depending on the actual number of CPU cores) if the OMP_NUM_THREADS environment variable is not set and cpu_threads is set to 0. Other tools like torch and numpy will try to use all the CPU threads available on the machine unless you set MKL_NUM_THREADS = OMP_NUM_THREADS = N in your environment.

But it's unclear whether they limit the number of threads if it's manually passed as above, e.g. 16.
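
As a sketch of that environment-variable pinning (illustrative only; the right value depends on the machine):

```python
import os

# Cap the OpenMP/MKL thread pools before numpy/torch are imported so they do not
# grab every logical CPU; per the note above, CTranslate2 also reads OMP_NUM_THREADS
# when cpu_threads=0. "4" is only an example value.
os.environ.setdefault("OMP_NUM_THREADS", "4")
os.environ.setdefault("MKL_NUM_THREADS", "4")

import numpy as np   # noqa: E402  (deliberately imported after the env vars)
import torch         # noqa: E402
```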

Doing a test now on a machine with 4 physical cores and 8 threads, on an audio file of 30 seconds, with VAD disabled, verifying core usage with htop:

```python
from faster_whisper import WhisperModel

# x is the 30-second audio file used for the test; %timeit is an IPython magic
model = WhisperModel('small', 'cpu', cpu_threads=0)
list(model.transcribe(x)[0])  # warmup
%timeit list(model.transcribe(x, vad_filter=False)[0])
```

Results:

| CPU_THREADS | Time |
|---|---|
| 0 (effectively set to 4 by CT2) | 4.4 s |
| 4 | 4.4 s |
| 1 | 14.4 s |
| 8 (virtual threads) | 6.5 s |
| 16 | 8.15 s |
| multiprocessing.cpu_count() // 2 (effectively 4 on this machine) | 4.4 s |

Conclusion:

So yes, this substantially slows down transcription on machines that do not have 16 threads. The usual rule of thumb here is to let the toolkit decide the inter-thread parallelism, but unfortunately CTranslate2 does not seem to do that, relying instead on a sane default of 4. I therefore recommend using the multiprocessing.cpu_count() // 2 logic as the default in faster-whisper, which should give good speed for everybody out there. ML pipelines usually never benefit from virtual threads (aka hyper-threading); in this case it is actually quite a bit slower than using the number of physical cores.

If you all agree, I can do a PR to set the default this way.
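
A rough sketch of what that default could look like (a hypothetical helper, not an existing faster-whisper function):

```python
import multiprocessing


def default_cpu_threads() -> int:
    # Proposed default from the discussion above: roughly the number of physical
    # cores (approximated as half the logical CPU count, since hyper-threading
    # rarely helps ML inference), and never fewer than 1.
    return max(multiprocessing.cpu_count() // 2, 1)


# e.g. WhisperModel("small", device="cpu", cpu_threads=default_cpu_threads())
```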

ooobo commented 1 month ago

Thanks @ozancaglayan, I think you've nailed this particular issue - setting cpu_threads, or making that change, puts the post-batching version pretty much on par in my tests. I haven't had time to test other changes, but that's a good start.

Also I don't cosign the unhinged, rude comments above, very strange.

ozancaglayan commented 2 weeks ago

I created a PR for this now.