SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2

This is very cool, but push to even higher gpu usage? #66

Closed · junchen6072 closed this 1 year ago

junchen6072 commented 1 year ago

First, thank you for this awesome work; it really does improve transcription time a lot! But I'm wondering if it's possible to push GPU usage even higher so it can be even faster. From my testing on a few audio files between 2 and 15 minutes long, GPU usage jumps between 70-90% and occasionally drops quite low. I tried instantiating WhisperModel with higher cpu_threads and num_workers, but that doesn't seem to help. I guess there is some non-trivial blocking CPU computation, so the GPU is not fully utilized. I also tried using a thread pool in Python to submit jobs for the audio files; it helps a bit (the peak GPU usage goes higher), but on average it didn't increase much.

Any ideas? Thanks!

guillaumekln commented 1 year ago

> Tried to use a thread pool in python to submit jobs for audios,

That's a good approach. Did you also increase num_workers when doing that? Normally this should overlap kernel executions on the GPU and increase the usage.
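
A minimal sketch of that combination, assuming placeholder model name, pool size, and file paths:

```python
from concurrent.futures import ThreadPoolExecutor

from faster_whisper import WhisperModel

# num_workers > 1 lets concurrent transcribe() calls overlap on the GPU.
# "large-v2" and the worker counts here are placeholder choices.
model = WhisperModel("large-v2", device="cuda", compute_type="float16",
                     num_workers=2)

def transcribe(path):
    segments, info = model.transcribe(path)
    # segments is a generator: iterating it is what actually runs the model.
    return [segment.text for segment in segments]

audio_files = ["a.mp3", "b.mp3", "c.mp3"]  # placeholder paths
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(transcribe, audio_files))
```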

junchen6072 commented 1 year ago

> > Tried to use a thread pool in python to submit jobs for audios,
>
> That's a good approach. Did you also increase num_workers when doing that? Normally this should overlap kernel executions on the GPU and increase the usage.

Yes I did. I think the bottleneck may be more in the Python code; we block waiting on self.model.generate.

junchen6072 commented 1 year ago

Another observation: using 2 threads in the pool seems to work better than using more.

guillaumekln commented 1 year ago

Are you using word_timestamps=True?

junchen6072 commented 1 year ago

Yes. Is that slow?

guillaumekln commented 1 year ago

Yes, it's slower than the default transcription mode (see #45).

And some operations do run on the CPU in this mode, which explains the lower GPU usage. There could be further improvements in the future.
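
For reference, a minimal sketch of the two modes; the model name and audio path are placeholders:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Default mode: segment-level timestamps only.
segments, _ = model.transcribe("audio.mp3")  # placeholder path

# word_timestamps=True adds per-word alignment, which involves extra
# CPU-side work and is therefore slower.
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print(f"[{word.start:.2f} -> {word.end:.2f}] {word.word}")
```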

junchen6072 commented 1 year ago

I see, thanks! Is the CPU part mostly in faster-whisper, or in CTranslate2?

guillaumekln commented 1 year ago

It's probably a combination of both, but I don't know exactly.

Taking the OpenAI implementation as a reference, the following lines are run on CPU in CTranslate2:

https://github.com/openai/whisper/blob/v20230314/whisper/timing.py#L208-L214

These steps could benefit from a GPU implementation, but I would need some time to come up with an efficient one. My first attempt had worse performance than the CPU version!
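
Assuming those lines cover the attention-based word alignment (a DTW pass over the cross-attention weights), here is a toy illustration of why a GPU port is non-trivial: the dynamic-programming recurrence makes each cell depend on its top, left, and diagonal neighbours, so the work is inherently sequential. This is a simplified sketch of the dependency pattern, not the actual implementation:

```python
import numpy as np

def dtw_cost(distance: np.ndarray) -> np.ndarray:
    """Dynamic-programming cost over an (N, M) distance matrix.

    Each cell depends on its top/left/diagonal neighbours, so the loop
    cannot be trivially parallelized -- one reason a naive GPU port can
    end up slower than the CPU version.
    """
    n, m = distance.shape
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = distance[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return cost[1:, 1:]
```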

guillaumekln commented 1 year ago

Higher GPU usage would probably come from some form of batch execution. This is discussed in #59.