collabora / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.
https://collabora.github.io/WhisperSpeech/
MIT License

Any way to make inference faster? #33

Closed fakerybakery closed 10 months ago

jpc commented 11 months ago

Hey, there are a few ways to speed up inference that we are looking into. Right now I have some smaller PyTorch optimizations that I will commit; we can also take advantage of the tricks from this article: https://pytorch.org/blog/accelerating-generative-ai-2/
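
For reference, the core trick from that post is wrapping the autoregressive decoder in `torch.compile`. Here is a minimal sketch with a stand-in model (illustrative only, not the actual WhisperSpeech S2A/T2S code):

```python
import torch

# Stand-in autoregressive decoder layer; not the real WhisperSpeech model.
model = torch.nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
model = model.cuda().eval()

# mode="reduce-overhead" enables CUDA graphs, which cuts per-step Python and
# kernel-launch overhead in token-by-token generation loops.
compiled = torch.compile(model, mode="reduce-overhead")

with torch.inference_mode():
    tgt = torch.randn(1, 1, 512, device="cuda")      # one decoding step
    memory = torch.randn(1, 16, 512, device="cuda")  # encoder context
    compiled(tgt, memory)  # first call triggers compilation (slow)
    compiled(tgt, memory)  # later calls reuse the compiled graph (fast)
```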

There is also the possibility of using the whisper.cpp/llama.cpp or CTranslate2 (faster-whisper) implementations to get better performance on both CPUs and GPUs.
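
To illustrate what the CTranslate2 path looks like, here is plain Whisper transcription through faster-whisper. WhisperSpeech does not ship this integration, so treat it as a sketch of the backend only; wiring WhisperSpeech's own models into CTranslate2 would be the actual porting work:

```python
from faster_whisper import WhisperModel

# CTranslate2-backed Whisper inference. int8 quantization gives a large
# speedup on CPU; float16 is typical on GPU.
model = WhisperModel("tiny", device="cpu", compute_type="int8")

segments, info = model.transcribe("sample.wav")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```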

That said, right now we are focused on improving quality and adding support for more languages, so if someone wants to look into performance, we'd love a pull request. :)

We can also offer commercial support contracts to help people optimize and integrate the model into their products.

jpc commented 10 months ago

I made quite a bit of progress on this. It's not committed yet because the code is still a bit hacky, but with torch.compile I can resynthesize the voice 17x faster than real time for the tiny S2A model on a 4090 (it will be a bit slower end-to-end).

The HQ S2A (small) model runs 3.4x faster than real time.
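
(For context, "Nx faster than real time" means N seconds of audio generated per wall-clock second. A quick, generic way to measure it, with `generate` standing in for any synthesis callable:)

```python
import time

def real_time_factor(generate, text, audio_seconds):
    """Seconds of audio produced per wall-clock second.
    `generate` is a hypothetical synthesis callable; `audio_seconds` is the
    duration of the audio it produces for `text`."""
    t0 = time.time()
    generate(text)
    return audio_seconds / (time.time() - t0)

# e.g. a return value of 17.0 means generation runs 17x faster than real time
```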

fakerybakery commented 10 months ago

Nice, that's amazing! Thanks so much!

jpc commented 10 months ago

OK, all the improvements have been merged, and end-to-end generation is now more than 12x faster than real time on a consumer 4090.

torch.compile needs some warmup, so the first inference is a lot slower (around 30 s), but it should be fast afterwards.
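
Concretely, that just means paying the compile cost on a throwaway first call. A sketch assuming the `Pipeline` API from the project README (`Pipeline()` with default models and `Pipeline.generate` are the only names assumed here):

```python
import time
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()  # default models; downloads weights on first use

# First call: torch.compile traces and compiles the models (the ~30 s warmup).
t0 = time.time()
pipe.generate("Warmup sentence to trigger compilation.")
print(f"first call:  {time.time() - t0:.1f}s")

# Later calls reuse the compiled graphs and should run well above real time.
t0 = time.time()
pipe.generate("This sentence should synthesize much faster.")
print(f"second call: {time.time() - t0:.1f}s")
```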

fakerybakery commented 10 months ago

Thanks!

fakerybakery commented 10 months ago

Hi, I'm trying to run WhisperSpeech on a Colab instance using the example Colab provided. Unfortunately, it still seems to take longer than real time to generate. Is there anything I need to do to enable these improvements? Thanks!