collabora / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.
https://collabora.github.io/WhisperSpeech/
MIT License

Slow when using example #143

Closed · sirEgghead closed 3 weeks ago

sirEgghead commented 3 weeks ago

I'm running examples/text_to_audio_playback.py unmodified and it is taking approximately 3 hours per sentence. I'm on a 1 Gbps/1 Gbps connection with a new i7 and 32 GB of RAM. Is something wrong with my setup of the repo?

|██-------------------------------------------------------------------------------| 2.94% [22/748 06:24<3:31:35]

This is an example of the progress bar between sentences.

sirEgghead commented 3 weeks ago

I would assume that it has something to do with downloading the model data, but I would have expected it not to have to re-cache the data on each run. How would I accomplish this? And furthermore, how would I accomplish it on something with a small data store that needs a real-time response, such as a remote rPi? :)
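From what I've gathered so far, the Hugging Face Hub client caches everything it downloads under ~/.cache/huggingface/hub, so a one-off snapshot_download call should warm the cache up front. A minimal sketch, assuming the huggingface_hub package is installed:

```python
# Sketch: pre-fetch the WhisperSpeech model repo into the local
# Hugging Face cache; later runs resolve the same files from disk
# instead of the network. Assumes huggingface_hub is installed.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="collabora/whisperspeech")
print(local_dir)  # a path under ~/.cache/huggingface/hub
```

On a storage-constrained device like an rPi, the cache root can apparently be moved to external storage via the HF_HOME environment variable.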

sirEgghead commented 3 weeks ago

I'm not sure if I'm chasing the wrong rabbit here (I'm new to Hugging Face and friends, sorry). I downloaded a snapshot of whisperspeech/whisperspeech from Hugging Face, but the example code still does the same thing. I changed the model_ref line from collabora to whisperspeech, since that was the repo I actually cloned. No change.

If I can download a copy of the models and get the script to use it, I think I can apply it to my use case without issue, as long as it will not need any further contact with the internet. I'd also like to learn how to cache the converted datasets mentioned in the README. I do like the audio samples that are provided, and I like the way the sample script sounds; it just shows up over 3 hours late to the party.
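From the huggingface_hub docs it looks like the client can be forced to stay offline once everything is cached; a sketch of what I'm planning to try (HF_HUB_OFFLINE and local_files_only are huggingface_hub features, not part of WhisperSpeech itself):

```python
# Sketch: run fully offline once the snapshot is cached.
import os
os.environ["HF_HUB_OFFLINE"] = "1"  # forbid network access by huggingface_hub

from huggingface_hub import snapshot_download

# local_files_only resolves purely from the on-disk cache and raises
# if any file is missing, instead of falling back to the network.
path = snapshot_download(repo_id="collabora/whisperspeech", local_files_only=True)
```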

sirEgghead commented 3 weeks ago

Okay, final update before I leave for the day. I hopped into a Python shell to try everything line by line. I now have both whisperspeech/whisperspeech and collabora/whisperspeech cached.

>>> from huggingface_hub import snapshot_download
>>> snapshot_download(repo_id="whisperspeech/whisperspeech")
Fetching 34 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:00<?, ?it/s]
'C:\\Users\\sirEgghead\\.cache\\huggingface\\hub\\models--whisperspeech--whisperspeech\\snapshots\\180eba1e29acd6b271f90bc956d19f99639afe6e'    
>>> snapshot_download(repo_id="collabora/whisperspeech")     
Fetching 34 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [00:00<?, ?it/s]
'C:\\Users\\sirEgghead\\.cache\\huggingface\\hub\\models--collabora--whisperspeech\\snapshots\\180eba1e29acd6b271f90bc956d19f99639afe6e'

When I type pipe.generate("hello"), it pauses for about 90 seconds and then hits the progress bar I mentioned in my first comment.

 |---------------------------------------------------------------------------------| 0.13% [1/748 00:23<4:48:24]

I run Linux at home, where I have a bit more control outside of the corporate environment, so I can give it a try there as well. I'd really like to find out what this is downloading so that I can hopefully cache it.
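In the meantime I found that huggingface_hub has its own debug logging, which should surface every file it resolves or fetches (this is a huggingface_hub feature, so verify against your installed version):

```python
# Sketch: log everything the Hub client resolves or downloads.
from huggingface_hub import logging as hf_logging

hf_logging.set_verbosity_debug()
# ...then run the pipeline as usual and watch the log for cache hits/misses.
```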

zoq commented 3 weeks ago

What you're seeing is not a slow download but slow model inference. The only reasonable way to accelerate it significantly is to run it on a GPU. Do you have a GPU you can use for inference?
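A quick sanity check with plain PyTorch (nothing WhisperSpeech-specific) to confirm a GPU is visible:

```python
# Check whether PyTorch can see a CUDA GPU for inference.
import torch

if torch.cuda.is_available():
    print("GPU available:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU visible; inference will fall back to the CPU.")
```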

sirEgghead commented 3 weeks ago

I do, just not on the machine I was setting it up on. I'll have to move it to another machine and save audio files for the cases where that's possible. Perhaps I'll fill in the variable content with less realistic TTS. Thank you for your help!
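For anyone with the same plan, a sketch of what I mean by saving audio files up front, based on the Pipeline usage shown in the README (treat generate_to_file as an assumption; method names may differ between WhisperSpeech versions):

```python
# Sketch: pre-render clips on a GPU machine, then copy the .wav files
# to the low-power playback device. Pipeline usage follows the README;
# generate_to_file is assumed and may vary by WhisperSpeech version.
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()  # loads models from the Hugging Face cache

sentences = [
    "Hello, this is a pre-rendered message.",
    "Generated once on the GPU box, played back anywhere.",
]
for i, text in enumerate(sentences):
    pipe.generate_to_file(f"clip_{i:03d}.wav", text)
```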