Closed f1yn closed 2 years ago
I've been spending some time today looking into the source code for synthesizing speech, and it's become apparent that the models are in fact using tensors when computing results, and that when CUDA can't be enabled, it falls back onto CPU rendering.
And while I haven't yet identified where it does any pre-compute for the CPU-rendering pathway, based on the numbers alone it seems that a single execution will scale out to all available logical processors when computing.
A workaround for my use case would probably be to use containers that have limited CPU resources allocated to them. If anyone has the time, though, to answer some or all of my questions, that would be appreciated.
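For reference, a quick check like the one below (assuming PyTorch is the backend, which is what the tensor code I was reading suggests) shows whether the CUDA pathway is available and how large PyTorch's CPU thread pool is by default, which would explain the full-CPU scaling I'm seeing:

```python
# Minimal sketch: confirm whether CUDA is available and how many CPU threads
# PyTorch will use when it falls back to CPU inference.
# Assumption: the library runs on PyTorch, as the tensor code suggests.
import torch

print("CUDA available:", torch.cuda.is_available())
# The intra-op pool is typically sized from the available processors, so on a
# CPU-only box a single synthesis run can fan out across every core.
print("Intra-op threads:", torch.get_num_threads())
```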
This is not a bug. I'm moving it to the Discussions.
Describe the bug
I wrote a small batch script that takes a small paragraph of text and runs it through permutations of the different models to produce speech, and the CPU flooring happens whenever more than one model is being executed (I've tried variations of 2, 3, 4, and 5).
This isn't a typical use case for sure, but I was under the impression after reading the docs that the compute-heavy part of this process was the creation of models, not the execution of existing ones. I eventually want to run TTS within a container on embedded hardware, but I am concerned about the overhead, as running a single paragraph through the parser is sending my computer and its fans into the stratosphere. If it gets any hotter I might melt 🥵
I don't fully think this is technically a bug. I feel as though the software is doing what it's intended to do, but I do have some questions that will help me figure out how to tune my use case and make embedding this easier.
Are certain models more computationally heavy than others? What makes them heavier? I don't need a highly technical answer to this, but if you could share a small blurb, maybe I could locate tooling that could reduce these models to make them preferable for an embedded system.
When a paragraph is sent to the TTS as the text argument, I notice that it does some sort of delimiter splitting and, based on what I'm seeing, appears to handle each of those sentences asynchronously. If this is true, can I tune this behavior, or use some configuration flag that will make the models execute sequentially instead?
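For what it's worth, this is roughly the sequential behavior I'm after. It's only a sketch: I'm assuming the TTS.api.TTS Python wrapper from recent releases, and the model name is just an example, not the one I'm actually running.

```python
# Rough sketch of the sequential behavior I'm after: split the paragraph
# myself and feed one sentence at a time to a single model instance.
# Assumptions: the TTS.api.TTS wrapper is available; the model name is only
# an example.
import re
from TTS.api import TTS

paragraph = "First sentence. Second sentence. Third sentence."
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", progress_bar=False, gpu=False)
for i, sentence in enumerate(sentences):
    # One synthesis call at a time, so only one sentence is ever in flight.
    tts.tts_to_file(text=sentence, file_path=f"out_{i}.wav")
```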
Is this software thread-safe? I've experienced similar CPU flooring in the past, specifically with Python-based software on Linux, where threads basically start in-fighting when more than one instance of a multi-threaded component runs. I can't remember exactly where I read about this issue, but if this is a known limitation of generating speech concurrently, that would help me narrow it down.
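In case it's the usual BLAS/OpenMP oversubscription, the mitigation I've used for similar flooring elsewhere is to cap the thread pools before the numeric libraries are imported. This is a sketch; the environment variables are the standard OpenMP/MKL ones, nothing TTS-specific, and the thread count of 2 is arbitrary.

```python
# Sketch of the thread-cap workaround: cap the OpenMP/MKL pools *before*
# numpy/torch are imported, then also cap PyTorch's own intra-op pool.
# The value 2 is an arbitrary example; nothing here is TTS-specific.
import os
os.environ.setdefault("OMP_NUM_THREADS", "2")
os.environ.setdefault("MKL_NUM_THREADS", "2")

import torch
torch.set_num_threads(2)

# ...then load and run the models as usual; each process should now stay
# within roughly two worker threads instead of fanning out to every core.
```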
I am using WSL 2 to execute the models. Is this a problem? Again, I was under the impression that computing the data into models was the heavy part, and that once the models are stable you just feed them inputs, which wouldn't require access to GPUs or AI acceleration hardware. I could be, and may well be, completely wrong about this.
Are there recommended ways to execute these models on optimized platforms? I am also considering having multiple containers sequentially handle computing TTS, but this performance ceiling issue is making me a bit worried about whether that is feasible.
To Reproduce
Using existing models, generate waveforms using tts, but run more than one at a time.
Expected behavior
I expect my CPU not to glow with the burn of a thousand suns, as I'm just executing existing models (and not creating new ones).
Logs
No response
Environment
Additional context
No response