152334H / tortoise-tts-fast

Fast TorToiSe inference (5x or your money back!)
GNU Affero General Public License v3.0

Decoupling voice model generation from text generation. #8

Open perkel666 opened 1 year ago

perkel666 commented 1 year ago

The issue.

If I understand it right, tortoise does this:

Which means that every time it produces a sentence, it repeats the finetuning.



The solution

hesz94 commented 1 year ago

That's precisely what the "get_conditioning_latents" script does. The model isn't finetuned/trained on the voices; rather, it extracts "latents" (think "voice description") as a pre-processing step, and then uses these latents "kind of" like a seed for the speech generation.

Sadly, decoupling this process won't be a big time-saver: in the current iteration, generating the latents only takes a couple of seconds. It's still a step forward, though.
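
To make the "extract once, reuse many times" idea concrete, here is a minimal sketch of caching latents on disk per voice. It is not tortoise code: `load_or_extract_latents`, `fake_extract` and the `.latents.pkl` file name are all hypothetical, and the extractor is passed in as a plain callable, so nothing is assumed about the real script's signature. In practice the callable would presumably wrap whatever the latent-extraction step returns.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_extract_latents(
    voice: str, cache_dir: Path, extract: Callable[[str], Any]
) -> Any:
    """Return cached latents for `voice`, extracting and saving them on first use."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{voice}.latents.pkl"  # hypothetical naming scheme

    # Cache hit: skip extraction entirely.
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)

    # Cache miss: run the (possibly slow) extraction once and persist the result.
    latents = extract(voice)
    with cache_file.open("wb") as f:
        pickle.dump(latents, f)
    return latents


# Stand-in extractor (an assumption for illustration, not the real script).
calls = []


def fake_extract(voice: str) -> dict:
    calls.append(voice)
    return {"voice": voice, "latents": [0.1, 0.2, 0.3]}


cache = Path(tempfile.mkdtemp())
first = load_or_extract_latents("emma", cache, fake_extract)
second = load_or_extract_latents("emma", cache, fake_extract)  # served from disk
```

With this shape, every sentence generated for the same voice after the first one reads the stored latents instead of recomputing them. As noted above, the saving is only the couple of seconds that extraction takes.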