Reqeique / Dimits

MIT License

Local voices #3

Closed ThisModernDay closed 2 months ago

ThisModernDay commented 2 months ago

Is there a way to use local voices with this package? I have personally trained ONNX files that I would prefer to use over Piper's models.

Reqeique commented 2 months ago

Interesting! I have been thinking about such an implementation lately. As you may have noticed, text-to-speech models, unlike most generative models, aren’t just tokenized text in, tokenized text out. Even on a unified platform such as Hugging Face, some of the most popular models, namely XTTS-v2 from Coqui and parler_tts_mini_v0.1 from Parler-TTS, differ in their architecture and lack a proper config.json. That file is crucial for the pipeline (a sort of universal adaptor that takes an input and spits out a NumPy array of the audio) to work. In addition, Piper from Rhasspy takes an entirely different approach, where the text is phonemized before being fed into the synthesis step. This diversity of approaches makes it difficult to create a single implementation that works across different models. These are just my assumptions, though.

I would like to know what you think and whether there's a workaround for the problem.
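
To make the "universal adaptor" idea concrete, here is a rough sketch of what a shared interface could look like. Everything in it is hypothetical: `TTSBackend`, `ToneBackend`, and `speak` are illustrative names, not part of Dimits or any of the libraries mentioned, and it returns a plain list of floats instead of a NumPy array to stay dependency-free.

```python
import math
from typing import List, Protocol


class TTSBackend(Protocol):
    """Hypothetical common interface: text in, raw audio samples out."""

    sample_rate: int

    def synthesize(self, text: str) -> List[float]: ...


class ToneBackend:
    """Stand-in backend that emits 10 ms of a 440 Hz sine per character.

    A real backend would wrap a specific model and keep its own
    preprocessing (for example phonemization) inside synthesize().
    """

    def __init__(self, sample_rate: int = 16000) -> None:
        self.sample_rate = sample_rate

    def synthesize(self, text: str) -> List[float]:
        samples_per_char = self.sample_rate // 100  # 10 ms per character
        total = samples_per_char * len(text)
        return [
            math.sin(2 * math.pi * 440 * n / self.sample_rate)
            for n in range(total)
        ]


def speak(backend: TTSBackend, text: str) -> List[float]:
    """The caller depends only on the interface, not on the model behind it."""
    return backend.synthesize(text)
```

The hard part described above is not the interface itself but writing a backend per model family, since each one needs its own config handling and preprocessing.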

ThisModernDay commented 2 months ago

So the voice files I have were trained and built using Piper. They contain the correct JSON config files. As for ones trained with other software, I'm unsure.
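
For reference, a minimal sketch of how a local Piper-style voice could be located and sanity-checked before loading. It assumes the config sits next to the model as `<name>.onnx.json` and that the sample rate lives under `audio.sample_rate`; `find_local_voice` and `voice_sample_rate` are hypothetical helpers, not functions from Dimits or Piper.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def find_local_voice(model_path: str) -> Tuple[Path, dict]:
    """Return the ONNX model path and its parsed config, or raise."""
    model = Path(model_path)
    if model.suffix != ".onnx" or not model.is_file():
        raise FileNotFoundError(f"No ONNX model at {model}")
    # Assumption: the config is stored alongside the model as "<name>.onnx.json".
    config_path = model.with_name(model.name + ".json")
    if not config_path.is_file():
        raise FileNotFoundError(f"Missing config {config_path}")
    with config_path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    return model, config


def voice_sample_rate(config: dict) -> Optional[int]:
    """Read the sample rate if present (assumed key: audio.sample_rate)."""
    return config.get("audio", {}).get("sample_rate")
```

Usage would be something like `model, config = find_local_voice("voices/my_voice.onnx")`, where the path is just an example.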

Reqeique commented 2 months ago

In that case it should work. Great feature request! I'm considering including this in the upcoming release. If you're interested in implementing it yourself, contributions via PR are always welcome!