Interesting! I have been thinking about such an implementation lately. As you can observe, text-to-speech models, unlike most generative models, aren't simply tokenized text in, tokenized text out. Even on a unified platform such as Hugging Face, some of the most popular models, namely XTTS-v2 from Coqui and parler_tts_mini_v0.1 from Parler-TTS, differ in their architecture and lack a proper config.json. That file is crucial for the pipeline (which acts as a sort of universal adaptor that takes text in and spits out a NumPy array of the audio) to work. On top of that, Piper from Rhasspy uses an entirely different approach where the text is phonemized before being fed to the model. This diversity of approaches to text-to-speech makes it difficult to create a single implementation that works across all of these models.
Generally, these are my assumptions. I would like to know what you think and whether there's a workaround for the problem.
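To make the "universal adaptor" point concrete, here is a minimal sketch of the contract the Hugging Face pipeline expects (the model name is just an example of one that ships a standard config.json; XTTS-v2, Parler-TTS and Piper don't fit this path):

```python
# Minimal sketch of the Hugging Face TTS pipeline contract:
# plain text in, NumPy audio array out. This only works for models
# that expose a standard config.json; the example model below is an
# assumption, substitute any pipeline-compatible TTS checkpoint.
from transformers import pipeline

tts = pipeline("text-to-speech", model="facebook/mms-tts-eng")
out = tts("Hello from a standard pipeline model.")

audio = out["audio"]          # NumPy array holding the waveform
rate = out["sampling_rate"]   # sample rate in Hz
print(audio.shape, rate)
```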
So the voice files I have trained were built using Piper; they contain the correct JSON config files. As far as ones trained with other software, I'm unsure.
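For reference, a minimal sketch of what those locally trained Piper files look like when loaded (the paths are hypothetical, and the JSON keys follow the layout Piper normally writes for its voices):

```python
# Sketch: inspect a locally trained Piper voice (.onnx + .onnx.json).
# Paths are hypothetical; key names follow Piper's usual voice config.
import json
import onnxruntime as ort

model_path = "voices/my_voice.onnx"
config_path = "voices/my_voice.onnx.json"

# The JSON config carries the sample rate, phoneme-to-id mapping and
# inference settings that the ONNX graph itself does not expose.
with open(config_path, encoding="utf-8") as f:
    config = json.load(f)
print("sample rate:", config["audio"]["sample_rate"])
print("phoneme ids:", len(config["phoneme_id_map"]))

# Loading the ONNX graph locally confirms the voice is usable offline.
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
print("inputs:", [i.name for i in session.get_inputs()])
```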
In that case it should work. Great feature request! I'll look into including it in an upcoming release. If you'd like to implement it yourself, contributions via PR are always welcome!
Is there a way to use local voices with this package? I have personally trained ONNX files that I would prefer to use over Piper's models.