jordimas opened this issue 3 years ago (status: Open)
This is a general problem with the architecture of neural TTS. The length of the synthesized audio is determined at the training phase, and since the model is trained on 12-second segments, it can only synthesize about 12 seconds. The reason is mostly memory restrictions during training, since everything is computed in memory; the limit can be raised by training the models on GPUs with more memory, but the gain will be marginal and will never reach audiobook lengths.
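As a rough back-of-the-envelope check, a hard cap on decoder steps maps to roughly this duration. The numbers below are an assumption (generic Tacotron 2-style defaults, not values read from the actual Catotron config):

```python
# Assumed Tacotron 2-style defaults; the actual Catotron values may differ.
max_decoder_steps = 1000   # decoder stops (and warns) after this many steps
hop_length = 256           # audio samples per mel-spectrogram frame
sample_rate = 22050        # Hz

# With one mel frame per decoder step, each step covers hop_length samples,
# so the longest utterance the decoder can emit is:
max_seconds = max_decoder_steps * hop_length / sample_rate
print(f"max synthesizable audio ~ {max_seconds:.1f} s")  # ~ 11.6 s, i.e. the ~12 s ceiling
```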
There are currently better architectural alternatives, which are able to synthesize longer text/audio with better performance.
Having said that, even these architectures do not solve the problem of very long synthesis; they can reach up to minutes of audio at most. For a more thorough discussion of how to handle this architectural variety and evolution, see the future-of-the-repo issue.
But for now the solution would be to use a text parser and synthesize the audio sequentially in chunks, as is done in the mycroft catotron plugin. In fact, one positive outcome of this would be the possibility of parallelization, which would also address the other problem of latency.
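A minimal sketch of that chunk-and-concatenate approach is below. It is only an illustration: `synthesize` is a hypothetical stand-in for the actual Catotron/TTS call and is assumed to return a 1-D numpy array of samples, and the sentence splitter is deliberately naive compared to the parser used by the mycroft catotron plugin.

```python
import re
import numpy as np

def split_into_chunks(text, max_chars=200):
    """Split text at sentence boundaries into chunks short enough for the model."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def synthesize_long_text(text, synthesize, pause_s=0.3, sample_rate=22050):
    """Synthesize each chunk independently and concatenate with short pauses.

    `synthesize` is a placeholder for the real TTS call; it is assumed to
    return a 1-D float32 numpy array of audio samples at `sample_rate`.
    """
    pause = np.zeros(int(pause_s * sample_rate), dtype=np.float32)
    pieces = []
    for chunk in split_into_chunks(text):
        pieces.append(synthesize(chunk))
        pieces.append(pause)
    return np.concatenate(pieces) if pieces else np.zeros(0, dtype=np.float32)
```

Since every chunk is synthesized independently, the per-chunk calls can also be dispatched in parallel (for example across workers), which is where the latency benefit mentioned above would come from.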
It seems that the generated audio cannot be longer than 12 seconds. You can try for example the text "VilaWeb fou el primer mitjà digital català en incorporar una plataforma de blogs personals fàcilment gestionable pels mateixos usuaris, el 2004 oferí als lectors i col·laboradors la possibilitat de crear els seus propis blogs, que aconseguiren cert protagonisme i activitat els anys següents."
I see a warning: "Warning! Reached max decoder steps". I do not know if this is related.
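For reference, that message is what Tacotron 2-style decoders print when they hit their configured cap on decoder steps and truncate the output, so it is consistent with the ~12-second ceiling described above. The cap is usually a single hyperparameter; the sketch below assumes the hparams layout of NVIDIA's Tacotron 2 implementation, and whether Catotron exposes the same field is an assumption. As noted above, raising it only buys marginally longer audio, because quality degrades beyond the training segment length.

```python
# Assumed NVIDIA Tacotron 2-style hparams; field name and default may differ in Catotron.
from hparams import create_hparams  # import path as in the NVIDIA Tacotron 2 repo

hparams = create_hparams()
hparams.max_decoder_steps = 2000  # default is often 1000 (~12 s at 22.05 kHz, hop 256);
                                  # the decoder prints the warning when this cap is hit
```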