Audios cannot be longer than 12 seconds

This is a general problem with the architecture of neural TTS. The maximum length of the synthesized audio is fixed at the training phase: since the model is trained on 12-second segments, it can only synthesize up to 12 seconds. The main reason is memory restrictions during training, since everything is computed in memory. Although the limit can be raised by training the models on GPUs with more memory, the gain will be marginal and will never reach audiobook lengths.

There are currently better alternative architectures, which are able to synthesize longer text/audio with better performance:

FastSpeech
FastSpeech2
Tacotron DDC (one of several architectures in mozilla/TTS)

Having said that, even these architectures do not solve the problem of very long synthesis; they can only reach audio lengths on the order of minutes. For a more thorough discussion of how to handle this architectural variety and evolution, see the future of the repo issue.

For now, the solution is to use a text parser and synthesize the audio sequentially in chunks, as is done in the mycroft catotron plugin. In fact, one positive outcome of this approach is the possibility of parallelization, which would also address the other problem, latency.
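
The chunked approach could look roughly like the sketch below. It is not the code of the mycroft catotron plugin. The names `split_into_chunks` and `synthesize_long`, the `max_chars` limit, and the `synthesize` callable (standing in for whatever function turns a short text into audio samples) are all assumptions made for illustration. The character limit is only a rough proxy for the 12-second audio limit and would need tuning for the actual model.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_into_chunks(text: str, max_chars: int = 150) -> List[str]:
    """Split text at sentence boundaries and pack sentences into chunks.

    max_chars is a hypothetical stand-in for the model's audio length limit.
    A single sentence longer than max_chars is kept whole, not split further.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_long(
    text: str,
    synthesize: Callable[[str], Sequence[float]],  # hypothetical TTS call
    max_chars: int = 150,
    workers: int = 1,
) -> List[float]:
    """Synthesize each chunk separately and concatenate the audio samples."""
    chunks = split_into_chunks(text, max_chars)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, so the audio
        # pieces stay in reading order even when workers > 1.
        pieces = list(pool.map(synthesize, chunks))
    audio: List[float] = []
    for piece in pieces:
        audio.extend(piece)
    return audio
```

With `workers=1` the chunks are synthesized sequentially. A larger value runs several chunks at once, which only helps latency if the underlying synthesis call can actually run concurrently (for example, if it releases the GIL or calls out to a separate service).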

CollectivaT-dev / catotron-cpu, issue #4