Feat text2wave performance

This PR depends on the recently merged https://github.com/festvox/speech_tools/pull/24

Thanks to these two commits (see the messages for details), text2wave scales much better with long texts in both memory and disk I/O usage.

Before this PR, text2wave worked as follows: a long text is split into utterances, and each synthesized utterance is saved to a temporary file. At the end, all temporary files are loaded and appended in memory one by one, and the result is finally dumped to the output file. This approach requires double the disk space (for both the temporary files and the final file) and requires the entire synthesized audio to be held in memory.
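To make the cost of the old approach concrete, here is a minimal Python sketch of the same pattern. It is not the actual text2wave code (which is a Festival Scheme script): `fake_synthesize`, `combine_via_temp_files`, and the 16 kHz mono 16-bit format are all assumptions made for illustration.

```python
import os
import tempfile
import wave

RATE = 16000  # assumed sample rate, for illustration only


def fake_synthesize(text):
    """Hypothetical stand-in for the synthesizer: 16-bit mono silence, 10 ms per character."""
    return b"\x00\x00" * (len(text) * RATE // 100)


def write_wav(path, pcm):
    """Write raw 16-bit mono PCM bytes to a WAV file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(RATE)
        w.writeframes(pcm)


def combine_via_temp_files(utterances, out_path):
    """Old pattern: one temporary file per utterance, then join everything in memory."""
    tmp_paths = []
    for utt in utterances:
        fd, path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        write_wav(path, fake_synthesize(utt))
        tmp_paths.append(path)

    combined = b""  # grows until it holds the whole output
    for path in tmp_paths:
        with wave.open(path, "rb") as w:
            combined += w.readframes(w.getnframes())

    write_wav(out_path, combined)  # temp files and final file coexist on disk here
    for path in tmp_paths:
        os.remove(path)
    return len(combined)  # bytes held in memory at the peak
```

In this sketch, both the peak memory use and the extra disk space grow linearly with the length of the input text.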

With the new implementation, each text paragraph is synthesized and written directly to the final audio file. This avoids holding the entire synthesized audio in memory and removes the extra disk I/O for temporary files.
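The streaming idea can be sketched in Python as follows. This is only an illustration of the technique, not the code in this PR: `fake_synthesize`, `stream_to_wav`, and the 16 kHz mono 16-bit format are assumed names and values.

```python
import wave

RATE = 16000  # assumed sample rate, for illustration only


def fake_synthesize(text):
    """Hypothetical stand-in for the synthesizer: 16-bit mono silence, 10 ms per character."""
    return b"\x00\x00" * (len(text) * RATE // 100)


def stream_to_wav(paragraphs, out_path):
    """Append each synthesized paragraph to the output file as soon as it is ready."""
    peak = 0
    with wave.open(out_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(RATE)
        for paragraph in paragraphs:
            pcm = fake_synthesize(paragraph)  # only this chunk is in memory
            peak = max(peak, len(pcm))
            out.writeframes(pcm)
    # On close, the wave module fixes up the header sizes for seekable files.
    return peak  # size in bytes of the largest single chunk
```

Here memory use is bounded by the largest paragraph rather than by the whole text, and no temporary files are created.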

Patch from Debian: https://salsa.debian.org/tts-team/festival/-/blob/9ae3e3ce9d50171cdec62e609309d589c289a534/debian/patches/05-performance-combine-waves.diff
