CollectivaT-dev / catotron-cpu

Various architectures of TTS models for Catalan, made easy to run.
BSD 3-Clause "New" or "Revised" License

Future of the repo - Roadmap #7

Open gullabi opened 3 years ago

gullabi commented 3 years ago

As I mentioned in the audio length issue, in the medium term catotron has to change its TTS architecture. There are multiple reasons for this, but before going into the details, the main problem is that the field is rapidly evolving and the open-source projects are not designed as production systems but to showcase new, innovative architectural set-ups.

Having said that, the specific issues with catotron (and catotron-cpu) are:

1) Catotron is based on NVIDIA/tacotron2, and for approximately the last 2 years NVIDIA/tacotron2 has been in maintenance mode. As far as I understand, the development that is going on is not relevant to our needs. In fact, the repository is not structured, or maintained with the mentality, to support multiple languages. That's why catotron existed as a fork with the Catalan front-end (the text processing).

2) catotron-cpu is based on a copy (let's say a broken fork) of NVIDIA/tacotron2 configured for the CPU. The correct way of setting up catotron-cpu would have been to fork NVIDIA/tacotron2 directly, make the necessary configuration changes, and keep the code in sync with the upstream. Until now there have not been major changes in NVIDIA/tacotron2 and the models are still consistent (in fact the training is done in the catotron repo, which in principle should be following the updates of the original).

3) Due to the rapidly evolving nature of TTS architectures, there are now more attractive choices (as mentioned in xx).

These solve some of the problems of TTS for Catalan, but not all. One recent advance has been the use of "streaming" architectures, which work for arbitrarily long sentences. Spectral generation and the vocoder run in an almost parallel fashion, and the audio is accessible as a stream without waiting for the whole process to finish, essentially minimizing latency. This was proposed at the last Interspeech. Unfortunately, there is no open repository that implements this architecture yet.
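To illustrate the idea, here is a toy sketch of such a streaming pipeline: the acoustic model yields mel-spectrogram chunks as they are produced, and the vocoder converts each chunk immediately, so audio is available before the whole utterance is synthesized. All function names and numbers here are illustrative stand-ins, not real catotron/tacotron2 APIs or the Interspeech system.

```python
def generate_mel_chunks(text, chunk_frames=4):
    """Stand-in acoustic model: emit mel frames in small chunks.

    For the sketch we pretend one mel frame (80 bins) per character."""
    frames = [[0.0] * 80 for _ in range(len(text))]
    for start in range(0, len(frames), chunk_frames):
        yield frames[start:start + chunk_frames]

def vocode_chunk(mel_chunk, hop_length=256):
    """Stand-in vocoder: each mel frame becomes hop_length audio samples."""
    return [0.0] * (len(mel_chunk) * hop_length)

def stream_tts(text):
    """Yield audio incrementally; the caller can start playback right away,
    instead of waiting for the full spectrogram and then the full waveform."""
    for mel_chunk in generate_mel_chunks(text):
        yield vocode_chunk(mel_chunk)

audio = []
for chunk in stream_tts("Bon dia"):
    audio.extend(chunk)   # in production this would go to the audio device
print(len(audio))  # prints 1792 (7 frames * 256 samples per frame)
```

The point of the design is that latency is bounded by the first chunk, not by the sentence length, which is exactly what the streaming proposal buys for arbitrarily long inputs.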

My proposal: I have some ideas, but I am looking forward to hearing from the community, based on everybody's needs. My basic idea is to overhaul the repository to make it modular, so that it is possible to use and move to whichever architecture we want, while keeping the API endpoints, docker commands, and other community/Catalan-specific functionalities common where we can. Especially the text parsers and parallelization.
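A minimal sketch of what such a modular layout could look like, assuming each architecture is wrapped behind the same synthesize() call so the API endpoints and docker commands stay unchanged while backends come and go. The class and function names are hypothetical, not existing catotron code.

```python
from abc import ABC, abstractmethod

class TTSBackend(ABC):
    """Common interface every architecture must implement."""

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return WAV bytes for the given (already normalized) text."""

class Tacotron2Backend(TTSBackend):
    """Wrapper around the current NVIDIA/tacotron2-based pipeline."""

    def synthesize(self, text: str) -> bytes:
        return b""  # placeholder: would run the tacotron2 model + vocoder

# Registering backends by name lets a config file or env var pick the
# architecture, without touching the serving layer.
BACKENDS = {"tacotron2": Tacotron2Backend}

def get_backend(name: str) -> TTSBackend:
    """The API layer looks up the configured architecture by name."""
    return BACKENDS[name]()
```

With this shape, switching to a future streaming architecture would mean adding one class and one registry entry, while the Catalan front-end (text normalization) stays shared above the interface.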

Part of the inspiration is the relationship between mozilla/TTS and synesthesiam/docker-mozillatts: mozilla/TTS is a complete library for training, testing, and experimentation, and includes multiple architectures, whereas synesthesiam/docker-mozillatts is aimed at immediate use, taking advantage of the most convenient single architecture/model at a given time for production. In fact, the synesthesiam/docker-mozillatts repo is recognized and referenced in the mozilla/TTS readme.

Why not just contribute to mozilla/TTS? I think it is a good idea to add Catalan support to mozilla/TTS (although I haven't figured out how yet), train the models, and make them available through that repo. But, as in the example of synesthesiam/docker-mozillatts, it is still a good idea to use catotron-cpu for deployment. And if a more attractive architecture appears, catotron-cpu can make use of it, porting its own high-level functionalities.

If we are going to change architecture, does it make sense to invest time in the NVIDIA/tacotron2-based catotron? My answer is yes, and the reason is twofold:

Hence my priorities would be:

Let me know what you think.

gullabi commented 2 years ago

The data processing of Festcat is finished. The total amount of speech has increased considerably:

Ona: from 4 hours to 6 hours 12 minutes
Pau: from 4 hours 15 minutes to 6 hours 54 minutes

The training-ready datasets can be downloaded from here (Ona) and here (Pau).