CollectivaT-dev / catotron-cpu

Various architectures of TTS models for Catalan, made easy to run.
BSD 3-Clause "New" or "Revised" License

Future of the repo - Roadmap #7

Open gullabi opened 3 years ago

gullabi commented 3 years ago

As I mentioned in the audio length issue, in the medium term catotron has to change its TTS architecture. There are multiple reasons for this, but before going into the details, the main problem is that the field is rapidly evolving and the open-source projects are not designed as production systems but to showcase new, innovative architectural set-ups.

Having said that, the specific issues with catotron (and catotron-cpu) are:

1) Catotron is based on NVIDIA/tacotron2, and for approximately the last 2 years NVIDIA/tacotron2 has been in maintenance mode. As far as I understand, the development that is going on is not relevant to our needs. In fact, the repository is not structured, or maintained with the mentality, to support multiple languages. That's why catotron existed as a fork with the Catalan front-end (the text processing).

2) catotron-cpu is based on a copy (let's say a broken fork) of NVIDIA/tacotron2 configured for the CPU. The correct way of setting up catotron-cpu would have been to fork NVIDIA/tacotron2 directly, make the necessary configuration changes, and keep the code in sync with the upstream. Until now there have not been major changes in NVIDIA/tacotron2 and the models are still consistent (in fact the training is done in the catotron repo, which in principle should be following the updates of the original).

3) Due to the rapidly evolving nature of TTS architectures, there are now more attractive choices (as mentioned in xx).

These solve some of the problems of TTS for Catalan, but not all. One recent advance has been the use of "streaming" architectures, which work for arbitrarily long sentences. Spectral generation and the vocoder run in an almost parallel fashion, and the audio is accessible as a stream without waiting for the whole process to finish, essentially minimizing latency. This was proposed at the last Interspeech. Unfortunately, there is no open repository that implements this architecture yet.
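To illustrate the idea, here is a toy sketch of such a streaming pipeline: the acoustic model yields mel-spectrogram chunks as they are produced, and the vocoder converts each chunk immediately, so audio is available before the whole utterance is synthesized. All function names and numbers here are illustrative stand-ins, not real catotron/tacotron2 APIs or the Interspeech system.

```python
def generate_mel_chunks(text, chunk_frames=4):
    """Stand-in acoustic model: emit mel frames in small chunks.

    For the sketch we pretend one mel frame (80 bins) per character."""
    frames = [[0.0] * 80 for _ in range(len(text))]
    for start in range(0, len(frames), chunk_frames):
        yield frames[start:start + chunk_frames]

def vocode_chunk(mel_chunk, hop_length=256):
    """Stand-in vocoder: each mel frame becomes hop_length audio samples."""
    return [0.0] * (len(mel_chunk) * hop_length)

def stream_tts(text):
    """Yield audio incrementally; the caller can start playback right away,
    instead of waiting for the full spectrogram and then the full waveform."""
    for mel_chunk in generate_mel_chunks(text):
        yield vocode_chunk(mel_chunk)

audio = []
for chunk in stream_tts("Bon dia"):
    audio.extend(chunk)   # in production this would go to the audio device
print(len(audio))  # prints 1792 (7 frames * 256 samples per frame)
```

The point of the design is that latency is bounded by the first chunk, not by the sentence length, which is exactly what the streaming proposal buys for arbitrarily long inputs.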

My proposal: I have some ideas, but I am looking forward to hearing from the community, based on everybody's needs. My basic idea is to overhaul the repository to make it modular, so that it is possible to use and move to whichever architecture we want, while keeping the API endpoints, docker commands, and other community/Catalan-specific functionalities common where we can. Especially the text parsers and parallelization.
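A minimal sketch of what such a modular layout could look like, assuming each architecture is wrapped behind the same synthesize() call so the API endpoints and docker commands stay unchanged while backends come and go. The class and function names are hypothetical, not existing catotron code.

```python
from abc import ABC, abstractmethod

class TTSBackend(ABC):
    """Common interface every architecture must implement."""

    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return WAV bytes for the given (already normalized) text."""

class Tacotron2Backend(TTSBackend):
    """Wrapper around the current NVIDIA/tacotron2-based pipeline."""

    def synthesize(self, text: str) -> bytes:
        return b""  # placeholder: would run the tacotron2 model + vocoder

# Registering backends by name lets a config file or env var pick the
# architecture, without touching the serving layer.
BACKENDS = {"tacotron2": Tacotron2Backend}

def get_backend(name: str) -> TTSBackend:
    """The API layer looks up the configured architecture by name."""
    return BACKENDS[name]()
```

With this shape, switching to a future streaming architecture would mean adding one class and one registry entry, while the Catalan front-end (text normalization) stays shared above the interface.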

Part of the inspiration is the relationship between mozilla/TTS and synesthesiam/docker-mozillatts: mozilla/TTS is a complete library for training, testing, and experimentation, and includes multiple architectures, whereas synesthesiam/docker-mozillatts is aimed at immediate use, taking advantage of the most convenient single architecture/model at a given time for production. In fact, the synesthesiam/docker-mozillatts repo is recognized and referenced in the mozilla/TTS readme.

Why not just contribute to mozilla/TTS? I think it is a good idea to add Catalan support to mozilla/TTS (although I haven't figured out how yet), train the models, and make them available through that repo. But, as in the example of synesthesiam/docker-mozillatts, it is still a good idea to use catotron-cpu for deployment. And if a more attractive architecture appears, catotron-cpu can make use of it, porting its own high-level functionalities.

If we are going to change architecture, does it make sense to invest time in the NVIDIA/tacotron2-based catotron? My answer is yes, and the reason is twofold:

Hence my priorities would be:

Let me know what you think.

gullabi commented 2 years ago

The data processing of Festcat is finished. The total amount of speech has increased considerably:

Ona: from 4 hours to 6 hours 12 minutes
Pau: from 4 hours 15 minutes to 6 hours 54 minutes

The training-ready datasets can be downloaded from here (Ona) and here (Pau).