Congratulations on your acceptance at AAAI.
I have some questions about your training method.
From reading other papers that use similar techniques, I see that some of them train on much larger datasets such as LibriLight (VALL-E, SpeechX). Have you tested with datasets bigger than LibriTTS, or do you think doing so would make a significant difference for zero-shot editing and continuation compared with training on LibriTTS alone?
Could fine-tuning a pre-trained model be done in a shorter time by grouping a new, unseen speaker's samples with a small amount of data from seen speakers? Have you tried this?
In the paper you trained text2vec for 50 epochs on LibriTTS; did you still see improvements after 50 epochs, or were the changes negligible?
Thanks again for this wonderful project.
Of course, scaling up the dataset (as well as the model) will produce better results. As this is a university-launched project, we haven't trained on LibriLight, since that would require much larger computational resources, but we believe it is worth a try. Note that decoder-only language models like VALL-E have a much larger minimum data requirement to work well: if you train a VALL-E on only a small, academic-scale dataset, the performance can hardly match what is shown in its demo. Non-autoregressive models like UniCATS, by contrast, adapt well to both small and large corpora.
If you have a good amount of data from the target speaker, rather than only a single sentence, you can certainly try fine-tuning a pre-trained model. Note, however, that the problem then becomes few-shot TTS rather than zero-shot TTS. UniCATS is aimed at the zero-shot scenario, where we do not plan to fine-tune the model on the target speaker.
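If you do go the few-shot route, the recipe is the standard fine-tuning one: load the released checkpoint, mix the target speaker's utterances with a small amount of seen-speaker data so the model does not collapse onto a single voice, and train briefly at a reduced learning rate. Below is a minimal PyTorch-style sketch; the model object, dataset objects, checkpoint paths, and the `compute_loss` call are hypothetical placeholders, not the actual UniCATS code.

```python
# Minimal few-shot fine-tuning sketch (hypothetical, not the UniCATS API).
# `model`, `target_set`, and `seen_subset` stand in for your own acoustic
# model and dataset objects; only standard PyTorch calls are used.
import torch
from torch.utils.data import ConcatDataset, DataLoader


def finetune_on_target_speaker(model, target_set, seen_subset,
                               ckpt_in="text2vec_libritts.pt",
                               ckpt_out="text2vec_finetuned.pt",
                               epochs=5, lr=1e-5, batch_size=8):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Start from the pre-trained checkpoint.
    model.load_state_dict(torch.load(ckpt_in, map_location=device))
    model.to(device)
    model.train()

    # Mix the unseen speaker's few utterances with a small subset of
    # seen speakers to reduce overfitting to one voice.
    loader = DataLoader(ConcatDataset([target_set, seen_subset]),
                        batch_size=batch_size, shuffle=True)

    # Fine-tune briefly with a much smaller learning rate than pre-training.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = model.compute_loss(batch)  # placeholder loss API
            loss.backward()
            optimizer.step()

    torch.save(model.state_dict(), ckpt_out)
    return model
```

With only a few minutes of target-speaker audio, a handful of epochs at a small learning rate is usually enough; training much longer mainly risks overfitting to the new speaker.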
We assume that 50 epochs are enough for the model to converge and that another 50 epochs would yield only marginal improvements. Given the training time involved, we chose to stop at that point.
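A simple way to check whether further epochs are worth the compute is to track validation loss and stop once the relative improvement becomes negligible. The helper below is a generic sketch of that criterion, not part of the UniCATS training scripts.

```python
# Generic diminishing-returns check: stop when the best validation loss
# has improved by less than 1% over the last few epochs. Illustrative
# only; not tied to the UniCATS codebase.
def should_stop(val_losses, patience=5, min_rel_improvement=0.01):
    """Return True if the best loss did not improve by at least
    `min_rel_improvement` (relative) within the last `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return (best_before - best_recent) / best_before < min_rel_improvement


# Example: the loss curve has flattened, so training would stop here.
history = [2.0, 1.5, 1.2, 1.05, 1.01, 1.005, 1.004, 1.004, 1.003, 1.003]
print(should_stop(history))  # True
```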