Provides training, inference and voice conversion recipes for RADTTS and RADTTS++: Flow-based TTS models with Robust Alignment Learning, Diverse Synthesis, and Generative Modeling and Fine-Grained Control over Low-Dimensional (F0 and Energy) Speech Attributes.
We are trying to train a singing model. We are satisfied with the timbre of the sound produced by the decoder: it sounds like singing, at least when using ground-truth features from the training data. However, the lyrics are typically not recognizable, at least not after the amount of training that usually yields recognizable speech from text. We know the phoneme encodings are reasonable, since we can train text-to-speech models with them, and we have tried warm-starting from a text-to-speech model. Have you trained a singing model, and if so, what sort of data and training curriculum did you use? Thanks!