Training for singing models

We are trying to train a singing model. We are satisfied with the timbre the decoder produces: it sounds like singing, at least when using ground-truth features from the training data. However, the lyrics are usually not recognizable, at least after the amount of training that typically yields recognizable speech from text. We know the phoneme encodings are reasonable, since we can train text-to-speech models, and we have also tried warm-starting from a text-to-speech model. Have you trained a singing model? If so, what sort of data and training curriculum did you use? Thanks!

NVIDIA / radtts, issue #29