The energy predictor is not very important; in fact, LightSpeech (https://arxiv.org/abs/2102.04040) suggests removing it completely. More layers in the pitch predictor can lead to improvements, but I'd say the biggest impact is the size of the latent space (adim), which is 384 by default for FastSpeech. Increasing it to e.g. 512, as in Tacotron, might improve the quality a bit, but will in turn of course be a bit slower and require more memory. The same goes for the number of attention heads (aheads).
Overall there are two major bottlenecks for the quality in this toolkit. They were chosen to have the best tradeoff between speed, required resources, and quality, but if speed and resources are ignored, quality can be improved. Those two bottlenecks are the decoder and the vocoder.
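For concreteness, here is a minimal sketch of what such a scaled-up configuration could look like, written as keyword arguments using the parameter names discussed in this thread. Where exactly these are passed (model constructor vs. training script) and the full signature depend on the toolkit version, so treat this as an assumption, and the specific layer counts are only illustrative:

```python
# Hypothetical sketch of a scaled-up FastSpeech2 configuration, using the
# parameter names discussed in this thread. The exact place where these get
# passed in the toolkit is an assumption; the values are only examples.
scaled_up_config = dict(
    adim=512,                     # latent/attention dimension, 384 by default
    aheads=4,                     # attention heads; more can help prosody, but they are expensive
    pitch_predictor_layers=7,     # deeper pitch predictor, as suggested above
    duration_predictor_layers=3,  # deeper duration predictor
)
# model = FastSpeech2(**scaled_up_config)
```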
Thanks for the detailed explanation. I want to train a model for the best quality possible and compare it with the current one to see the difference. So, as you suggested, I will increase adim to 512 and increase the number of layers in the pitch and duration predictors. I have an Indian-languages model trained on VITS (an end-to-end TTS model), and I observed that for unknown speakers the style transfer quality is a bit better there; I think the main thing lacking in FastSpeech2 is the duration and pitch prediction.
I want to see whether we can train a better model that can be compared to VITS. I will report my comparison here once the training and evaluation are completed.
I would also like to know your thoughts on end-to-end models like VITS, because, similar to IMS-Toucan, the language adaptation part in VITS is faster and requires less data.
Yes, I'm also not too happy with the style transfer in the current version; it's the main thing I'm currently working on, but it has turned out to be quite hard. Possibly the pitch predictor and duration predictor should be told about the speaker embeddings more explicitly, otherwise the prosody doesn't change much, only the voice of the speaker.
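As a rough illustration of what "telling the predictors about the speaker embedding more explicitly" could mean in code (all names here are hypothetical, and this is not how the toolkit currently does it): project the utterance embedding to the encoder dimension and add it to the encoder states right before they go into the duration and pitch predictors.

```python
import torch

class SpeakerConditionedPredictorInput(torch.nn.Module):
    """Hypothetical sketch: make the duration and pitch predictors 'aware' of the
    speaker by adding a projected utterance embedding to the sequence they see."""

    def __init__(self, adim=384, utt_embedding_dim=256):
        super().__init__()
        self.projection = torch.nn.Linear(utt_embedding_dim, adim)

    def forward(self, encoder_states, utt_embedding):
        # encoder_states: [batch, time, adim], utt_embedding: [batch, utt_embedding_dim]
        conditioning = self.projection(utt_embedding).unsqueeze(1)  # [batch, 1, adim]
        # broadcast over the time axis, so every phone sees the speaker information
        return encoder_states + conditioning
```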
Sounds good, looking forward to hearing about whether scaling up the parameters helps.
Generally I think end-to-end models aren't that well suited for language adaptation in cases where you have only very, very few datapoints for a new language. The vocoding task can be solved completely independently of the language, so we can just have that part in a separate model to make the language adaptation task easier for the spectrogram generator. Every component that is language-independent should be handled by a separate model, so that the language-specific model can focus on what's most important. There is a tradeoff: e2e models are a bit simpler to manage, and the variational autoencoder design of VITS is a clever solution to a couple of problems that old-schoolish TTS has, but overall I think for the low-resource TTS task it's better to split the processing steps. If I wanted to make the best possible TTS, I would probably use a different approach, but Toucan is designed mostly for low-resource settings.
I trained a model with adim=512, duration_predictor_layers=3, duration_predictor_chans=512, pitch_predictor_layers=7, pitch_predictor_chans=512, energy_predictor_layers=3, and energy_predictor_chans=512 instead of the default parameters, for 65k steps with batch size 32. The model quality sounds a bit better, but the pronunciation of some of the words is wrong; maybe it needs to be trained for more steps. What do you think about training for more steps? What would be the minimum number of training steps to avoid such pronunciation errors? And do you suggest any parameter changes, or did I overdo any of the parameters?
Note: My training data for all languages combined is more than 100 hours.
The VITS model training is still not completed, but I am expecting better prosody transfer based on my old model. I will try to update on that too once it is finished and tested.
I completed training the VITS model for 120k steps. The model quality is not as good as the FS2 model, maybe due to the sample rate change from 16k to 48k in the VITS model I trained; there is a lot of stutter in the output. I don't think it can learn as fast as FastSpeech2. It needs to be trained for more than 250k steps to reach a reasonable quality, which is significantly more training steps than FS2 requires.
I didn't change aheads in the new FS2 model. Do you think increasing aheads to maybe 6 or 8 will improve prosody prediction? And does fine-tuning the HiFiGAN vocoder on the training data help improve quality, or is the provided model enough?
You mentioned PortaSpeech, and I think the same team developed DiffSinger, which seems to perform well for the singing task. Do you think borrowing a few layers from that model, especially the layers for f0 prediction and the MIDI layers, could improve prosody in our model? If you believe it is possible, can you give some pointers on which layers could help improve prosody?
Thanks
What do you think about training for more steps? What would be the minimum number of training steps to avoid such pronunciation errors? And do you suggest any parameter changes, or did I overdo any of the parameters?
The number of steps is a difficult thing to estimate, because it depends on many factors and our objective function is not perfectly accurate, since we are dealing with a one-to-many task. That's why VAEs, such as the one in VITS, are very well suited for the TTS task. Too many steps can cause the loss to keep decreasing while the actual quality of the produced speech is getting worse again. The best thing is to just produce some audios at different step counts and decide based on your own intuition. I usually go for 100k steps for English audiobook data and 130k steps for German audiobook data. The more data you have, the more steps you can safely take. Your parameters look fine to me.
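In practice that can be as simple as a loop over the saved checkpoints that renders the same held-out sentence with each of them and writes the results to disk for listening. A sketch; the synthesize helper and the checkpoint naming pattern below are purely hypothetical and stand in for whatever inference interface and naming you actually use:

```python
from pathlib import Path

def compare_checkpoints(checkpoint_dir, synthesize,
                        sentence="An example sentence to judge pronunciation and prosody.",
                        out_dir="step_comparison"):
    """Render the same sentence with every saved checkpoint, so step counts can
    be compared by ear instead of by loss value.
    synthesize(text, checkpoint_path, file_location) is a hypothetical helper."""
    Path(out_dir).mkdir(exist_ok=True)
    for checkpoint in sorted(Path(checkpoint_dir).glob("checkpoint_*.pt")):
        synthesize(text=sentence,
                   checkpoint_path=str(checkpoint),
                   file_location=f"{out_dir}/{checkpoint.stem}.wav")
```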
I didn't change aheads in the new FS2 model. Do you think increasing aheads to maybe 6 or 8 will improve prosody prediction? And does fine-tuning the HiFiGAN vocoder on the training data help improve quality, or is the provided model enough?
The attention heads are actually pretty important for the prosody, but increasing their amount leads to diminishing returns. I haven't tried it with FastSpeech, but back when I was using Tacotron and TransformerTTS, the number of attention heads played a very significant role in the final prosody. They are just super expensive in terms of computation cost, because of the quadratic scaling and because we have thousands of frames in the decoder.
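The quadratic part can be seen from the shape of the attention score matrices alone; a toy back-of-the-envelope calculation (the numbers are purely illustrative):

```python
# Toy illustration of why attention over spectrogram frames is expensive:
# the score matrices alone grow quadratically with the number of frames,
# and each head keeps its own score matrix.
def attention_score_memory_mb(num_frames, num_heads, bytes_per_value=4):
    # one [num_heads, num_frames, num_frames] score tensor per self-attention layer
    return num_heads * num_frames * num_frames * bytes_per_value / 1e6

print(attention_score_memory_mb(num_frames=2000, num_heads=4))  # 64.0 MB per layer
print(attention_score_memory_mb(num_frames=2000, num_heads=8))  # 128.0 MB per layer
```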
The vocoder is agnostic to speakers and languages, so finetuning it should not be necessary. I'm working on extending our HiFiGAN vocoder to Avocodo (https://arxiv.org/abs/2206.13404) right now, that should further increase the quality and according to the authors improve the quality for unseen speakers (although I don't really see why that would be the case)
You mentioned PortaSpeech, and I think the same team developed DiffSinger, which seems to perform well for the singing task. Do you think borrowing a few layers from that model, especially the layers for f0 prediction and the MIDI layers, could improve prosody in our model? If you believe it is possible, can you give some pointers on which layers could help improve prosody?
There are probably a few components in the architecture that could help a lot, but they are fairly difficult to integrate. The whole VAE followed by a flow in their decoder architecture is a great idea. The f0 prediction in FastSpeech is already pretty good, I think; the problem is more that for standard data it finds a mostly flat average solution instead of one that favors more interesting prosody curves. That's something that a VAE would probably be pretty good for. Maybe there are some tricks that could be done with data augmentation; I think a paper on that was just presented at Interspeech 2022 two days ago.
I think that in order to get the best prosody, we need to include semantic information about the text content in the TTS, like in https://www.isca-speech.org/archive/pdfs/interspeech_2019/hayashi19_interspeech.pdf. The big problem with that is that I want the TTS models to be unconstrained in terms of which languages can be used, but multilinguality and word embeddings are difficult to get into the same boat.
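One cheap way to experiment with this without giving up multilinguality entirely might be a sentence-level vector from a multilingual language model as additional conditioning. The sketch below only shows the extraction side with multilingual BERT as an example model choice; how to feed the vector into the TTS is exactly the open question:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Multilingual BERT as one example of a language-agnostic semantic encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

def sentence_vector(text):
    """Return a crude utterance-level semantic embedding (mean-pooled hidden states)."""
    with torch.no_grad():
        tokens = tokenizer(text, return_tensors="pt")
        hidden_states = bert(**tokens).last_hidden_state  # [1, num_tokens, 768]
    return hidden_states.mean(dim=1)  # [1, 768]; would need projection before conditioning the TTS
```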
The attention heads are actually pretty important for the prosody, but increasing their amount leads to diminishing returns. I haven't tried it with FastSpeech, but back when I was using Tacotron and TransformerTTS, the number of attention heads played a very significant role in the final prosody. They are just super expensive in terms of computation cost, because of the quadratic scaling and because we have thousands of frames in the decoder.
I am actually planning to train a new model with aheads=6 and revert the energy parameters to the defaults, since I didn't see any significant improvement in energy prediction. Even if they are expensive, I want to see if more attention heads can improve prosody.
The vocoder is agnostic to speakers and languages, so finetuning it should not be necessary. I'm working on extending our HiFiGAN vocoder to Avocodo (https://arxiv.org/abs/2206.13404) right now, that should further increase the quality and according to the authors improve the quality for unseen speakers (although I don't really see why that would be the case)
Yes, I noticed that in the branch "EXPERIMENTAL_double_embedding_func" you are implementing the Avocodo vocoder; I am eagerly looking forward to seeing how it compares to HiFiGAN. When do you think this vocoder will be ready for testing, and will it be compatible with the existing FastSpeech2 model?
There are probably a few components in the architecture that could help a lot, but they are fairly difficult to integrate. The whole VAE followed by a flow in their decoder architecture is a great idea. The f0 prediction in FastSpeech is already pretty good, I think; the problem is more that for standard data it finds a mostly flat average solution instead of one that favors more interesting prosody curves. That's something that a VAE would probably be pretty good for. Maybe there are some tricks that could be done with data augmentation; I think a paper on that was just presented at Interspeech 2022 two days ago.
Data augmentation sounds like a good idea to improve f0 prediction, especially for unseen speakers; I will give it a try in my next training. Can you suggest any fast and easy way to implement data augmentation in the current preprocessing process? I am also planning to add the Blizzard dataset to my current training data. I am not sure if it helps with prosody, but I want to try it once.
I think that in order to get the best prosody, we need to include semantic information about the text content in the TTS, like in https://www.isca-speech.org/archive/pdfs/interspeech_2019/hayashi19_interspeech.pdf. The big problem with that is that I want the TTS models to be unconstrained in terms of which languages can be used, but multilinguality and word embeddings are difficult to get into the same boat.
I agree that adding semantic information about the text will limit the multilingual part of the model, but I think that if we can integrate a VAE into the current model, there will be a significant improvement in prosody and in the overall quality for unseen speakers. I am not very good with PyTorch; I used to use TensorFlow, but since all the well-known TTS models are in PyTorch, I moved to PyTorch and am still learning. Do you think it is possible to integrate a VAE into the current model?
Thanks
When do you think this vocoder will be ready for testing, and will it be compatible with the existing FastSpeech2 model?
The vocoder training has been going on for over a week now and the loss is still going down. It looks like this model needs a lot more steps than HiFiGAN on its own, which contradicts what others have found. With a better set of hyperparameters it would probably go much faster. For now I'll just keep the training run going. It will probably take at least one more week until it surpasses HiFiGAN, if that happens at all. The spectrogram input will stay the same, so it will be compatible with all previous models.
Can you suggest any fast and easy way to implement data augmentation in the current preprocessing process?
Unfortunately not, since it would probably be best to augment the wave and then extract the spectrogram from the augmented wave, but for FastSpeech training we only save the spectrogram and discard the wave. With the current setup it would probably be best to create new audio files and add them to the training set, rather than modifying the audio on the fly, even though on-the-fly augmentation would be preferable overall.
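A minimal offline sketch of that idea, writing pitch-shifted copies of each file so they can simply be added to the training set alongside the originals. librosa and soundfile are assumed to be available, the directories are placeholders, and whether this particular perturbation actually helps prosody is an open question:

```python
import librosa
import soundfile as sf
from pathlib import Path

def augment_corpus(in_dir, out_dir, semitone_shifts=(-2, 2)):
    """Write pitch-shifted copies of every wav, to be added to the training set
    next to the originals before feature extraction."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for wav_path in Path(in_dir).glob("*.wav"):
        wave, sr = librosa.load(wav_path, sr=None)  # keep the original sample rate
        for n_steps in semitone_shifts:
            shifted = librosa.effects.pitch_shift(wave, sr=sr, n_steps=n_steps)
            sf.write(Path(out_dir) / f"{wav_path.stem}_shift{n_steps}.wav", shifted, sr)
```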
Do you think it is possible to integrate a VAE into the current model?
That would be very nice and I also think that it is possible, especially since a lot of code can probably be taken from the PortaSpeech reference implementation (https://github.com/NATSpeech/NATSpeech). Unfortunately I don't have any time at the moment. The semester is about to start, so I have to spend a lot of time preparing for teaching. I hope that I can try the VAE/flow combination from PortaSpeech in the future. If you want to get some practice with PyTorch, feel free to try it yourself :)
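If you want to get a feeling for the moving parts first, the core of such a VAE is quite small. Below is a self-contained sketch of the usual posterior encoder / reparameterization / KL pieces; wiring the latent into FastSpeech2's decoder (and replacing the posterior sampling with a prior or a flow at inference time, as PortaSpeech does) is the actual hard part and is not shown here:

```python
import torch

class MinimalVAECore(torch.nn.Module):
    """Bare-bones VAE core: encode a reference into mu/logvar, sample a latent
    with the reparameterization trick, and regularize with a KL term."""

    def __init__(self, input_dim=80, latent_dim=16):
        super().__init__()
        self.encoder = torch.nn.Sequential(torch.nn.Linear(input_dim, 256), torch.nn.ReLU())
        self.to_mu = torch.nn.Linear(256, latent_dim)
        self.to_logvar = torch.nn.Linear(256, latent_dim)

    def forward(self, reference_frames):
        # reference_frames: [batch, time, input_dim], e.g. ground-truth spectrogram frames
        hidden = self.encoder(reference_frames).mean(dim=1)        # crude utterance-level pooling
        mu, logvar = self.to_mu(hidden), self.to_logvar(hidden)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterization trick
        kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl_loss  # z would condition the decoder; kl_loss is added to the training objective
```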
Thanks for the input, looking forward to seeing how the new vocoder works.
I will try to combine the VAE/flow from PortaSpeech with FastSpeech2.
Hi, I am curious: if we increase the number of layers for the duration, pitch, and energy predictors using the 'duration_predictor_layers' parameter and the other corresponding parameters in the architecture, will it bring the accuracy of the duration and pitch predictions closer to the audio of the given embedding sample?
If it does, can you suggest some of the parameters that I could tweak to train a bigger and better model?
Thanks