I have already trained a TTS model with the FastPitch implementation on my custom data. Now I want to fine-tune the vocoder (HiFi-GAN), which was already trained on the same speaker's data. Do I need to fine-tune it? Would that improve quality? My reasoning is that the vocoder would learn to generate audio from imperfect mels (those predicted by FastPitch) rather than from the ground-truth mels.
If yes, can I generate mels with FastPitch to fine-tune the downstream vocoder? And if so, how do I generate mels that have the same length as the ground-truth mels and are aligned with them? I would appreciate some instructions. Thank you!