TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, French, Korean, Chinese, and German, and is easy to adapt to other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0

Vocoder fine-tuning on synthetic data #691

Closed: t-dan closed this issue 2 years ago

t-dan commented 3 years ago

Hello, in the default setting the vocoders are trained on mel spectrograms computed from real speech signals. When they are fed Tacotron-generated spectrograms, the quality is a bit lower.

I would like to try to fine-tune (or train from scratch, it does not matter) a vocoder on the synthesized (i.e. Tacotron-generated) mel spectrograms. However, there is an issue: while the real spectrograms are aligned with the original speech (#frames * hop_size = #samples), this naturally does not hold for the synthesized data.
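To make the mismatch concrete, here is a minimal sketch (hop_size = 256 is just an illustrative value, not taken from any particular config):

```python
# Alignment constraint, schematically. hop_size = 256 is an example value.
hop_size = 256

num_samples = 128_000                       # length of the real waveform
gt_frames = num_samples // hop_size         # 500 frames
assert gt_frames * hop_size == num_samples  # holds for ground-truth mels

# Free-running Tacotron stops whenever its stop token fires, so the frame
# count is whatever the decoder happened to produce, e.g. 493:
synth_frames = 493
print(synth_frames * hop_size == num_samples)  # False: no longer aligned
```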

Has anyone tried to experiment with this?

Thank you, DT

OnceJune commented 3 years ago

Enable GTA (ground-truth aligned) synthesis in Tacotron.
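For context: GTA synthesis runs the trained Tacotron decoder with teacher forcing, conditioning each step on the real previous mel frame instead of the model's own prediction, so the generated mel has exactly as many frames as the ground truth. A framework-agnostic sketch of the idea, with all names illustrative:

```python
import numpy as np

# Illustrative only: a stand-in "decoder" shows the teacher-forcing control
# flow that GTA synthesis uses.
class DummyDecoder:
    def step(self, prev_frame: np.ndarray) -> np.ndarray:
        return prev_frame * 0.9  # placeholder for the real prediction

def gta_decode(decoder, gt_mel: np.ndarray) -> np.ndarray:
    outputs = []
    prev = np.zeros_like(gt_mel[0])      # "go" frame
    for gt_frame in gt_mel:
        outputs.append(decoder.step(prev))
        prev = gt_frame                  # teacher forcing: feed ground truth
    return np.stack(outputs)

gt_mel = np.random.randn(500, 80)        # 500 frames, 80 mel bins
gta_mel = gta_decode(DummyDecoder(), gt_mel)
assert gta_mel.shape == gt_mel.shape     # frame counts match by construction
```

Because the loop runs once per ground-truth frame, the GTA mel stays sample-aligned with the original waveform, which is exactly what the vocoder fine-tuning data needs.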

t-dan commented 3 years ago

OK, thank you. But where can I find it? I can't find such a switch in the Tacotron 2 code.

dathudeptrai commented 3 years ago

@t-dan the code from extract_duration is all you need: https://github.com/TensorSpeech/TensorFlowTTS/blob/master/examples/tacotron2/extract_duration.py#L159-L166. Here we use teacher forcing.
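A rough sketch of that teacher-forced extraction, loosely modeled on the linked script. The TFTacotron2 call signature, config path, checkpoint path, and batch layout are all assumptions here; treat the linked file as authoritative:

```python
# Sketch of teacher-forced (GTA) mel extraction. Paths and field names are
# hypothetical; examples/tacotron2/extract_duration.py is authoritative.
import numpy as np
import tensorflow as tf
import yaml

from tensorflow_tts.configs import Tacotron2Config
from tensorflow_tts.models import TFTacotron2

with open("examples/tacotron2/conf/tacotron2.v1.yaml") as f:  # assumed path
    config = yaml.safe_load(f)
tacotron2 = TFTacotron2(
    config=Tacotron2Config(**config["tacotron2_params"]), name="tacotron2"
)

# A single dummy utterance standing in for a real data-loader batch.
batch = {
    "utt_id": "LJ001-0001",  # hypothetical id, used only for the output name
    "input_ids": tf.constant([[10, 23, 35, 42]], tf.int32),
    "input_lengths": tf.constant([4], tf.int32),
    "speaker_ids": tf.constant([0], tf.int32),
    "mel_gts": tf.constant(np.random.randn(1, 500, 80), tf.float32),
    "mel_lengths": tf.constant([500], tf.int32),
}

# Build the subclassed model once so load_weights works, then load the
# trained checkpoint (path is hypothetical).
_ = tacotron2(batch["input_ids"], batch["input_lengths"], batch["speaker_ids"],
              batch["mel_gts"], batch["mel_lengths"], training=True)
tacotron2.load_weights("model-120000.h5")

# training=True runs the decoder with teacher forcing: it is conditioned on
# the ground-truth previous frame, so the predicted mel has exactly
# mel_lengths frames and stays sample-aligned with the waveform.
decoder_output, gta_mel, stop_tokens, alignments = tacotron2(
    batch["input_ids"], batch["input_lengths"], batch["speaker_ids"],
    batch["mel_gts"], batch["mel_lengths"], training=True)

# Save the GTA mel next to the original waveform; these pairs form the
# vocoder fine-tuning set.
np.save(f"dump/gta/{batch['utt_id']}-gta-mel.npy", gta_mel.numpy()[0])
```

Pairing each saved GTA mel with its original waveform gives a fine-tuning set where the vocoder sees Tacotron-style spectrograms while the frame/sample alignment still holds.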

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.