When trying to synthesize my own text using the pretrained Mellotron and WaveGlow models, I get poor audio quality (a very croaky voice).
I call the inference method so as not to perform style transfer; however, I am also not sure what to pass in as input_style and f0s.
The following code just synthesizes with speaker id 0 of the pretrained model. Is it normal for the audio quality to be this poor? My end goal is to fine-tune this model on a speech dataset in another language with two speakers.
text = "This is an example sentence."
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()
f0 = torch.zeros([1, 1, 32]).cuda()
speaker_id = torch.LongTensor([0]).cuda()
with torch.no_grad():
mel_outputs, mel_outputs_postnet, gate_outputs, alignments = mellotron.inference(
(text_encoded, 0, speaker_id, f0))
with torch.no_grad():
audio = denoiser(waveglow.infer(mel_outputs_postnet, sigma=0.7), 0.01)[:, 0]
ipd.Audio(audio[0].data.cpu().numpy(), rate=hparams.sampling_rate)
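In case it clarifies what I'm unsure about: my guess is that f0s should be a per-frame pitch contour extracted from a reference recording, roughly like the sketch below using librosa.pyin. This is only a guess on my part; the file name reference.wav, the frame/hop lengths, and the use of librosa are my own placeholders, and I don't know whether this matches the frame alignment Mellotron actually expects.

import librosa
import numpy as np
import torch

# Hypothetical pitch extraction from a reference clip (placeholder file name).
ref_audio, sr = librosa.load("reference.wav", sr=22050)
f0_contour, voiced_flag, voiced_prob = librosa.pyin(
    ref_audio,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C7"),
    sr=sr,
    frame_length=1024,
    hop_length=256)  # hop length assumed; may not match the mel frames
f0_contour = np.nan_to_num(f0_contour)  # unvoiced frames become 0
f0s = torch.from_numpy(f0_contour).float()[None, None, :].cuda()  # shape [1, 1, n_frames]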