NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data

Synthesizing my own text without style transfer gives poor audio results #120

Open ocesp98 opened 1 year ago

ocesp98 commented 1 year ago

When I synthesize my own text with the pretrained Mellotron and WaveGlow models, the audio quality is poor (a very croaky voice). I use the inference method so that no style transfer is performed, but I am not sure what to pass as style_input and f0s. The code below simply synthesizes with speaker id 0 of the pretrained model. Is it normal for the audio quality to be this poor? My end goal is to fine-tune this model on a speech dataset in another language with two speakers.

import torch
import IPython.display as ipd

from text import text_to_sequence

# hparams, arpabet_dict, mellotron, waveglow and denoiser are set up
# beforehand, as in the repo's inference.ipynb.

text = "This is an example sentence."
text_encoded = torch.LongTensor(
    text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()

# Dummy pitch contour (all zeros) and the first speaker of the pretrained model.
f0 = torch.zeros([1, 1, 32]).cuda()
speaker_id = torch.LongTensor([0]).cuda()

# An integer style_input (here 0) selects a single GST token rather than
# encoding a reference mel (see the isinstance check in Tacotron2.inference).
with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, alignments = mellotron.inference(
        (text_encoded, 0, speaker_id, f0))

with torch.no_grad():
    audio = denoiser(waveglow.infer(mel_outputs_postnet, sigma=0.7), 0.01)[:, 0]
ipd.Audio(audio[0].data.cpu().numpy(), rate=hparams.sampling_rate)
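
For what it's worth, my reading of Tacotron2.inference in model.py is that any non-integer style_input is run through the GST reference encoder, so a reference mel (shaped like the training targets) should also be accepted, and f0s can be a real pitch contour such as the one TextMelLoader computes. Is something like this sketch the intended usage? The filelist path and index are placeholders:

from data_utils import TextMelLoader

# Placeholder filelist: each line is "path|text|speaker_id" as in training.
dataloader = TextMelLoader('data/examples_filelist.txt', hparams)
file_idx = 0  # placeholder index of the reference utterance

# TextMelLoader items are (text, mel, speaker_id, f0).
_, ref_mel, _, ref_f0 = dataloader[file_idx]
ref_mel = ref_mel[None].cuda()  # (1, n_mel_channels, T) reference for the GST
ref_f0 = ref_f0[None].cuda()    # (1, 1, T) pitch contour instead of zeros

with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, alignments = mellotron.inference(
        (text_encoded, ref_mel, speaker_id, ref_f0))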
mepc36 commented 1 year ago

Just upvoting to say I had the same problem, so that's +1 for the "this might be normal" vote.