NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

The effect of passing in the original MEL seems minimal #45

Closed: pneumoman closed this issue 3 years ago

pneumoman commented 4 years ago

Hi,

While studying how this model works with the pre-trained checkpoint, I found that the reference mel seemed to have little effect. In fact, in my very small blind poll (my co-worker and our 'sound guy'), the consensus was that passing it in was actually detrimental. Curious about your thoughts.

# from the code snippet in cell 14 of inference.ipynb;
# text_encoded, mel, speaker_id, pitch_contour and rhythm are all
# prepared in the preceding notebook cells
import torch

with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, _ = mellotron.inference_noattention(
        (text_encoded, mel, speaker_id, pitch_contour, rhythm))
moonnee commented 3 years ago

Hi, I ran into the same question. When I synthesize from the Hallelujah musicxml with the provided checkpoint, the synthesized samples show little difference when I change the mel input. Here are three examples, all synthesized with sid=40, pan=-42; the mel inputs are zeros, the mel extracted from example1.wav, and the mel extracted from example2.wav, respectively: https://drive.google.com/file/d/10fQdc25FJMHb7Twpwhie35hf2C8I69Dv/view?usp=sharing

I am curious what the effect of the global style token is here. When synthesizing from musicxml we do not have a natural mel to pass in, so it feels strange that one is required. Please advise. Thanks.
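One quick way to check whether the reference mel is reaching the output at all is to compare the style embeddings it produces. This is a minimal sketch, assuming the checkpoint is loaded as mellotron and that the model exposes its GST encoder as mellotron.gst (an assumption based on the repo's model.py; the attribute name and expected input shape may differ):

# Hypothetical sanity check: compare the style embedding of a zero
# reference mel against a real one. mellotron.gst is assumed to be the
# GST encoder; mel is the (1, n_mel_channels, T) reference spectrogram.
import torch

with torch.no_grad():
    emb_zero = mellotron.gst(torch.zeros_like(mel))  # blank reference
    emb_real = mellotron.gst(mel)                    # extracted reference
    # If this distance is near zero, the reference mel is effectively ignored.
    print(torch.norm(emb_real - emb_zero).item())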

rafaelvalle commented 3 years ago

We recently fixed a bug in our code that scaled the audio inputs to the wrong range and made the mels ineffective. Please pull from master, use a mel sample with characteristics that differ from your training set (screaming or whispering, for example), try again, and let us know.
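For context, the kind of scaling the fix restores looks roughly like this. A minimal sketch, assuming a 16-bit PCM wav and an stft instance of the repo's layers.TacotronSTFT; the actual loader in the repo may differ:

# Minimal sketch: load a reference wav and normalize it to [-1, 1]
# before mel extraction. 32768.0 is full scale for 16-bit signed PCM;
# feeding raw int16 amplitudes instead would put the mels far outside
# the range the model saw during training.
import torch
from scipy.io.wavfile import read

MAX_WAV_VALUE = 32768.0

def get_reference_mel(path, stft):
    sampling_rate, data = read(path)                   # int16 samples
    audio = torch.FloatTensor(data.astype('float32'))
    audio_norm = (audio / MAX_WAV_VALUE).unsqueeze(0)  # scale to [-1, 1]
    return stft.mel_spectrogram(audio_norm)            # (1, n_mel_channels, T)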

pneumoman commented 3 years ago

Seems better.