NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Optimal audio settings for training dataset preparation? #14

Open AndroYD84 opened 4 years ago

AndroYD84 commented 4 years ago

Previously I trained on a small dataset where the speaker was recorded in a single session (so the volume level and quality never changed for its entire duration), and the resulting model sounded promising (it was just for testing). So I then trained on a larger dataset where the speaker was recorded in multiple sessions in different places and on different equipment (so the volume levels and quality varied), and the resulting model was a disaster: the volume was insanely high, enough to ruin your eardrums if you kept your earphones on, with clipping and screeching almost all the time. However, none of the training data sounded too loud or strange at all; the levels were normal to the ear and never hit the red bar. So I guess all these audio files need to be normalized/processed first, e.g. brought to the same target dB level, or things can go horribly wrong. What parameters do you suggest for preparing the audio to get the best out of it? Of course, I converted everything to 22050 Hz mono.
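For reference, here is a minimal sketch of the kind of per-clip preprocessing I have in mind, assuming librosa and soundfile are available; the peak level is an arbitrary choice on my part, not a value from this repo:

```python
import librosa
import numpy as np
import soundfile as sf

def normalize_clip(in_path, out_path, sr=22050, target_peak=0.95):
    """Resample to mono 22050 Hz and peak-normalize to a common level."""
    y, _ = librosa.load(in_path, sr=sr, mono=True)  # decode + resample to mono
    peak = np.abs(y).max()
    if peak > 0:
        y *= target_peak / peak  # same maximum amplitude for every clip
    sf.write(out_path, y, sr)
```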

rafaelvalle commented 4 years ago

Share an image of the training and validation loss, the rhythm (alignment map), and the pitch contour (f0) used during inference so that we can investigate further.

AndroYD84 commented 4 years ago

Here's the image of the training and validation loss, the rhythm (alignment map), and the pitch contour (f0) used during inference. The style-transfer results will likely improve given how early in training this is (checkpoint at 35500 iterations; I also have one at 36000, but for some reason it couldn't generate any sound). However, the singing-voice results are insanely loud and clip most of the time, which is strange because I don't remember that happening with the previous dataset I tested. I have shared the entire dataset here if you would like to check, thanks! P.S.: Now that I think about it, I'm using only upper-case letters in the transcriptions; could this affect the results compared to lower-case?
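In case the casing matters, a minimal sketch of how one could lower-case the transcripts, assuming a pipe-separated filelist in a path|text|speaker_id format (the exact format is an assumption, and if the configured text cleaner already lower-cases its input this may be a non-issue):

```python
def lowercase_filelist(in_path, out_path):
    """Lower-case the transcript column of an assumed path|text|... filelist."""
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.rstrip("\n").split("|")
            if len(parts) >= 2:
                parts[1] = parts[1].lower()  # transcript column only
            fout.write("|".join(parts) + "\n")
```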

rafaelvalle commented 4 years ago

Set the smoothing factor of your validation loss to 0 and zoom in so that we can take a look at the curve. My suspicion is that you're not picking the model with the lowest validation loss.
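As a sketch, one way to find that checkpoint programmatically is to scan the TensorBoard event log, assuming the validation loss is logged under a tag like "validation.loss" (the actual tag name may differ):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Scan the TensorBoard log for the step with the lowest validation loss.
acc = EventAccumulator("logdir")  # path to the training log directory
acc.Reload()
events = acc.Scalars("validation.loss")  # assumed tag name
best = min(events, key=lambda e: e.value)
print(f"lowest validation loss {best.value:.4f} at step {best.step}")
```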

Also, would love to hear singing samples from donald trump if you have them :-)

AndroYD84 commented 4 years ago

I went for a GIF, hope it's clearer now: No zoom: https://i.imgur.com/NsMzzTO.gif With zoom: https://i.imgur.com/HmkV5X7.gif

rafaelvalle commented 4 years ago

Just zoom in on the long flat stretch after the big drop so that we can see the variation along that line.

AndroYD84 commented 4 years ago

Alright, I hope it's better now: https://i.imgur.com/VRHpvd9.png Oh, and I'll definitely share some Trump samples as soon as I get decent results; the ones I made so far weren't worth keeping. In the past I synthesized some songs with Trump's voice, but I used a completely different method from the one adopted here, so I'm not sure this is the place to share them since they're unrelated to this repo. Perhaps your email?

rafaelvalle commented 4 years ago

The training and validation curves, alignment, and other things look fine. Your data seems to contain audio from multiple speakers, yet you're training it as a single speaker; this might be the source of the problem. You can also train further and check again later.
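A minimal sketch of one way to act on this, assuming a path|text|speaker_id filelist: give each recording session (approximated here by source directory) its own speaker id instead of forcing everything through a single speaker. Both the format and the helper are assumptions, not this repo's exact tooling:

```python
import os

def retag_speakers(in_list, out_list):
    """Assign a distinct speaker id per source directory (assumed proxy
    for recording session) in an assumed path|text|speaker_id filelist."""
    ids = {}
    with open(in_list, encoding="utf-8") as fin, \
         open(out_list, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.rstrip("\n").split("|")
            session = os.path.dirname(parts[0])
            sid = ids.setdefault(session, len(ids))  # new id per new session
            fout.write(f"{parts[0]}|{parts[1]}|{sid}\n")
```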

daxiangpanda commented 4 years ago

Can you share your batch_size and how many steps it took to get this good result? @AndroYD84

AndroYD84 commented 4 years ago

@daxiangpanda I haven't changed any parameters other than pointing my own dataset and training with the LibriTTS pretrained model provided. However this time I'm using a new dataset (4750 files, total duration 2h 55m) as the old one had a few transcription errors, I manually checked that the audio transcriptions were perfect this time. The super loud screech I experienced before was still present at the beginning, but disappeared the more I kept training, I'm just beginning to get listenable results at 235000 iters.
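For anyone replicating this, a rough sketch of warm-starting from a pretrained checkpoint while skipping ignored or shape-mismatched layers (e.g. the speaker embedding when speaker counts differ). The "state_dict" key and the ignore list are assumptions based on how Tacotron 2-style checkpoints are commonly saved, not a guarantee about this repo's format:

```python
import torch

def warm_start(model, checkpoint_path, ignore=("speaker_embedding",)):
    """Copy matching weights from a checkpoint; skip ignored or
    shape-mismatched layers so a new dataset/speaker set can be trained."""
    state = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
    own = model.state_dict()
    for name, tensor in state.items():
        if any(key in name for key in ignore):
            continue  # intentionally not transferred
        if name in own and own[name].shape == tensor.shape:
            own[name] = tensor
    model.load_state_dict(own)
    return model
```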

rafaelvalle commented 4 years ago

Closing due to inactivity.