NVIDIA / DeepLearningExamples

State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.

Question on Speech synthesis models #1010

Status: Open · Jcwscience opened this issue 2 years ago

Jcwscience commented 2 years ago

I've been trying to set up a speech model on an Xavier NX, and I've been able to get Tacotron2/WaveGlow running; however, the models use quite a lot of memory. I've been looking to use an alternative, but I'm not sure which one.

The main thing is that I would like to train it on my voice. Right now I have maybe 30 minutes of labeled audio that I used to fine-tune the Tacotron2 model, and I intend to gather several hours in total, but even then I need a model that can be warm-started. Can FastPitch or FastSpeech be retrained or fine-tuned from an existing model with the data I have recorded? And which would be lighter on resources?

Thanks in advance!

alancucki commented 2 years ago

Hi @Jcwscience ,

preliminary experiments suggest that 30 minutes should be enough to fine-tune. It will certainly require some fiddling to get the parameters right; for instance, you might need to adjust the pitch mean/std normalization constants.
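For illustration, recomputing those constants over your own data could look roughly like the sketch below; it assumes the extracted pitch files are saved as `.pt` tensors and that unvoiced frames are stored as zeros (adapt to the actual layout):

```python
import glob
import torch

# Sketch: estimate pitch mean/std over voiced frames only,
# assuming each .pt file holds a pitch tensor with zeros
# marking unvoiced frames.
values = []
for path in glob.glob('pitch/*.pt'):
    pitch = torch.load(path).float().flatten()
    values.append(pitch[pitch > 0.0])  # keep voiced frames

all_pitch = torch.cat(values)
print('pitch mean:', all_pitch.mean().item())
print('pitch std:', all_pitch.std().item())
```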

Also, have a look at https://github.com/NVIDIA/DeepLearningExamples/issues/1004.

FastPitch and FastSpeech 2 should be similar in terms of speed and quality; at this point, it all comes down to implementation and training-recipe details. For FastPitch, it seems that coarse pitch averaging is simply easier to train. I wouldn't recommend FastSpeech 1, as it suffers from pitch mode collapse.

Jcwscience commented 2 years ago

@alancucki Thanks for getting back to me so quickly! I have one more question: I'm a little confused about how I can start from a pre-trained model. All I see for FastPitch is starting from the base dataset, and I just don't have that kind of power, what with the global GPU shortage.

Jcwscience commented 2 years ago

And one more thing (I'm fairly new to this, unfortunately): the data I used before for the Tacotron2 model was essentially this:

```
/wherever/wavs/sentence1.wav|This is a test! … etc.
```

Do I need to change anything for FastPitch?

alancucki commented 2 years ago

Not so quick this time, sorry! :)

To fine-tune a pre-trained model, you could use the `--checkpoint-path` flag to load model weights. Note that this will also resume the optimizer with the old learning rate and statistics.
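For instance, a run could look roughly like this; `--checkpoint-path` is the flag in question, while the remaining flags and paths are illustrative only:

```bash
# Sketch of a fine-tuning invocation; check
# `python train.py --help` for the actual set of options.
python train.py \
    --checkpoint-path pretrained_fastpitch.pt \
    --output ./finetuned_model
```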

A better solution would be to add a couple of lines of code to `train.py` and load the weights manually:

```python
import torch

# Strip the 'module.' prefix that DistributedDataParallel
# prepends to parameter names; works for wrapped or bare models.
checkpoint = torch.load(filepath, map_location='cpu')
sd = {k.replace('module.', ''): v
      for k, v in checkpoint['state_dict'].items()}
getattr(model, 'module', model).load_state_dict(sd)
```

As for filelists, the format is almost the same:

```
wavs/LJ016-0288.wav|pitch/LJ016-0288.pt|"Müller, Müller, He's the man," till a diversion was created by the appearance of the gallows, which was received with continuous yells.
wavs/LJ028-0275.wav|pitch/LJ028-0275.pt|At last, in the twentieth month,
```

You're gonna need the pitch matrices; they are extractable with `prepare_dataset.py`. More details are in the FastPitch README.
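Something along these lines, as a sketch; flag names can differ between versions of the repo, so double-check against the script's `--help`:

```bash
# Illustrative paths; only the pitch-extraction step is shown.
python prepare_dataset.py \
    --wav-text-filelists filelists/my_voice_filelist.txt \
    --dataset-path ./my_dataset \
    --extract-pitch
```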

Jcwscience commented 2 years ago

@alancucki Ahhhh, I had no idea what to do with the generated pitches. The README just says to update the filelist with the pitches, and in my head I couldn't for the life of me figure out what that meant. Do I also need to add a speaker number to the end, or is it optional since there is only one?

alancucki commented 2 years ago

Thanks! That is valuable feedback; I need to make it clearer in the README.

You'd need speaker IDs only if you have more than one speaker; otherwise they're optional.
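With multiple speakers, the ID would go in an extra final field, e.g. (illustrative line, reusing your example path):

```
wavs/sentence1.wav|pitch/sentence1.pt|This is a test!|0
```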

Jcwscience commented 2 years ago

Ok, so I have everything lined up in the filelists, and I modified the train.py script as per your suggestion; however, I am now getting several errors about missing model keys.


```
RuntimeError: Error(s) in loading state_dict for FastPitch:
    Missing key(s) in state_dict: "attention.query_proj.0.conv.weight", "attention.query_proj.0.conv.bias", "attention.query_proj.2.conv.weight", "attention.query_proj.2.conv.bias", "attention.query_proj.4.conv.weight", "attention.query_proj.4.conv.bias", "attention.attn_proj.weight", "attention.attn_proj.bias", "attention.key_proj.0.conv.weight", "attention.key_proj.0.conv.bias", "attention.key_proj.2.conv.weight", "attention.key_proj.2.conv.bias".
```
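For reference, a generic PyTorch workaround (not specific to this repo) to see exactly which keys mismatch is to load non-strictly:

```python
# Diagnostic only: load the matching keys, report the rest.
incompatible = getattr(model, 'module', model).load_state_dict(
    sd, strict=False)
print('missing keys:', incompatible.missing_keys)
print('unexpected keys:', incompatible.unexpected_keys)
```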
Jcwscience commented 2 years ago

@alancucki Is there a specific checkpoint I should be using? I have one from the NVIDIA model repo that I'm trying to use, but I'm not sure if it's the correct one.