Open Jcwscience opened 2 years ago
Hi @Jcwscience ,
preliminary experiments suggest that 30 minutes should be enough to fine-tune. That would certainly require some fiddling to get the parameters right, for instance, you might need to adjust pitch mean/std normalization constants.
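If you do end up re-estimating those constants, here is a minimal sketch of computing a pitch mean/std over voiced frames. In practice you would load the pitch tensors that prepare_dataset.py writes out and concatenate their frames; plain lists are used here just to keep the sketch self-contained, and the function name is my own, not from the repo:

```python
# Sketch: estimate pitch normalization constants (mean/std) from pitch tracks.
# Zeros are treated as unvoiced frames and excluded, which is the usual
# convention for F0 tracks; adapt as needed for your extractor's output.

def pitch_stats(pitch_tracks):
    """Return (mean, std) over voiced (non-zero) frames of all tracks."""
    voiced = [f for track in pitch_tracks for f in track if f > 0.0]
    n = len(voiced)
    mean = sum(voiced) / n
    var = sum((f - mean) ** 2 for f in voiced) / n
    return mean, var ** 0.5

# Toy pitch tracks in Hz; zeros mark unvoiced frames.
mean, std = pitch_stats([[0.0, 110.0, 112.0], [108.0, 0.0, 114.0]])
```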
Also, have a look at: https://github.com/NVIDIA/DeepLearningExamples/issues/1004 .
FastPitch or FastSpeech 2 should be similar in terms of speed and quality; at this point, it all comes down to implementation and training recipe details. For FastPitch, it seems like coarse pitch averaging is just easier to train. I wouldn't recommend FastSpeech 1, as it suffers from pitch mode collapse.
@alancucki Thanks for getting back to me so quickly! I have one more question: I'm a little confused about how to use a pre-trained model as a starting point. All I see for FastPitch is training from the base dataset, and I just don't have that kind of power, what with the global GPU shortage.
And one more thing (I'm fairly new to this, unfortunately): the data I used before for the Tacotron2 model was essentially this
/wherever/wavs/sentence1.wav|This is a test! … etc.
Do I need to change anything for fast pitch?
Not so quick this time, sorry! :)
To fine-tune a pre-trained model, you could use the --checkpoint-path flag to load model weights. Note that this will also resume the optimizer with the old learning rate and statistics. A better solution would be to add a couple of lines of code to train.py and load the weights manually:
import torch

# Load the checkpoint onto the CPU and strip the 'module.' prefix that
# DistributedDataParallel adds to parameter names before loading.
checkpoint = torch.load(filepath, map_location='cpu')
sd = {k.replace('module.', ''): v
      for k, v in checkpoint['state_dict'].items()}
getattr(model, 'module', model).load_state_dict(sd)
As for filelists, the format is almost the same:
wavs/LJ016-0288.wav|pitch/LJ016-0288.pt|"Müller, Müller, He's the man," till a diversion was created by the appearance of the gallows, which was received with continuous yells.
wavs/LJ028-0275.wav|pitch/LJ028-0275.pt|At last, in the twentieth month,
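Converting an existing Tacotron2-style "wav|text" filelist to this format is mostly string surgery; a small sketch, assuming the pitch tensors were saved as pitch/&lt;basename&gt;.pt by prepare_dataset.py (the function name and the pitch directory default are my own):

```python
# Sketch: add a pitch-file column to a Tacotron2-style "wav|text" filelist line.
import os

def add_pitch_column(line, pitch_dir="pitch"):
    # Split only on the first '|' so the transcript may itself contain '|'.
    wav_path, text = line.rstrip("\n").split("|", 1)
    base = os.path.splitext(os.path.basename(wav_path))[0]
    pitch_path = os.path.join(pitch_dir, base + ".pt")
    return f"{wav_path}|{pitch_path}|{text}"

converted = add_pitch_column("wavs/sentence1.wav|This is a test!")
```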
You're gonna need pitch matrices - they can be extracted with prepare_dataset.py. More details are in this README section.
@alancucki Ahhhh, I had no idea what to do with the generated pitches. The README just says to update the filelist with the pitches, and I couldn't for the life of me figure out what that meant. Do I also need to add a speaker number to the end, or is it optional since there is only one?
Thanks! That is valuable feedback, I need to make it clear in the README.
You'd need speaker IDs only if you have more than one speaker; otherwise, they're optional.
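For a multi-speaker filelist, the speaker number goes on the end as one more |-separated field; a tiny sketch (the helper name is my own, and the trailing-field position is per the "add it to the end" convention discussed above):

```python
# Sketch: append a speaker ID as the last '|'-separated field for
# multi-speaker training; single-speaker filelists can omit it entirely.
def with_speaker(line, speaker_id):
    return f"{line}|{speaker_id}"

line = with_speaker(
    "wavs/LJ028-0275.wav|pitch/LJ028-0275.pt|At last, in the twentieth month,", 0)
```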
Ok, so I have everything lined up in the filelists, and I modified the train.py script as per your suggestion; however, I am now getting several errors about the model keys.
RuntimeError: Error(s) in loading state_dict for FastPitch:
Missing key(s) in state_dict: "attention.query_proj.0.conv.weight", "attention.query_proj.0.conv.bias", "attention.query_proj.2.conv.weight", "attention.query_proj.2.conv.bias", "attention.query_proj.4.conv.weight", "attention.query_proj.4.conv.bias", "attention.attn_proj.weight", "attention.attn_proj.bias", "attention.key_proj.0.conv.weight", "attention.key_proj.0.conv.bias", "attention.key_proj.2.conv.weight", "attention.key_proj.2.conv.bias".
@alancucki Is there a specific checkpoint I should be using? I have one from the NVIDIA model repo that I'm trying to use, but I'm not sure if it's the correct one.
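A note on the error above: missing attention.* keys usually mean the checkpoint was saved from a model built with different options (or from a different model entirely). As a diagnostic sketch, you can diff the key sets before loading; plain dicts stand in for state_dicts here, and with torch you would compare model.state_dict() against checkpoint['state_dict'] and can pass strict=False to load_state_dict to tolerate the mismatch (at the cost of those modules starting untrained):

```python
# Sketch: diff checkpoint keys against model keys to see what is
# missing or unexpected before calling load_state_dict.

def diff_keys(model_sd, ckpt_sd):
    missing = sorted(set(model_sd) - set(ckpt_sd))      # in model, not in ckpt
    unexpected = sorted(set(ckpt_sd) - set(model_sd))   # in ckpt, not in model
    return missing, unexpected

# Toy state_dicts: the checkpoint carries a DataParallel 'module.' prefix
# and lacks the attention weights.
model_keys = {"encoder.weight": None, "attention.attn_proj.weight": None}
ckpt_keys = {"module.encoder.weight": None}
stripped = {k.replace("module.", "", 1): v for k, v in ckpt_keys.items()}
missing, unexpected = diff_keys(model_keys, stripped)
```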
I've been trying to set up a speech model on an Xavier NX, and I've been able to get Tacotron2/WaveGlow running; however, the size of the models uses quite a lot of memory. I've been looking for an alternative, but I'm not sure which one.
The main thing is that I would like to train it on my own voice. Right now I have maybe 30 minutes of labeled audio that I used to fine-tune the Tacotron2 model, and I intend to gather several hours in total, but even with this I need a model that can be warm-started. Can FastPitch or FastSpeech be retrained or fine-tuned from an existing model with the data I have recorded? And which would be lighter on resources?
Thanks in advance!