DatGuy1 opened 4 years ago
I run python3 generate_mels.py
https://github.com/CookiePPP/codedump/blob/master/tacotron2-PPP-1.3.0/generate_mels.py#L85
It runs through the train and validation filelists and outputs .npy files next to every .wav in the source dataset.
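Roughly, it's doing something like this (a librosa-based sketch of the idea, not the script's exact code; the STFT params and the exact .npy filename are assumptions):
import numpy as np
import librosa

FILELISTS = ["filelists/train.txt", "filelists/validation.txt"]  # placeholder paths

for filelist in FILELISTS:
    with open(filelist, encoding="utf-8") as f:
        rows = [line.strip().split("|") for line in f if line.strip()]
    for audio_path, *_ in rows:                      # assumed format: path|text|speaker_id
        wav, sr = librosa.load(audio_path, sr=None)  # keep the file's native sampling rate
        mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                             hop_length=256, n_mels=80)
        np.save(audio_path + ".npy", mel)            # the .npy lands next to the .wav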
Unless I misunderstood you, I'm more referring to https://github.com/CookiePPP/codedump/blob/master/TacotronPAG/data_utils.py#L260. Or are you saying that the alignments are the same as the mel spectrograms?
Ah sorry, those are made by another teacher model (Mellotron / Tacotron2).
https://colab.research.google.com/drive/1jdHhXP38xk1IcfCsl3PvZjsx390-pLOk?usp=sharing
The generate_alignments() function was used.
I haven't yet added a way of doing this outside of the notebook, and Mellotron support was kinda put on hold while I'm distracted (see the experimental branch for whatever I'm focused on at any given time).
Thanks, got that to work. Just wondering, what would you say are the biggest differences between your versions and NVIDIA's? Something like less training data needed, faster convergence, output closer to ground truth?
@DatGuy1 Anything specific? I do this for fun and experimentation, and I've tested out a decent few bits.
I've made models with a little above 1000 pairs that have worked well. However, I have 600 lines from a video game character. I'm hoping that if the PAG attention paper was right, I could synthesize intelligible speech. Have you created any models that sound decent enough with ~500 lines as training data?
https://408b917a210f.ngrok.io/ I can make it sound good with as little as 50 seconds of data (see Derpy or 50% of the other voices).
The WaveGlow still needs work, but that's for later.
Multispeaker architectures are much higher quality and easier to create than single speaker with little data.
Damn, that's pretty good.
Multispeaker architectures are much higher quality and easier to create than single speaker with little data.
You mean with Mellotron instead of Tacotron? How did you train the voices on that website? Your TacotronPPP code?
Aye. It's very messy code but it's performing better than PAG on each speaker.
Do you perhaps have an IPython notebook for PPP like the PAG one above?
For training? The datasets are local so I haven't moved anything else online yet. I can do it if you need, though it'll take a bit.
Yeah, or maybe just a simple tutorial on what to do, e.g. generate alignments and mels for audio -> First run script x -> then run script y -> then start train.py, etc.
I'll try and set that up in a couple of days
edit 08/05/20: I'm having lots of issues with gradient overflows on my version of Dynamic Convolution Attention so I might not start on the guide for a bit.
Thanks. If you want, it doesn't have to be an IPython notebook, just a list of basic steps.
@DatGuy1 Do you have a copy of your dataset or an example of the structure?
This notebook: https://colab.research.google.com/drive/1IsOD3AOrZJyQmtdNaef9y3eNLhk8yf3F?usp=sharing will be filled out once I can figure out the best way to process datasets from Google Drive. (The preprocessing code is all on my machine already, so most of the work is done; I'm just not sure how to handle unique datasets.)
Seems great. Two questions: 1. When training, what should I use for group name, forcing warm start, and rank if anything? 2. How do I generate speech? Do I initialise T2S and then call infer()?
@DatGuy1
1. I'd set group_name to "group_name", warm_start_force to True (it just ignores the optimizer and any layers that have changed), and rank to 0 (if you only use a single GPU).
2. I'll see about providing something else, but for now you could use app.py, assuming you update
"speaker_ids_file": "H:/ClipperDatasetV2/filelists/speaker_ids.txt",
to where you need, and the model paths too.
The notebook will have default hparams added at some point. This is meant to go in /mlp/, so the notebook is somewhat targeted towards a no-modifications-required setup, letting the non-coders do some experimentation (we'll see how that pans out).
Do you have a copy of your dataset or an example of the structure?
It's 800 lines of Stephen Merchant as Wheatley in Portal 2. Useful since subtitles already exist as transcripts.
@DatGuy1 I'm more referring to where the transcripts and audio files are located.
Not sure I understand. It's like in Tacotron2, with a filelist that points to the audio files.
@DatGuy1 Alright, that should be fine. I'm familiar with more annoying formats for data :smile:
I've gotten app.py to work, but I'm wondering how multispeaker mode works.
Also, are there any big differences between the various WaveGlow and TorchMoji models?
@DatGuy1 WaveGlow converts spectrograms to sound. TorchMoji predicts emotion from text (which seems to improve Tacotron2 performance with the My Little Pony dataset).
"Multispeaker mode" on the webpage is just how the speakers are selected when inputting large segments of text. It doesn't do much right now.
I'll work on the training notebook again another day. I'm done with typing for today~
I understood what they do, but I don't understand the effects of each specific one, e.g. Postnet/Prenet, the differing number of steps, etc.
Take your time! There's no rush
I've been trying to train a little without TorchMoji, but it seems to use a very large amount of memory. With fp16, a batch size of 1, pregenerated mel spectrograms, and a 22050Hz sampling rate, it's still using 14GB of memory.
Scratch that. I pregenerated the mels but wasn't actually using them. I just have a few more questions:
Also, I'm not sure if the filelists should be in ARPABET or not.
How do I implement TorchMoji like you did?
Save the TorchMoji hidden state to a .npy file. Save it with the same name as the audio file, with "_" added before the file extension, and save it to the same directory.
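In other words, the naming convention works out like this (a tiny sketch; save_torchmoji_state and the example path are hypothetical names):
import os
import numpy as np

def save_torchmoji_state(audio_path, hidden_state):
    # "clips/line_0001.wav" -> "clips/line_0001_.npy", same directory
    root, _ext = os.path.splitext(audio_path)
    np.save(root + "_.npy", hidden_state)

# save_torchmoji_state("clips/line_0001.wav", hidden)  # hidden: the TorchMoji hidden state as an ndarray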
When I warm start, do I use warm_start_force or just warm_start? E.g. if I want to do something like add a voice to the model.
warm_start_force is just an automatic warm_start. It'll reset any layers that don't match between the checkpoint and the current model. So yes, I would use warm_start_force when changing the maximum number of speakers and just let it reset the layers that need to be reset.
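In code terms it boils down to something like this (a rough sketch, not the repo's actual loading code; the "state_dict" key name is an assumption):
import torch

def warm_start_force_load(model, checkpoint_path):
    ckpt = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
    own = model.state_dict()
    # keep only tensors whose name and shape match the current model
    matched = {k: v for k, v in ckpt.items() if k in own and own[k].shape == v.shape}
    own.update(matched)
    model.load_state_dict(own)
    # everything else (e.g. a resized speaker embedding) keeps its fresh initialisation
    return model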
Which pretrained model do I use for the warm start?
I don't provide any models (or have any fully trained). I intend to move over to something similar to Flow-TTS as soon as I can figure out the code for it.
How many steps should I train a speaker?
Till val_loss and validation.average_max_attention_weight on TensorBoard stop decreasing and increasing, respectively.
Decrease the learning rate once both stop improving and continue till bored.
Do I need to train WaveGlow as well?
I can add support for NVIDIA's pretrained 22kHz models quite easily (though make sure that the Tacotron2 params match the WaveGlow, of course). Otherwise, you may prefer to train your own.
I should also look into conversion between PaddlePaddle and PyTorch weights. My WaveFlow code follows the same style as the PaddlePaddle one, so their pretrained weights should be compatible with mine.
I'm not sure if the filelists should be in ARPABET or not
I use both at the same time. :man_shrugging:
I'm making a mess of my dataset processing at the moment so sorry if training can't be replicated.
Save the TorchMoji hidden state to a .npy file. Save it with the same name as the audio file, with "_" added before the file extension, and save it to the same directory.
You mean run this through the file list?
I don't provide any models (or have any fully trained). I intend to move over to something similar to Flow-TTS as soon as I can figure out the code for it.
When I use warm_start_force, I warm start off something, right? Do I warm start it off the one trained to 188k steps?
You mean run this through the file list?
This is the code that was used initially. Drop it somewhere in the torchMoji package.
https://gist.github.com/CookiePPP/29aa720e78e7f8038ee0153027926238
Update the lines below with your filelists and it might work.
INPUT_PATHS = [
    '/media/cookie/Samsung 860 QVO/ClipperDatasetV2/filelists/train_taca2.txt',
    '/media/cookie/Samsung 860 QVO/ClipperDatasetV2/filelists/validation_taca2.txt',
]
Do I warm start it off the one trained to 188k steps
Sure, that'd be fine.
I should also look into conversion between PaddlePaddle and PyTorch weights. My WaveFlow code follows the same style as the PaddlePaddle one, so their pretrained weights should be compatible with mine.
I tried downloading their pretrained 128-channel WaveFlow model, but it's in their own .pdparams format and I'm not sure how to convert it to your weights. Also, if I'd like to train my own model, I'm guessing I should use your waveglow_latest directory?
warm_start_force is just an automatic warm_start. It'll reset any layers that don't match between the checkpoint and the current model. So yes, I would use warm_start_force when changing the maximum number of speakers and just let it reset the layers that need to be reset.
Hmm, doesn't that reset all the older speakers as well?
I trained a speaker for 2.5k steps with a batch size of 26, and so far it's low pitched and unintelligible. I'm not sure if I messed something up and something clearer should play or if I should train it more. Thoughts?
@DatGuy1
Hmm, doesn't that reset all the older speakers as well?
https://github.com/CookiePPP/codedump/blob/master/tacotron2-PPP-1.3.0/hparams.py#L102
n_speakers=512,
This hparam decides how large the embedding layer is, and that in turn decides how many speakers the model can use at a time.
If you have 512 or fewer speakers then yes, you don't need to change anything and the weights will not change.
Hmm, doesn't that reset all the older speakers as well?
Yes, even without resetting the layers/changing the weights. The original code that maps external speaker IDs to the internal 0 -> 511 indexes of the embedding layer is crap.
def create_speaker_lookup_table(self, audiopaths_and_text):
    speaker_ids = np.sort(np.unique([x[2] for x in audiopaths_and_text]))
    d = {int(speaker_ids[i]): i for i in range(len(speaker_ids))}
    return d
The IDs are sorted as strings (lexicographically) before being assigned, e.g. speaker IDs
0,1,2,3,4,5,6,7,8,9,10
are sorted to
0,1,10,2,3,4,5,6,7,8,9
and the internal indexes then count up from left to right: the leftmost ID gets the first slice of the embedding, and so on.
So if you added another speaker as ID 11, every ID that sorts after it gets misaligned (as shown below):
0,1,2,3,4,5,6,7,8,9,10,11
is sorted to
0,1,10,11,2,3,4,5,6,7,8,9
This means adding another ID will (likely) require retraining most of the speaker embedding layer anyway.
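You can see the shift by running the same lookup logic as above on string IDs (which is how they arrive from the filelist); a quick demo:
import numpy as np

def lookup(ids):
    ids = np.sort(np.unique(ids))
    return {int(ids[i]): i for i in range(len(ids))}

print(lookup([str(i) for i in range(11)]))   # speakers 0..10
# {0: 0, 1: 1, 10: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10}
print(lookup([str(i) for i in range(12)]))   # add speaker 11
# {0: 0, 1: 1, 10: 2, 11: 3, 2: 4, ...}  -> speakers 2 through 9 each shift down one slot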
(I'd like to remove this and just have the speaker_ids in the filelists translate directly to the internal ones, but now I've got old models getting in the way. I hope to fix this when switching over to Flow-TTS; we'll see what happens then.)
I trained a speaker for 2.5k steps with a batch size of 26, and so far it's low pitched and unintelligible. I'm not sure if I messed something up and something clearer should play or if I should train it more. Thoughts?
Not sure.
I tried downloading their pretrained 128 channel WaveFlow model, but it's in their own .pdparams format and I'm not sure how to convert it to your weights. Also, if I'd like to train my own model, I'm guessing I should use your waveglow_latest directory?
That's the one. I don't think I've got any up-to-date configs uploaded yet so nag me if you get onto that. Also, WaveFlow seems to run much slower than the claims in the paper. I'm not sure why so I don't really recommend using it over other solutions right now. I'd like to try running WaveFlow inference with TorchScript to get the compiler optimizations and see how performance changes, but that's got to happen later when I'm not focused on other bits. (or to be specific, at this exact moment I'm waiting on datasets to download :watch: )
Edit:
If you wanted to add another speaker without shifting the existing speaker IDs, you could try 999 and just conform to the alphabetical nature of the current system. I wouldn't call it a solution, but it works if you really need to keep the speaker IDs aligned.
I can't test the PaddlePaddle model for mel-to-wave since running torch.load on the .pdparams file fails. I'm guessing I need to convert it?
@DatGuy1 Not sure. Never tried it. :man_shrugging:
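If anyone wants to experiment, the rough shape of a conversion might look something like the sketch below (completely untested; it assumes paddle.load can read the .pdparams file, the filename is a placeholder, and the parameter names would still need to be mapped by hand onto the PyTorch module's names):
import numpy as np
import paddle
import torch

pd_state = paddle.load("waveflow_res128.pdparams")       # placeholder filename
torch_state = {}
for name, value in pd_state.items():
    array = value.numpy() if hasattr(value, "numpy") else np.asarray(value)
    torch_state[name] = torch.from_numpy(array)           # keys likely still need renaming
torch.save(torch_state, "waveflow_res128_converted.pt")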
Hmm, how large of an impact do you think using a random speaker from one of your WaveGlow models would have? I'm not sure if the issue with my speaker is a bad mel-generating model or a bad mel-to-wave model.
Hmm, how large of an impact do you think using a random speaker from one of your WaveGlow models would have?
I think it'd be a small difference. Adding the speaker ids made very little difference (at least within the first day~ish of training).
In the datasets for your speakers with a small amount of data, how was your filelist formatted? Was it just file|text|id? Did the text have any start/EOS tokens?
file|text|id, with no start/end tokens.
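So a filelist line would look something like this (hypothetical paths, text, and speaker ID):
wavs/wheatley_0001.wav|Hello! Oh, brilliant, you're alive!|297
wavs/wheatley_0002.wav|Alright, don't panic. I have got an idea.|297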
And do you recall what your average max attention weight and loss for validation was?
average max attention weight: ~0.72
val_loss: somewhere around 0.25. It really depends on how much drop_frame_rate you use (higher will increase the loss/make the spectrogram more blurred, but increases stability up to a point).
https://github.com/CookiePPP/codedump/blob/master/tacotron2-PPP-1.3.0/run_every_epoch.py#L8
Two things:
1. I'm very noobish with the learning rate. My average max attention weight went from 0.58 to 0.6 in 5k steps, and validation loss has actually gone up from 2 to 2.6. I'm assuming those numbers are wrong. My starting learning_rate is 0.1e-5, the default, which I'm not sure why I didn't change. Weight decay and run_every_epoch.py are also at their defaults. What should I change the starting learning rate and params in run_every_epoch to?
2. When I open the inferred audio in an audio editing program and manually raise the pitch by ~100%, it sounds like it could be decent, but I can't tell due to the distortion. The only thing I can think of that could affect it is that in my dataset the audio files are 22050Hz while yours are 44100Hz, but I changed that in the hparams. Anything I'm missing?
@DatGuy1
What should I change the starting learning rate and params in run_every_epoch to?
run_every_epoch will override anything in hparams. It runs 'live', and I manually adjust the learning rate using it.
https://www.desmos.com/calculator/x6fkjjnhut
decay_start, A_, B_ and C_ decide the learning rate and can be messed with using the link above, where x = iteration and y = learning_rate.
The learning rate should start around 1e-3 and decrease to 1e-5 between 100,000 and 300,000 iterations.
For example, a run_every_epoch like the one below would work fine:
current_iteration = iteration
decay_start = 300000
if current_iteration < 100000:
    A_ = 100e-5
elif current_iteration < 150000:
    A_ = 50e-5
elif current_iteration < 200000:
    A_ = 20e-5
elif current_iteration < 250000:
    A_ = 10e-5
elif current_iteration < 300000:
    A_ = 5e-5
else:
    A_ = 5e-5
B_ = 30000
C_ = 0e-5
min_learning_rate = 1e-6
epochs_between_updates = 1
drop_frame_rate = min(0.000010 * max(current_iteration - 5000, 0), 0.2)  # linearly increase DFR from 0.0 to 0.2 between iterations 5000 and 25000
p_teacher_forcing = 0.95
teacher_force_till = 0
val_p_teacher_forcing = 0.80
val_teacher_force_till = 30
grad_clip_thresh = 1.5
Anything I'm missing? https://github.com/NVIDIA/tacotron2/blob/master/hparams.py#L36-L42
sampling_rate=22050, filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80, mel_fmin=0.0, mel_fmax=8000.0,
Everything other than n_mel_channels will need to be updated for a different sampling rate. I wouldn't recommend learning default params from this repo; it's called codedump because it's where I dump all of my code. :smile:
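As an illustration only (these are not this repo's defaults): if you keep roughly the same window and hop durations in milliseconds, the 22050Hz values above scale to 44100Hz like this:
# 22050Hz (NVIDIA defaults) on the left; 44100Hz equivalents (same ~46ms window, ~11.6ms hop) in the comments
sampling_rate  = 22050    # -> 44100
filter_length  = 1024     # -> 2048
hop_length     = 256      # -> 512
win_length     = 1024     # -> 2048
mel_fmax       = 8000.0   # -> anything up to 22050.0 (the new Nyquist)
n_mel_channels = 80       # -> 80 (unchanged)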
Thanks! You're using numbers in the hundred thousands, but since I'm warm starting, it starts at iteration 0 with ~6/s per iteration. Do I scale your numbers down, e.g. divide them by 100?
@DatGuy1 If you're warm starting from one of the already existing models then you'll need to find your own ideal learning rates. The schedule I showed would be used if training from scratch.
Say you're adding a new speaker with a small dataset. You set the training and validation filelists and generate the mels and TorchMoji emotions. Then do you start training with warm start, or normally? And what would your learning rate be?
Warm start would be faster, and the learning rate would be 100e-5 to start with.
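As a rough sketch of what that means in practice (the flag spellings follow the NVIDIA-style train.py and are an assumption, apart from --warm_start_force which is what this repo uses; the checkpoint name is a placeholder):
# in run_every_epoch.py, start the new speaker at a higher rate, e.g. A_ = 100e-5
python train.py --output_directory=outdir --log_directory=logdir \
                -c pretrained_checkpoint.pt --warm_start_force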
I trained a model to a mere 2k steps with these hparams:
sampling_rate=22050, filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80, mel_fmin=0.0, mel_fmax=8000.0
And tried to generate with the "LargeWaveGlow V3.5" model. As expected, it failed due to my model having 176 channels (?) instead of the expected 256. So I plugged in NVIDIA's pretrained WaveGlow model, and while the output is still unintelligible, the pitch sounds right. However, the alignments I see in TensorBoard are pretty much nothing; before I changed those settings the alignments looked good and were very linear. Basically, I'm wondering if what I changed broke the whole process.
I'm thinking I'll try to remake the dataset with 44100Hz audio and be a guinea pig for your cookietts repo.
@CookiePPP When I try to add a new speaker, it overwrites the previously trained speaker. I think it's something to do with the weird ordering of the speaker lookup table. I started my first voice with --warm_start_force and speaker ID 297, since it was the next free number. That worked well enough. Then I trained another voice with ID 298, but now whenever I try to use the 297 voice, it sounds like a voice from the original model. What do I do?
@DatGuy1 I don't fully understand, but if you want to train multiple new voices you should train them together.
Huh. So you mean take my filelists and merge them together? I thought I would be able to add speakers as time goes on.
How do you generate the .npy alignments from the audio files?