CookiePPP / codedump

Somewhere to dump code

Question about loading alignments #3

Open. DatGuy1 opened this issue 4 years ago

DatGuy1 commented 4 years ago

How do you generate the .npy alignments from the audio files?

CookiePPP commented 4 years ago

I run python3 generate_mels.py (https://github.com/CookiePPP/codedump/blob/master/tacotron2-PPP-1.3.0/generate_mels.py#L85). It runs through the train and validation filelists and outputs a .npy file next to every .wav in the source dataset.
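
Roughly speaking, it does something like the sketch below (not the repo's exact code; the STFT/mel parameters and the output filename are assumed to match the 22,050 Hz Tacotron2 defaults, and the real script's normalisation may differ):

    import os
    import librosa
    import numpy as np

    def dump_mels(filelist_path, sr=22050):
        # Each filelist line is "path/to/audio.wav|transcript|speaker_id".
        with open(filelist_path, encoding='utf-8') as f:
            wav_paths = [line.strip().split('|')[0] for line in f if line.strip()]
        for wav_path in wav_paths:
            wav, _ = librosa.load(wav_path, sr=sr)
            mel = librosa.feature.melspectrogram(
                y=wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024,
                n_mels=80, fmin=0.0, fmax=8000.0)
            mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log-mel, Tacotron2-style
            # Saved next to the source .wav, e.g. clip0001.wav -> clip0001.npy
            np.save(os.path.splitext(wav_path)[0] + '.npy', mel.astype(np.float32))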

DatGuy1 commented 4 years ago

Unless I misunderstood you, I'm more referring to https://github.com/CookiePPP/codedump/blob/master/TacotronPAG/data_utils.py#L260. Or are you saying that the alignments are the same as the mel spectrograms?

CookiePPP commented 4 years ago

Ah sorry, those are made by a separate teacher model (Mellotron / Tacotron2). https://colab.research.google.com/drive/1jdHhXP38xk1IcfCsl3PvZjsx390-pLOk?usp=sharing The generate_alignments() function in that notebook was used.
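
The idea, in rough sketch form (assumptions: an NVIDIA-style Tacotron2/Mellotron whose forward() returns the attention alignments as its last output, a batch already prepared by the training dataloader, and a placeholder output filename):

    import os
    import numpy as np
    import torch

    def save_alignment(teacher_model, batch, wav_path):
        # Teacher-forced forward pass through the trained teacher model; the
        # attention map records where each text token lines up in time.
        teacher_model.eval()
        with torch.no_grad():
            outputs = teacher_model(batch)
            alignments = outputs[-1]  # (batch, decoder_steps, encoder_steps)
        # Placeholder naming scheme: clip0001.wav -> clip0001_alignment.npy
        np.save(os.path.splitext(wav_path)[0] + '_alignment.npy',
                alignments[0].cpu().numpy())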

I haven't yet added a way of doing this outside of the notebook, and Mellotron support was kinda put on hold while I'm distracted (see the experimental branch for whatever I'm focused on at any given time).

DatGuy1 commented 4 years ago

Thanks, got that to work. Just wondering, what would you say are the biggest differences between your versions and nvidia's? Something like less training data needed, faster convergence, more similar to ground truth?

CookiePPP commented 4 years ago

@DatGuy1 Anything in specific? I do this for fun and experimentation, and I've tested out a decent few bits.

DatGuy1 commented 4 years ago

I've made models with a little above 1000 pairs that have worked well. However, I have 600 lines from a video game character. I'm hoping that if the PAG attention paper was right, I could synthesize intelligible speech. Have you created any models that sound decent enough with ~500 lines as training data?

CookiePPP commented 4 years ago

https://408b917a210f.ngrok.io/ I can make it sound good with as little as 50 seconds of data (see Derpy or 50% of the other voices).

The WaveGlow still needs work but that's for later


Multispeaker architectures are much higher quality and easier to create than single speaker with little data.

DatGuy1 commented 4 years ago

Damn, that's pretty good.

Multispeaker architectures are much higher quality and easier to create than single speaker with little data.

You mean with Mellotron instead of Tacotron? How did you train the voices on that website? With your TacotronPPP code?

CookiePPP commented 4 years ago

Aye. It's very messy code but it's performing better than PAG on each speaker.

DatGuy1 commented 4 years ago

Do you perhaps have an IPython notebook for PPP like the PAG one above?

CookiePPP commented 4 years ago

For training? The datasets are local so I haven't moved anything else online yet. I can do it if you need, though it'll take a bit.

DatGuy1 commented 4 years ago

Yeah, or maybe just a simple tutorial on what to do, e.g. generate alignments and mels for audio -> First run script x -> then run script y -> then start train.py, etc.

CookiePPP commented 4 years ago

I'll try and set that up in a couple of days

edit 08/05/20: I'm having lots of issues with gradient overflows on my version of Dynamic Convolution Attention so I might not start on the guide for a bit.

DatGuy1 commented 4 years ago

Thanks. If you want, it doesn't have to be an IPython notebook, just a list of basic steps.

CookiePPP commented 4 years ago

@DatGuy1 Do you have a copy of your dataset or an example of the structure?

This notebook, https://colab.research.google.com/drive/1IsOD3AOrZJyQmtdNaef9y3eNLhk8yf3F?usp=sharing, will be filled out once I can figure out the best way to process datasets from Google Drive. (The preprocessing code is all on my machine already, so most of the work is done. I'm just not sure how to handle unique datasets.)

DatGuy1 commented 4 years ago

Seems great. Two questions:

1. When training, what should I use for group name, forcing warm start, and rank, if anything?
2. How do I generate speech? Do I initialise T2S and then call infer()?

CookiePPP commented 4 years ago

@DatGuy1 I'd set;

2. I'll see about providing something else, but for now you could use app.py, assuming you update "speaker_ids_file": "H:/ClipperDatasetV2/filelists/speaker_ids.txt" to point where you need, and the model paths too.


The notebook will have default hparams added at some point. This is meant to go in /mlp/, so the notebook is somewhat targeted towards a no-modifications-required setup, letting non-coders do some experimentation (we'll see how that pans out).

DatGuy1 commented 4 years ago

Do you have a copy of your dataset or an example of the structure?

It's 800 lines of Stephen Merchant as Wheatley in Portal 2. Useful since subtitles already exist as transcripts.

CookiePPP commented 4 years ago

@DatGuy1 I'm more referring to where the transcripts and audio files are located.

DatGuy1 commented 4 years ago

Not sure I understand. It's like in tacotron2, with a filelist that points to the audio files.

CookiePPP commented 4 years ago

@DatGuy1 Alright, that should be fine. I'm familiar with more annoying formats for data :smile:

DatGuy1 commented 4 years ago

I've gotten app.py to work, but I'm wondering how multispeaker mode works.

Also, are there any big differences between each WaveGlow and Torchmoji models?

CookiePPP commented 4 years ago

@DatGuy1 WaveGlow converts spectrograms to sound. TorchMoji predicts emotion from text (which seems to improve Tacotron2 performance with the My Little Pony dataset).


"Multispeaker mode" on the webpage is just how the speakers are selected when inputting large segments of text. It doesn't do much right now.


I'll work on the training notebook again another day. I'm done with typing for today~

DatGuy1 commented 4 years ago

I understood what they do, but I don't understand the effects of each specific one, e.g. Postnet/Prenet, the differing number of steps, etc.

Take your time! There's no rush

DatGuy1 commented 4 years ago

I've been trying to train a little without TorchMoji, but it seems to use a very large amount of memory. With fp16, a batch size of 1, pregenerated mel spectrograms, and a 22050Hz sampling rate, it's still using 14 GB of memory.

Scratch that. I pregenerated the mels but wasn't actually using them. I just have a few more questions:

  1. How do I implement TorchMoji like you did?
  2. When I warm start, do I use warm_start_force or just warm start? If I want to do something like add a voice to the model.
  3. Which pretrained model do I use for the warm start?
  4. How many steps should I train a speaker?
  5. Do I need to train WaveGlow as well?
DatGuy1 commented 4 years ago

Also, I'm not sure if the filelists should be in ARPABET or not.

CookiePPP commented 4 years ago

How do I implement TorchMoji like you did?

Save the TorchMoji hidden state to a .npy file with the same name as the audio file plus "_" added before the file extension, in the same directory.
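
i.e. something like this (sketch; hidden_state is assumed to be the feature vector you pulled out of torchMoji for that clip's transcript):

    import os
    import numpy as np

    def save_torchmoji_state(audio_path, hidden_state):
        # "clip0001.wav" -> "clip0001_.npy", saved next to the audio file.
        base, _ = os.path.splitext(audio_path)
        np.save(base + '_.npy', np.asarray(hidden_state, dtype=np.float32))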


When I warm start, do I use warm_start_force or just warm start? If I want to do something like add a voice to the model.

warm_start_force is just an automatic warm_start. It'll reset any layers that don't match between the checkpoint and the current model. So yes, I would use warm_start_force when changing the maximum number of speakers and just let it reset the layer that needs to be reset.
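
In other words, something along these lines (a sketch of the idea, not the repo's exact implementation):

    import torch

    def warm_start_force_load(model, checkpoint_path):
        # Copy every weight whose name and shape match the current model;
        # anything that doesn't match (e.g. a resized speaker embedding) is
        # left at its fresh initialisation, i.e. "reset".
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        pretrained = checkpoint.get('state_dict', checkpoint)
        model_dict = model.state_dict()
        matched = {k: v for k, v in pretrained.items()
                   if k in model_dict and v.shape == model_dict[k].shape}
        model_dict.update(matched)
        model.load_state_dict(model_dict)
        return model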


Which pretrained model do I use for the warm start?

I don't provide any models (or have any fully trained). I intend to move over to something similar to Flow-TTS as soon as I can figure out the code for it.


How many steps should I train a speaker?

Until val_loss stops decreasing and validation.average_max_attention_weight stops increasing on Tensorboard. Decrease the learning rate once both stop improving and continue till bored.


Do I need to train WaveGlow as well?

I can add support for Nvidia's pretrained 22kHz models quite easily (though make sure the Tacotron2 params match the WaveGlow, of course). Otherwise, you may prefer to train your own.

I should also look into conversion between PaddlePaddle and Pytorch weights. My WaveFlow code follows the same style as the PaddlePaddle one, so their pretrained weights should be compatible with this one.


I'm not sure if the filelists should be in ARPABET or not

I use both at the same time. :man_shrugging:


I'm making a mess of my dataset processing at the moment so sorry if training can't be replicated.

DatGuy1 commented 4 years ago

Save the TorchMoji hidden state to a .npy file with the same name as the audio file plus "_" added before the file extension, in the same directory.

You mean run this through the file list?

I don't provide any models (or have any fully trained). I intend to move over to something similar to Flow-TTS as soon as I can figure out the code for it.

When I use warm_start_force, I warm start off something, right? Do I warm start it off the model trained to 188k steps?

CookiePPP commented 4 years ago

You mean run this through the file list?

This is the code that was used initially. Drop it somewhere in the torchMoji package.

https://gist.github.com/CookiePPP/29aa720e78e7f8038ee0153027926238

Update the lines below with your filelists and it might work.

INPUT_PATHS = [
    '/media/cookie/Samsung 860 QVO/ClipperDatasetV2/filelists/train_taca2.txt',
    '/media/cookie/Samsung 860 QVO/ClipperDatasetV2/filelists/validation_taca2.txt',
    ]

Do I warm start it off the one trained to 188k steps

Sure, that'd be fine.

DatGuy1 commented 4 years ago

I should also look into conversion between PaddlePaddle and Pytorch weights. My WaveFlow code follows the same style as the PaddlePaddle one, so their pretrained weights should be compatible with this one.

I tried downloading their pretrained 128 channel WaveFlow model, but it's in their own .pdparams format and I'm not sure how to convert it to your weights. Also, if I'd like to train my own model, I'm guessing I should use your waveglow_latest directory?

warm_start_force is just an automatic warm_start. It'll reset any layers that don't match between the checkpoint and the current model. So yes, I would use warm_start_force when changing the maximum number of speakers and just let it reset the layer that needs to be reset.

Hmm, doesn't that reset all the older speakers as well?


I trained a speaker for 2.5k steps with a batch size of 26, and so far it's low pitched and unintelligible. I'm not sure if I messed something up and it should sound clearer by now, or if I should just train it more. Thoughts?

CookiePPP commented 4 years ago

@DatGuy1

Hmm, doesn't that reset all the older speakers as well?

https://github.com/CookiePPP/codedump/blob/master/tacotron2-PPP-1.3.0/hparams.py#L102

        n_speakers=512,

This hparam decides how large the embedding layer is, which in turn decides how many speakers the model can use at a time.

If you only have 512 or fewer speakers then yes, you don't need to change anything and the weights will not change.
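
Conceptually it's just (the embedding_dim here is illustrative, not the repo's actual value):

    import torch.nn as nn

    # One embedding slot per possible speaker; n_speakers caps how many
    # speakers can exist at once, and changing it changes this layer's shape
    # (which is what forces warm_start_force to reset it).
    speaker_embedding = nn.Embedding(num_embeddings=512, embedding_dim=128)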


Hmm, doesn't that reset all the older speakers as well?

Yes, even without resetting the layers/changing the weights. The original code to map from external speaker ids to the internal 0 -> 511 indexes of the embedding layer is crap.

https://github.com/CookiePPP/codedump/blob/7c1e3533c2ad2be7b7a1d781207f6f13d2f636aa/TacotronPAG/data_utils.py#L198-L201

    def create_speaker_lookup_table(self, audiopaths_and_text):
        speaker_ids = np.sort(np.unique([x[2] for x in audiopaths_and_text]))
        d = {int(speaker_ids[i]): i for i in range(len(speaker_ids))}
        return d

The IDs are read from the filelist as strings, so they get sorted lexicographically before being assigned. e.g. speaker IDs of

0,1,2,3,4,5,6,7,8,9,10

are sorted to

0,1,10,2,3,4,5,6,7,8,9

and then the internal indexes go up from left to right. The leftmost gets the first slice of the embedding, and so on.

So if you add another speaker as ID 11, every ID that sorts after it (here 2 through 9) ends up in a different slot (as shown below):

0,1,2,3,4,5,6,7,8,9,10,11

becomes

0,1,10,11,2,3,4,5,6,7,8,9

This means adding another ID will (likely) require retraining most of the speaker embedding layer anyway.
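
You can see the shift by running the lookup logic on its own:

    import numpy as np

    def lookup(speaker_ids):
        # Same logic as create_speaker_lookup_table above; the IDs come out of
        # the filelist as strings, so np.sort orders them lexicographically.
        sorted_ids = np.sort(np.unique(speaker_ids))
        return {int(sorted_ids[i]): i for i in range(len(sorted_ids))}

    print(lookup([str(i) for i in range(11)]))
    # {0: 0, 1: 1, 10: 2, 2: 3, 3: 4, ..., 9: 10}
    print(lookup([str(i) for i in range(12)]))
    # {0: 0, 1: 1, 10: 2, 11: 3, 2: 4, ..., 9: 11}  <- IDs 2 through 9 all move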

(I'd like to remove this and just have the speaker_ids in the filelists translate directly to the internal indexes, but old models are getting in the way now. I hope to fix this when switching over to Flow-TTS; we'll see what happens then.)


I trained a speaker for 2.5k steps with a batch size of 26, and so far it's low pitched and unintelligible. I'm not sure if I messed something up and something clearer should play or if I should train it more. Thoughts?

Not sure.


I tried downloading their pretrained 128 channel WaveFlow model, but it's in their own .pdparams format and I'm not sure how to convert it to your weights. Also, if I'd like to train my own model, I'm guessing I should use your waveglow_latest directory?

That's the one. I don't think I've got any up-to-date configs uploaded yet, so nag me if you get onto that. Also, WaveFlow seems to run much slower than the claims in the paper. I'm not sure why, so I don't really recommend using it over other solutions right now. I'd like to try running WaveFlow inference with TorchScript to get the compiler optimizations and see how performance changes, but that's got to happen later when I'm not focused on other bits (or to be specific, at this exact moment I'm waiting on datasets to download :watch: ).


Edit: If you wanted to add another speaker without shifting the existing speaker IDs, you could give it an ID like 999 that sorts after all the current ones, and just conform to the string-sorted nature of the current system. I wouldn't call it a solution, but it works if you really need to keep the speaker IDs aligned.

DatGuy1 commented 4 years ago

I can't test the PaddlePaddle model for mel-to-wave since running torch.load on the .pdparams file fails. I'm guessing I need to convert it?
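
For reference, the rough conversion I'd imagine looks something like the sketch below (untested; assumes PaddlePaddle 2.x, a flat parameter dict inside the .pdparams, and a placeholder filename):

    import numpy as np
    import paddle  # assumes PaddlePaddle 2.x is installed
    import torch

    pd_state = paddle.load('waveflow_res128.pdparams')  # placeholder filename

    def to_numpy(value):
        # paddle.load may hand back Paddle tensors or plain numpy arrays
        # depending on version, so handle both.
        return value.numpy() if hasattr(value, 'numpy') else np.asarray(value)

    torch_state = {name: torch.from_numpy(to_numpy(value))
                   for name, value in pd_state.items()}
    torch.save(torch_state, 'waveflow_res128_converted.pt')
    # The hard part left out here: matching the parameter names and shapes to
    # this repo's WaveFlow module (and transposing any Linear weights, since
    # Paddle stores them as (in, out) while PyTorch uses (out, in)).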

CookiePPP commented 4 years ago

@DatGuy1 Not sure. Never tried it. :man_shrugging:

DatGuy1 commented 4 years ago

Hmm, how large of an impact do you think using a random speaker from one of your WaveGlow models would have? I'm not sure if the issue with my speaker is in the model that generates the mels or the mel-to-wave model.

CookiePPP commented 4 years ago

Hmm, how large of an impact do you think using a random speaker from one of your WaveGlow models have?

I think it'd be a small difference. Adding the speaker ids made very little difference (at least within the first day~ish of training).

DatGuy1 commented 4 years ago

In the datasets for your speakers with a small amount of data, how was your filelist formatted? Was it just file|text|id? Did the text have any start/EOS tokens?

CookiePPP commented 4 years ago

file|text|id

no start/end tokens.

DatGuy1 commented 4 years ago

And do you recall what your average max attention weight and loss for validation was?

CookiePPP commented 4 years ago

average max attention weight

0.72~

val_loss

Somewhere around 0.25. It really depends on how much drop_frame_rate you use (higher values will increase the loss / make the spectrogram more blurred, but increase stability up to a point).

https://github.com/CookiePPP/codedump/blob/master/tacotron2-PPP-1.3.0/run_every_epoch.py#L8
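
For context, drop_frame_rate refers to the usual "drop frame" trick: during teacher forcing, each ground-truth mel frame is replaced with the dataset's global mean frame with that probability. A minimal sketch (not necessarily this repo's exact implementation):

    import torch

    def drop_frames(mels, global_mean, drop_frame_rate):
        # mels: (batch, n_mel_channels, time); global_mean: (n_mel_channels,)
        # Each frame is independently replaced by the global mean frame with
        # probability drop_frame_rate, which blurs the decoder's view of the
        # previous frame and (up to a point) stabilises attention.
        drop_mask = torch.rand(mels.size(0), 1, mels.size(2),
                               device=mels.device) < drop_frame_rate
        return torch.where(drop_mask, global_mean.view(1, -1, 1), mels)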

DatGuy1 commented 4 years ago

Two things:

  1. I'm very noobish with the learning rate. My average max attention weight went from 0.58 to 0.6 in 5k steps, and the validation loss has actually gone up from 2 to 2.6. I'm assuming those numbers are wrong. My starting learning_rate is 0.1e-5, which is the default and which I'm not sure why I didn't change. Weight decay and run_every_epoch.py are also default. What should I set the starting learning rate and the params in run_every_epoch to?

  2. When I open the inferred audio in an audio editing program and manually raise the pitch by ~100%, it sounds like it could be decent, but I can't tell due to the distortion. The only thing I can think of that could impact it is that the audio files in my dataset are 22050Hz while yours are 44100Hz, but I changed that in the hparams. Anything I'm missing?

CookiePPP commented 4 years ago

@DatGuy1

What should I change the starting learning rate and params in run_every_epoch to?

run_every_epoch will override anything in hparams. It runs 'live' and I manually adjust the learning rate using it. decay_start, A_, B_ and C_ decide the learning rate; you can play with them at https://www.desmos.com/calculator/x6fkjjnhut, where x = iteration and y = learning_rate. The learning rate should start around 1e-3 and decrease to 1e-5 from 100,000 iterations to 300,000 iterations.

For example, a run_every_epoch like the one below would work fine:

current_iteration = iteration
decay_start = 300000
if current_iteration < 100000:
    A_ = 100e-5
elif current_iteration < 150000:
    A_ = 50e-5
elif current_iteration < 200000:
    A_ = 20e-5
elif current_iteration < 250000:
    A_ = 10e-5
elif current_iteration < 300000:
    A_ = 5e-5
else:
    A_ = 5e-5
B_ = 30000
C_ = 0e-5
min_learning_rate = 1e-6
epochs_between_updates = 1
drop_frame_rate = min(0.000010 * max(current_iteration-5000,0), 0.2) # linearly increase DFR from 0.0 to 0.2 from iteration 5000 to 45000.
p_teacher_forcing = 0.95
teacher_force_till = 0
val_p_teacher_forcing=0.80
val_teacher_force_till=30
grad_clip_thresh = 1.5

Anything I'm missing?

https://github.com/NVIDIA/tacotron2/blob/master/hparams.py#L36-L42

sampling_rate=22050,
filter_length=1024,
hop_length=256,
win_length=1024,
n_mel_channels=80,
mel_fmin=0.0,
mel_fmax=8000.0,

Everything other than n_mel_channels will need to be updated for a different sampling rate. I wouldn't recommend learning default params from this repo; it's called codedump because it's where I dump all of my code. :smile:
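
That said, a plausible 44,100 Hz scaling (an assumption for illustration, not this repo's official 44.1 kHz settings) keeps the frame duration the same by doubling the STFT sizes:

    # Assumed 44.1 kHz scaling (illustrative, not this repo's official values):
    # double the 22,050 Hz STFT sizes so the hop stays ~11.6 ms, and raise
    # mel_fmax along with the higher Nyquist frequency.
    sampling_rate=44100,
    filter_length=2048,
    hop_length=512,
    win_length=2048,
    n_mel_channels=80,   # unchanged
    mel_fmin=0.0,
    mel_fmax=16000.0,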

DatGuy1 commented 4 years ago

Thanks! You're using iteration numbers in the hundred thousands, but since I'm warm starting, mine starts at 0 (at ~6/s per iteration). Should I scale your numbers down, something like dividing them by 100?

CookiePPP commented 4 years ago

@DatGuy1 If you're warm starting from one of the already existing models then you'll need to find your own ideal learning rates. The schedule I showed would be used if training from scratch.

DatGuy1 commented 4 years ago

Say you're adding a new speaker with a small dataset. You set the training and validation filelists and generate the mels and TorchMoji emotions. Then do you start training with warm start, or normally? And what would your learning rate be?

CookiePPP commented 4 years ago

Warm start would be faster, and the learning rate would be 100e-5 to start with.

DatGuy1 commented 4 years ago

I trained a model to a mere 2k steps with these hparams:

sampling_rate=22050, filter_length=1024, hop_length=256, win_length=1024, n_mel_channels=80, mel_fmin=0.0, mel_fmax=8000.0

I then tried to generate with the "LargeWaveGlow V3.5" model. As expected, it failed due to my model having 176 channels (?) instead of the expected 256. So I switched to NVIDIA's pretrained WaveGlow model, and while the output is still unintelligible, the pitch sounds right. However, the alignments I see in Tensorboard are pretty much nothing, whereas before I changed those settings they looked good and were very linear. Basically, I'm wondering if what I changed broke the whole process?

DatGuy1 commented 4 years ago

I'm thinking I'll try to remake the dataset with 44100Hz audio and be a guinea pig for your cookietts repo.

DatGuy1 commented 4 years ago

@CookiePPP When I try to add a new speaker, it overwrites the previously trained speaker. I think it's something to do with the weird ordering of the speaker lookup table. I started my first voice with --warm_start_force and speaker ID 297, since that was the next free number, and it worked well enough. Then I trained another voice with ID 298, but afterwards the 297 voice sounded like one of the voices from the original model. What do I do?

CookiePPP commented 4 years ago

@DatGuy1 I don't fully understand, but if you want to train multiple new voices you should train them together.

DatGuy1 commented 4 years ago

Huh. So you mean take my filelists and merge them together? I thought I would be able to add speakers as time goes on.