152334H / DL-Art-School

TorToiSe fine-tuning with DLAS
GNU Affero General Public License v3.0

Figure out the best training hyperparameters #1

Open 152334H opened 1 year ago

152334H commented 1 year ago

The numbers written in ./experiments/EXAMPLE_gpt.yml were picked completely at random! It is very likely the numbers can be better, so long as people are willing to test and see what works.

Please post results here if you change any of the parameters, even if it completely fails!

152334H commented 1 year ago

experiment 2

For my 2nd experiment (the first one being the one on the README page), I:

- switched to a large mixed multi-speaker dataset
- raised the learning rate

Both of these moves appear to have been mistakes. My mixed dataset was highly imbalanced, with >70% of the speech coming from a single narrator (in a dataset of 100 speakers); this biased all voice outputs severely towards the most common speaker. I also observed much more noise in the resultant outputs, which might be due to the dataset, the higher learning rate, or the lack of other model fine-tunes.
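A quick way to catch this kind of imbalance before training is to count lines per speaker in the train file. A minimal sketch, assuming an LJSpeech-style `path|text` format where the speaker name is the first directory component under `wavs/` (that layout is an assumption, not something the repo mandates):

```python
from collections import Counter

def speaker_share(lines):
    """Return the most common speaker and its fraction of all samples."""
    counts = Counter(line.split("|", 1)[0].split("/")[1] for line in lines)
    top_speaker, top_count = counts.most_common(1)[0]
    return top_speaker, top_count / sum(counts.values())

lines = [
    "wavs/narrator/a.wav|hello",
    "wavs/narrator/b.wav|world",
    "wavs/narrator/c.wav|again",
    "wavs/alice/d.wav|hi",
]
speaker, share = speaker_share(lines)
print(speaker, share)  # the narrator dominates with 75% of the samples
```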

Might commit results later, but my conclusions here are:

152334H commented 1 year ago

experiment 3

This was the one where I first used the colab notebook.

image

It went pretty well, which was surprising because the dataset had <200 samples.

However, this only really worked because I manually adjusted a whole bunch of parameters down. That led me to develop automatic calculations for some parameters based on the dataset.
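As a rough illustration of that idea, here is a sketch of deriving batch size, step count, and lr-decay milestones from the dataset size. The formulas below are invented for illustration; they are not the notebook's actual calculations:

```python
import math

def auto_params(n_samples, target_epochs=20, max_batch=128):
    """Derive illustrative training parameters from the dataset size."""
    batch_size = min(max_batch, max(1, n_samples // 2))
    steps_per_epoch = math.ceil(n_samples / batch_size)
    niter = steps_per_epoch * target_epochs
    # Decay the learning rate at roughly 50% and 75% of training.
    gen_lr_steps = [niter // 2, (niter * 3) // 4]
    return {"batch_size": batch_size, "niter": niter, "gen_lr_steps": gen_lr_steps}

# For a <200-sample dataset like the one in experiment 3:
print(auto_params(200))  # {'batch_size': 100, 'niter': 40, 'gen_lr_steps': [20, 30]}
```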

experiment 4

image

This was just a redo of the previous experiment with the new automatic parameter system. Worked well enough.

152334H commented 1 year ago

experiment 5

image

This was my 2nd attempt at a multispeaker training session. This time, I capped samples for every character at a maximum of 1000 lines (in the training DS). I learned a few things:
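The cap itself is simple to apply when building the training file. A minimal sketch, again assuming a hypothetical LJSpeech-style `path|text` layout with the speaker name in the path:

```python
from collections import defaultdict

def cap_per_speaker(lines, cap=1000):
    """Keep at most `cap` lines per speaker, preserving order."""
    kept, seen = [], defaultdict(int)
    for line in lines:
        speaker = line.split("|", 1)[0].split("/")[1]
        if seen[speaker] < cap:
            seen[speaker] += 1
            kept.append(line)
    return kept

lines = [f"wavs/narrator/{i}.wav|text" for i in range(5)] + ["wavs/alice/0.wav|hi"]
print(len(cap_per_speaker(lines, cap=3)))  # 3 narrator lines + 1 alice line = 4
```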

152334H commented 1 year ago

experiment 6

Testing on a different dataset this time. Single speaker, female, emotional, fairly large dataset with maybe 1-2k samples.

Now that I've gotten the validation metrics working, I can use those as graphs:

image

This was a disastrous outcome, and the voices were all garbled when I tested them. I don't know why; maybe the speaker is too different. I didn't change anything about the training process.

152334H commented 1 year ago

experiment 7

First case of diffusion fine-tuning! It looked amazingly good on the tensorboard graphs:

image

But the results were absolutely horrible! Sounded like random noise, incredible how bad it was.

devilismyfriend commented 1 year ago

I've been running tests on small datasets (15 total samples), and I notice the result sounds stepped, like he's talking through a broken speaker. Even weirder, the higher the preset you go, the more it clears up and sounds good. I tried various combinations and believe it's mainly influenced by the autoregressive sample amount; not sure why it's producing this sort of effect.

Comparison:

All are using the same seed and the same candidate is being compared

Standard Preset - fine-tuned: https://vocaroo.com/1mhvTk3mpPXt

Standard Preset - Original: https://vocaroo.com/1jjsMCN54BZU

Ultra-fast - fine-tuned (notice the hiss and stepping): https://vocaroo.com/1mYxZIWlHhZb

Ultra-fast - Original: https://vocaroo.com/14VuHwH3s5Vw

152334H commented 1 year ago

Training curves && params would be good. It probably overfit on the small amount of data included, which could be made less bad if I manage to fix the conditioning latents problem.

devilismyfriend commented 1 year ago

> Training curves && params would be good. It probably overfit on the small amount of data included, which could be made less bad if I manage to fix the conditioning latents problem.

I wonder if this can be further fixed using AudioLDM once they release their audio super-resolution; voicefixer completely destroys the speech.

devilismyfriend commented 1 year ago

In regards to the diffusion model: I talked to the dev who wrote on Reddit that he retrained the VQVAE; he said he didn't retrain the diffusion model at all.

devilismyfriend commented 1 year ago

BTW, check out this thread where neonbjb discusses the GPT training: https://github.com/neonbjb/DL-Art-School/issues/10

152334H commented 1 year ago

I'm aware of the cheater latents problem; I discuss the difficulties with fixing it here. But thanks for the link nonetheless.

152334H commented 1 year ago

> I wonder if this can be further fixed using AudioLDM once they release their audio super-resolution; voicefixer completely destroys the speech.

I haven't checked it out, I'll go do that later

> In regards to the diffusion model: I talked to the dev who wrote on Reddit that he retrained the VQVAE; he said he didn't retrain the diffusion model at all.

Did he mean to recreate the VQVAE from scratch, or to fine-tune?

devilismyfriend commented 1 year ago

I'm not sure tbh.

BTW, these configs were shared by neon in some random discussion a while back. They're different from the ones in the original DL repo; perhaps they could help make sense of how he trained his GPT: tts_flat_autoregressive_inputs_r2.zip

152334H commented 1 year ago

These are all very interesting... they look like the exact configs he used to train the actual tortoise model. This is the first time I've seen the real filepaths to his larger ocotillo transcribed dataset. I can already see some errors I made regarding the diffusion model trainer, like layer drop or lr decay.

This is good. Where was it from?

devilismyfriend commented 1 year ago

https://github.com/neonbjb/tortoise-tts/discussions/124

Ryu1845 commented 1 year ago

Regarding finding good hyperparameters, I think this might be useful: https://github.com/optuna/optuna

xenotropic commented 1 year ago

I ran a bunch of experiments reducing lr (with its helpful bold comment "you should experiment with this value"). Reducing it seems to resolve the "stepped, like he's talking through a broken speaker kind of thing, even weirder, the higher the preset you go the more it clears up" situation. I found that values between 1e-7 and 5e-8 worked best (kinda hard to tell within that range which is best), avoiding both the unsmooth robot-like tonality of zero-shot (i.e., the original model) and the stepped sound of 1e-5.

I'm using ~180 samples, a .85/.15 train/validate split, niter (I'm assuming this is "number of iterations" and synonymous with "steps") of 1800, so ~12 epochs, and then gen_lr_steps [462, 924, 1386, 1618], so stepping down the lr every few epochs. At least, that's what I think I'm doing anyway (not an ML genius), and it sounds pretty good. I'm training on a pretty normal voice that isn't that far off the libritts-ish voices, so it may not need as much training as other voices would.
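For reference, `gen_lr_steps` behaves like a milestone schedule: the lr is multiplied by a decay factor each time training passes one of the listed steps. A pure-Python sketch, with a 0.5 decay factor as an assumption (DLAS configs set their own gamma):

```python
def lr_at_step(step, base_lr=1e-5, milestones=(462, 924, 1386, 1618), gamma=0.5):
    """Return the lr after applying `gamma` once per milestone already passed."""
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * (gamma ** passed)

print(lr_at_step(0))     # 1e-05 (no milestones passed)
print(lr_at_step(1000))  # two milestones passed -> 2.5e-06
```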

The thing that still plagues me is issue 237 in the original tortoise repo: repeats (so, an inference issue, not hparams). Posting on that in #61 to keep topics clean.

xenotropic commented 1 year ago

Sorry, ignore the last comment. I hadn't comprehended steps well enough (i.e., "one batch of batch_size is 1 unit/step" in the example yml). I had a batch size of 77 (so two batches per epoch with 154 training samples), so 1800 steps was hundreds of epochs. Interesting to experiment with a low learning rate and lots of iterations, I guess; nothing good enough to recommend. It works much better with lr 1e-5 and 5 to 8 epochs.
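The arithmetic behind that correction, as a tiny sketch:

```python
import math

def epochs_for(niter, n_samples, batch_size):
    """Epochs implied by a step budget: steps divided by batches per epoch."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return niter / steps_per_epoch

# 154 samples at batch_size 77 is 2 steps per epoch,
# so 1800 steps is 900 epochs -- not ~12.
print(epochs_for(1800, 154, 77))  # 900.0
```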

Are there units to the y-axis val_loss_text_ce? Is that just an arbitrary loss function? Trying to figure out if one can infer anything from the difference between it converging on, say 1.31 in experiment 6 here versus on 4.4 here in one of my recent experiments (or other future graphs), or if it is just more about the shape of the curve.

image
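On the units question: if `val_loss_text_ce` is a standard natural-log cross-entropy (as e.g. PyTorch's `CrossEntropyLoss` is; whether DLAS's metric matches is an assumption on my part), the value is in nats and `exp(loss)` gives a per-token perplexity, which does make absolute values comparable across runs:

```python
import math

# Convert a natural-log cross-entropy value to per-token perplexity.
for loss in (1.31, 4.4):
    print(f"loss {loss} -> perplexity {math.exp(loss):.1f}")
```

Under that assumption, the gap between 1.31 and 4.4 is not just curve shape; it is a large difference in how well the model predicts the next token.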

FurkanGozukara commented 1 year ago

Are you changing the temperature or top_p when using tortoise fast? So a lower learning rate works better?

xenotropic commented 1 year ago

Caveat that I'm just a hobbyist here, so my theoretical conceptions of these things are at a "I read a blog post about them" level. But I can report that I have done experiments and can't discern any meaningful difference when moving the temperature or top_p dials (from .5 to .95 in each case). Nor repetition_penalty or length_penalty, for that matter; nothing. At first I thought low top_p made the sound more "boring" (less prosody), but listening again now, I think maybe that's just a bias from having read the docs, which say that's what it does.
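For context on what the top_p dial does during autoregressive sampling: nucleus filtering keeps only the smallest set of tokens whose cumulative probability reaches p, then renormalizes. A pure-Python illustration, not tortoise's actual sampling code:

```python
def top_p_filter(probs, p=0.9):
    """Keep the highest-probability tokens until their mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        total += prob
        if total >= p:
            break
    norm = sum(kept.values())
    return {t: pr / norm for t, pr in kept.items()}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_p_filter(probs, p=0.8))  # keeps "a" and "b", renormalized
```

With a fine-tuned model the distribution may already be sharply peaked, which would explain why moving the dial between .5 and .95 changes little: the nucleus contains the same few tokens either way.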