CookiePPP / cookietts

[Last Updated 2021] TTS from Cookie. Messy and experimental!
BSD 3-Clause "New" or "Revised" License

train.py now broken #16

Closed. DatGuy1 closed this issue 4 years ago.

DatGuy1 commented 4 years ago

If the first training iteration has a gradient overflow (and is skipped), then due to this change an `UnboundLocalError: local variable 'average_loss' referenced before assignment` is thrown.
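Rough sketch of the pattern (not the actual train.py code, just the failure mode): when the running average is only updated on non-skipped iterations, a skipped first iteration leaves the variable unbound.

```python
import random

# Minimal sketch of the failure mode (illustrative only, not the repo's loop):
# `average_loss` is only assigned on iterations that are NOT skipped, so if
# the very first iteration hits a gradient overflow, the later use of the
# variable raises UnboundLocalError.
def train_sketch(num_iters=3):
    for iteration in range(num_iters):
        overflow = (iteration == 0)         # pretend the first step overflows
        if not overflow:
            average_loss = random.random()  # only assigned on non-skipped steps
        print(iteration, average_loss)      # UnboundLocalError on iteration 0
    # Typical fix: initialise average_loss before the loop, or only use it
    # inside the `if not overflow:` branch.

train_sketch()
```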

CookiePPP commented 4 years ago

Give me a line number or screenshot?

DatGuy1 commented 4 years ago

I linked the line in the commit. Screenshot

CookiePPP commented 4 years ago

I don't get any line highlighted or scrolled to when I use that link.

CookiePPP commented 4 years ago

anyway https://github.com/CookiePPP/cookietts/commit/a48899296b840b5f053f52e7573a9664a880c993

DatGuy1 commented 4 years ago

> I don't get any line highlighted or scrolled to when I use that link.

I think it's because train.py is collapsed by default in that commit view, since it was a large diff.

DatGuy1 commented 4 years ago

@CookiePPP another error. Not sure the cause but on any validation (currently using LJSpeech):

Traceback (most recent call last):
  File "train.py", line 905, in <module>
    train(args, args.rank, args.group_name, hparams)
  File "train.py", line 773, in train
    val_att_loss, *_ = validate(hparams, args, file_losses, model, criterion, valset, best_val_loss_dict, iteration, collate_fn, logger, 0, 0.0, teacher_force=2)# infer
  File "train.py", line 432, in validate
    loss_dict_total = {k: v/(i+1) for k, v in loss_dict_total.items()}
AttributeError: 'NoneType' object has no attribute 'items'

CookiePPP commented 4 years ago

@DatGuy1 Can you check your validation file(s)? This error would only occur if your validation set was smaller than your batch size (which is fucking unlikely under normal conditions).
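For reference, here's a rough sketch of the pattern behind that error (not the exact train.py code): the totals dict is only created inside the batch loop, so an empty val_loader leaves it as None when the averaging line runs.

```python
# Sketch only: loss_dict_total starts as None and is filled inside the batch
# loop, so if val_loader yields zero batches (e.g. fewer validation files than
# val_batch_size with drop_last=True), the averaging line sees None.
def validate_sketch(val_batches):
    loss_dict_total = None
    for i, batch in enumerate(val_batches):
        loss_dict = {"loss": sum(batch) / len(batch)}  # stand-in for criterion output
        if loss_dict_total is None:
            loss_dict_total = {k: 0.0 for k in loss_dict}
        for k, v in loss_dict.items():
            loss_dict_total[k] += v
    # AttributeError: 'NoneType' object has no attribute 'items' if the loop never ran
    return {k: v / (i + 1) for k, v in loss_dict_total.items()}

print(validate_sketch([[1.0, 2.0], [3.0, 4.0]]))  # fine
validate_sketch([])                               # reproduces the crash
```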

DatGuy1 commented 4 years ago

val_batch_size is at its default of 32. There are 610 validation files and they're all checked. Full output here

CookiePPP commented 4 years ago

Alright, 2 things. https://github.com/CookiePPP/cookietts/blob/a48899296b840b5f053f52e7573a9664a880c993/CookieTTS/_2_ttm/tacotron2_tm/hparams.py#L72-L84

Which data_source 'mode' are you using?

Did you add speaker ids to any of the datasets you're testing? This repo isn't tested with single-speaker datasets, though I wouldn't have expected any failures around the validation area due to wonky/missing IDs.

DatGuy1 commented 4 years ago

data_source is 0. I'm testing with a single speaker and LJSpeech just to make myself familiar with it before moving onto anything serious. It must've been one of the recent commits, possibly overflow related, since I could train it fine before.

From my testing: valset is fine with 610 files just before entering the for loop. Length of val_loader is 19. The only modification I've made that I can think of is disabling all the distributed-training settings, i.e. num_workers = 0, distributed_run = False, etc.

CookiePPP commented 4 years ago

https://github.com/CookiePPP/cookietts/blob/experimental/CookieTTS/_2_ttm/tacotron2_tm/train.py#L396

I added this line a little while ago. It changes the 2nd pass of validation to sample from each speaker equally, so the inference plots on tensorboard don't massively overweight speakers with more data. I think that's what's failing when using single-speaker datasets, though I don't see the exact line inside the function that's messing up. I'll add an hparam you can flip in a sec.
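Roughly, the idea is something like this (illustrative only; the (path, text, speaker_id) filelist layout is an assumption, not the repo's actual code):

```python
import random
from collections import defaultdict

# Sketch of "sample from each speaker equally": take the same number of files
# per speaker so the tensorboard inference plots don't over-weight speakers
# that have more audio data.
def equal_speaker_subset(filelist, files_per_speaker=4, seed=0):
    by_speaker = defaultdict(list)
    for path, text, speaker_id in filelist:
        by_speaker[speaker_id].append((path, text, speaker_id))
    rng = random.Random(seed)
    subset = []
    for files in by_speaker.values():
        rng.shuffle(files)
        subset.extend(files[:files_per_speaker])
    return subset
```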

CookiePPP commented 4 years ago

https://github.com/CookiePPP/cookietts/commit/726249e212b530ca64b7c7b59cd6f0bf59f8a2d2


inference_equally_sample_speakers=True,# Will change the 'inference' results to use the same number of files from each speaker.
                                       # This makes sense if the speakers you want to clone aren't the same as the speakers with the most audio data.
DatGuy1 commented 4 years ago

Yep, that's the one. Works now.