MycroftAI / mimic2

Text to Speech engine based on the Tacotron architecture, initially implemented by Keith Ito.
Apache License 2.0

Weird alignment images and bad sound even after 100k steps. #27

Open Ittiz opened 5 years ago

Ittiz commented 5 years ago

So I've been trying to train this on the LJSpeech data set since it seemed like the most solid one out there. However, I've been having an issue where there is a messy band across the alignment, and the disturbance remains even after it finds alignment. The step audio clips sound great after 20k steps or so, but if you synthesize using the demo server it sounds like garbled junk even after 100k steps. Here are some alignment images from when I did 100k steps on the CPU:

step-1000-align This is step 1k, you can already see the band.

step-50000-align 50k steps, band still there, but starting to align finally!

step-100000-align 100k step, alignment improving, but band still there!

I used the default hparams for the CPU run. Then I decided to use the GPU. The GPU is a K620 with only 2 GB of memory, so I had to set the hparams like this to not OOM:

    # Audio:
    num_mels=80,
    num_freq=1025,
    min_mel_freq=125,
    max_mel_freq=7600,
    sample_rate=22050,
    frame_length_ms=50,
    frame_shift_ms=12.5,
    min_level_db=-100,
    ref_level_db=20,

    #MAILABS trim params
    trim_fft_size=1024,
    trim_hop_size=256,
    trim_top_db=40,

    # Model:
    # TODO: add more configurable hparams
    outputs_per_step=5,
    embedding_dim=512,

    # Training:
    batch_size=16,
    adam_beta1=0.9,
    adam_beta2=0.999,
    initial_learning_rate=0.0015,
    learning_rate_decay_halflife=100000,
    use_cmudict=False,   # Use CMUDict during training to learn pronunciation of ARPAbet phonemes

    # Eval:
    max_iters=200,
    griffin_lim_iters=50,
    power=1.5,    
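As a rough illustration of what the audio and eval settings above imply, here is a minimal sketch that converts them into sample counts. It assumes the STFT sizes are derived the way keithito's Tacotron `audio.py` does it (`n_fft = (num_freq - 1) * 2`, hop and window lengths truncated from milliseconds), and that `max_iters` caps the number of decoder steps, each emitting `outputs_per_step` frames. The function names are hypothetical and not part of mimic2.

```python
def stft_params(sample_rate, frame_length_ms, frame_shift_ms, num_freq):
    """Return (n_fft, hop_length, win_length) in samples.

    Assumption: same derivation as keithito's Tacotron audio utilities.
    """
    n_fft = (num_freq - 1) * 2
    hop_length = int(frame_shift_ms / 1000 * sample_rate)
    win_length = int(frame_length_ms / 1000 * sample_rate)
    return n_fft, hop_length, win_length


def max_synthesis_seconds(max_iters, outputs_per_step, frame_shift_ms):
    """Upper bound on synthesized audio length.

    Assumption: max_iters limits decoder steps, and every decoder step
    produces outputs_per_step spectrogram frames.
    """
    return max_iters * outputs_per_step * frame_shift_ms / 1000.0


if __name__ == "__main__":
    # Values taken from the hparams listed in this comment.
    print(stft_params(22050, 50, 12.5, 1025))
    print(max_synthesis_seconds(200, 5, 12.5))
```

With these values each decoder step covers 5 frames of 12.5 ms, so the alignment plot's decoder axis advances 62.5 ms per step.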

Note I had to bring the batch size down to conserve memory, and I also changed the sample rate to 22050 from 22000 because that's what was listed in the data set. I thought that may be the issue. I only ran it for 12 hours so I didn't get a lot of steps, but here are the results:

step-1000-align 1k steps, hmm looks like that dang band again to me!

step-12000-align 12k steps, this is where I stopped it because I didn't want to bother wasting more time, but the band is still there looking stronger than ever!

Anyone have any clue what could be causing this issue? Is there anything I can tweak in the hparams to correct this? Could it be an issue with the code? On a side note, if I use the demo server or listen to the alignment clips, they are much MUCH louder than the sample data. I'm not sure if that's related or controllable somehow.

el-tocino commented 5 years ago

Your GPU needs more RAM to do training. Per comments on other Tacotron repos, you should try to have a batch size of 32 or above to get alignment.

I also get the triangle charts with my dataset, not sure what causes that.

Ittiz commented 5 years ago

Like I said, I did over 100k steps using a batch size of 32 on CPUs with 48 GB of RAM. Same problem whether I use the CPU or the GPU. I just tried the CPU with a batch size of 64 and got the same issue. A band shows across the top:

step-3000-align

Again, I'm not sure if it's relevant but the wavs it generates are WAY louder than the original training wavs. I have to turn the training wavs volumes to about 75% to hear them well on vlc and mplayer. The clips generated by Mimic2 I have to set to 3% and it's still twice as loud as the training clips at 75%.
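One way to put a number on the loudness difference, rather than comparing player volume settings, is to measure peak and RMS level in dBFS for a training clip and a generated clip. The following is a minimal sketch using only the Python standard library. It assumes 16-bit PCM WAV files, and the helper names are hypothetical (not part of mimic2).

```python
import array
import math
import sys
import wave


def read_int16_samples(path):
    """Read a 16-bit PCM WAV file and return its samples (channels interleaved)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples


def level_dbfs(samples):
    """Return (peak, rms) in dB relative to 16-bit full scale."""
    if len(samples) == 0:
        raise ValueError("no samples")
    full_scale = 32768.0
    peak = max(abs(s) for s in samples) / full_scale
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / full_scale

    def to_db(x):
        return 20.0 * math.log10(x) if x > 0 else float("-inf")

    return to_db(peak), to_db(rms)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python levels.py training_clip.wav generated_clip.wav
    for path in sys.argv[1:]:
        peak, rms = level_dbfs(read_int16_samples(path))
        print(f"{path}: peak {peak:.1f} dBFS, rms {rms:.1f} dBFS")
```

A generated clip whose peak sits at or near 0 dBFS while the training clips sit well below it would be consistent with the output being normalized to full scale, though that is only a guess about the cause here.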

Not sure what you mean by triangles? I'm complaining about the band that never goes away! You can see it in all my alignment images.

el-tocino commented 5 years ago

See your 100k step picture for what I mean by triangle. The output tends to still echo when that occurs. I'm not sure why yours are ending up with the test wavs being loud. I haven't been able to train LJ on mimic2 successfully, though one of the Mycroft folks said he was able to.

Ittiz commented 5 years ago

That just means that the training is working: https://github.com/keithito/tacotron/issues/144

So you've tried training on the LJSpeech data set as well? Did you get the same interference I'm getting in your alignment graphs and synthesized audio?

el-tocino commented 5 years ago

It doesn't work, though. The generated samples from models that have the weird align/fuzzy bar triangley thing end up either being filled with echo or losing coherency quickly. Aligned models from previous iterations of tacotron/mimic2 I've run haven't had those issues, and their alignment charts are much closer to ideal (i.e., just a line going from bottom left to upper right).
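To make the difference between a clean diagonal and a persistent band a bit more concrete, here is a minimal sketch that looks at how attention mass is spread over encoder positions. It assumes the alignment is available as a 2D list where `alignment[i][j]` is the weight on encoder position `i` at decoder step `j`. That orientation, the function names, and the threshold `factor` are illustrative assumptions, not something taken from mimic2.

```python
def row_attention_mass(alignment):
    """Fraction of the total attention mass that falls on each encoder row."""
    total = sum(sum(row) for row in alignment)
    if total == 0:
        raise ValueError("alignment has no attention mass")
    return [sum(row) / total for row in alignment]


def find_bands(alignment, factor=3.0):
    """Indices of encoder rows holding more than `factor` times the uniform share.

    In a clean diagonal alignment every row gets roughly 1 / num_rows of the
    mass; a horizontal band shows up as one row with far more than that.
    """
    masses = row_attention_mass(alignment)
    uniform = 1.0 / len(masses)
    return [i for i, mass in enumerate(masses) if mass > factor * uniform]


if __name__ == "__main__":
    n = 5
    # Ideal case: each decoder step attends to exactly one encoder position.
    diagonal = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Banded case: a weak diagonal plus constant attention on the last row.
    banded = [[0.3 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):
        banded[n - 1][j] += 0.7
    print(find_bands(diagonal))
    print(find_bands(banded))
```

This only flags rows that absorb a disproportionate share of attention; it does not say why the band appears.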

Ittiz commented 5 years ago

So what changed to create this issue? Do you know roughly when it started to crop up?

el-tocino commented 5 years ago

There was a bunch of stuff updated last September or so. For a test, try the following: preprocess all your data with the mimic2 repo, then clone keithito's tacotron repo and use it to do the training for 25k steps or so, by which time you should see normal alignment.

Ittiz commented 5 years ago

Mimic2 crabs about "bias not found in checkpoint" even when the data was preprocessed with Mimic2. I guess I'll have to figure this out on my own.

el-tocino commented 5 years ago

Did you clear out previous training run step/model/checkpoints?

Ittiz commented 5 years ago

I cut and pasted them into a different folder.

Ittiz commented 5 years ago

Okie dokie! Seems I fixed it by merging Mimic2 and keithito's repos in my own fork. I'm not home so I haven't listened to it yet, but the alignment is looking much better. If all sounds well I'll push the changes to my fork and other people can test it out.

step-23000-align

Ittiz commented 5 years ago

So, more issues: the thing is still super loud. Also, it only aligns sometimes, even with my modifications. I have a feeling the remaining issues are volume related. For now I'm just going to use Tacotron. I modified Tacotron so I can use it to interface with Mycroft, and it seems to be working.

Ruthvicp commented 5 years ago

I have used a different dataset (private) and trained for 18k steps using the existing mimic2 (master branch). I was able to get good alignment and also a decent voice.

image

Could you please share your plots generated using this?

Ittiz commented 5 years ago

The plots above were generated using mimic2 on the LJSpeech data set. It could be that there is something particular to the LJSpeech data set that makes it not work well with Mimic2, I dunno.