geneing / WaveRNN-Pytorch

Fatcord's Alternative WaveRNN (Faster training)
MIT License

Getting error for training with Tacotron #8

Open rishikksh20 opened 5 years ago

rishikksh20 commented 5 years ago

I used this implementation (https://github.com/Rayhane-mamah/Tacotron-2) for pre-processing. When I run the training command `python3 train.py --dataset Tacotron training_data`, I get this error:

x = torch.cat([x.unsqueeze(-1), mels, a1[:,:,:-1]], dim=2)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 2. Got 1000 and 1280 in dimension 1 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:83

The error seems straightforward: the concatenation dimension sizes don't match. I debugged it and found the following sizes for the three tensors:

print(x.unsqueeze(-1).size())   # ----> torch.Size([64, 1280, 1])
print(mels.size())              # ----> torch.Size([64, 1000, 80])
print(a1[:,:,:-1].size())       # ----> torch.Size([64, 1000, 31])

Clearly, dimension 1 of x and mels are not equal. @geneing So how do I resolve this? Do I need to do some kind of reshaping, or something else?

rishikksh20 commented 5 years ago

OK, by changing `hop_size` to 200 I was able to resolve the issue, but I want to train at sample rate 22050 with the following settings:

num_mels = 80, 
num_freq = 513,
fft_size = 1024,
hop_size = 256,
sample_rate = 22050

So do you have any solution for that? Can I train a `--dataset Tacotron` model with these settings?

G-Wang commented 5 years ago

In hparams.py, you need to make sure your upsample factors multiply out to equal the hop size used for processing the mel spectrogram. E.g., if Tacotron 2 has a hop size of 256, you can use either (4, 8, 8) or (4, 4, 16) for the upsample factors.
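
As a quick sanity check, something like this (assuming hparam names like `hop_size` and `upsample_factors` from this repo's hparams.py) catches the mismatch early:

```python
import numpy as np

hop_size = 256                 # must match the hop size used in preprocessing
upsample_factors = (4, 4, 16)  # (4, 8, 8) works too: both multiply to 256

# The conditioning network upsamples mel frames by the product of these
# factors, so the product must equal the hop size in samples per frame.
assert np.prod(upsample_factors) == hop_size, \
    f"{upsample_factors} multiplies to {np.prod(upsample_factors)}, not {hop_size}"
```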

rishikksh20 commented 5 years ago

@G-Wang Yeah, I resolved that but got another error:

Traceback (most recent call last):
  File "train.py", line 444, in <module>
    train_loop(device, model, data_loader, optimizer, checkpoint_dir)
  File "train.py", line 305, in train_loop
    for i, (x, m, y) in enumerate(tqdm(data_loader)):
  File "/usr/local/lib/python3.6/dist-packages/tqdm/_tqdm.py", line 1022, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 623, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/ubuntu/Dev/rishikesh/speech_synthensis/symon/WaveRNN-Pytorch/dataset.py", line 132, in discrete_collate
    mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
  File "/home/ubuntu/Dev/rishikesh/speech_synthensis/symon/WaveRNN-Pytorch/dataset.py", line 132, in <listcomp>
    mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
  File "mtrand.pyx", line 992, in mtrand.RandomState.randint
ValueError: Range cannot be empty (low >= high) unless no samples are taken

I know the issue is with the pre-processed dataset, so I am working to resolve it. In case you have any idea regarding this, please let me know.

geneing commented 5 years ago

@rishikksh20 Could you please set a breakpoint at dataset.py line 132 and check what the max_offsets list contains? If it contains negative offsets, could you please check the "batch" list and the shapes of its entries?

Basically, max_offsets contains, for each mel, the last column of the input that can be used as a window start (due to required padding). If the mel input is too short, max_offset will be negative and the next line will fail.

I think I had one dataset with sentences of just one or two words, which resulted in very short wav files and very few mel frames.
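
To make the failure mode concrete, here's a rough sketch (the window and padding sizes below are illustrative, not the repo's exact values):

```python
import numpy as np

seq_len, pad = 5, 2  # mel window length and per-side padding (illustrative)

# Mel lengths in frames for a batch; the last clip is very short:
mel_lengths = [40, 100, 3]
max_offsets = [n - (seq_len + 2 * pad) for n in mel_lengths]
print(max_offsets)  # [31, 91, -6] -> negative for the short clip

# dataset.py line 132 then samples a random start column per mel, and
# np.random.randint(0, -6) raises:
# ValueError: Range cannot be empty (low >= high) unless no samples are taken
mel_offsets = [np.random.randint(0, offset) for offset in max_offsets]
```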

rishikksh20 commented 5 years ago

@geneing How many hours of data are required to generate a good voice? For example, WaveNet generates a good voice from 2 hours of data.

geneing commented 5 years ago

@rishikksh20 I've been using two datasets, LJSpeech and M-AILABS (Mary Ann reader). Both are ~24 hours of speech. I haven't tried smaller datasets because I use the same dataset for Tacotron training - in the end, what matters to me is the voice quality from mel specs produced from text by Tacotron. Besides, voice quality evaluation is highly subjective :).

echelon commented 5 years ago

@geneing Sorry for dog-piling on this issue, but since you mentioned it, what were the hparams you used with LJSpeech?

I tried the following and trained for 5000 epochs (505000 steps), but the results sound like gibberish. (This is one of the mels in the dataset.)

hop_size=256,
sample_rate=22050,
upsample_factors=(4, 4, 16),

 # shouldn't have any impact, but including for posterity:
save_every_step=5000,
evaluate_every_step=5000,

Like rishikksh20, I'm also using Rayhane-mamah/Tacotron-2 for preprocessing, but I've made the following hparams adjustments there:

tacotron_batch_size = 8,
wavenet_batch_size = 2,

Could it be that by trying to reduce my GPU memory footprint in Tacotron-2, I've affected my WaveRNN training? Or do I just have bad hparams for WaveRNN that don't account for the 22050 Hz sample rate? Or maybe I'm simply not training long enough?

rishikksh20 commented 5 years ago

I'm getting a very large loss for the mixture input type:

using noam learning rate decay
no checkpoint specified as --checkpoint argument, creating new model...
100%|██████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:0, running loss:323150838.5625, average loss:1369283.2142478814, current lr:1.475e-05, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.99it/s]
epoch:1, running loss:220346417.3125, average loss:933671.2597987289, current lr:2.95e-05, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:2, running loss:194686806.8125, average loss:824944.0966631356, current lr:4.425e-05, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:3, running loss:193793209.1875, average loss:821157.6660487289, current lr:5.9e-05, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:4, running loss:196869593.96875, average loss:834193.194782839, current lr:7.375e-05, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:5, running loss:191893224.75, average loss:813106.8845338983, current lr:8.85e-05, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:6, running loss:185908305.6875, average loss:787747.0579978813, current lr:0.00010325, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.99it/s]
epoch:7, running loss:181063116.21875, average loss:767216.5941472457, current lr:0.000118, num_pruned:0 (0%)

Is the mixture input type working?

rishikksh20 commented 5 years ago

OK, the issue has been resolved:

using noam learning rate decay
no checkpoint specified as --checkpoint argument, creating new model...
100%|██████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:0, running loss:1375.3774342536926, average loss:5.827870484125817, current lr:1.475e-05, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:1, running loss:914.5991067886353, average loss:3.875419944019641, current lr:2.95e-05, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:2, running loss:847.640928030014, average loss:3.591698847584805, current lr:4.425e-05, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.98it/s]
epoch:3, running loss:834.7676503658295, average loss:3.5371510608721586, current lr:5.9e-05, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:4, running loss:836.0202960968018, average loss:3.5424588817661093, current lr:7.375e-05, num_pruned:0 (0%)
100%|██████████| 236/236 [00:53<00:00,  4.97it/s]
epoch:5, running loss:839.1563384532928, average loss:3.555747196835987, current lr:8.85e-05, num_pruned:0 (0%)

@geneing Please change https://github.com/geneing/WaveRNN-Pytorch/blob/7b317c4d930ad8b7405e72c5ace7b6481bdc6f2b/distributions.py#L136 to `return -torch.mean(log_sum_exp(log_probs))`. But one question remains; please respond as per your experience @G-Wang @geneing: does mixture perform better than bits or gaussian? In my case, I am using Tacotron 2 as the TTS front-end and training WaveRNN with GTA mels.
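
For context, a sketch of the suggested change (the commented-out "before" line is my assumption about what distributions.py#L136 contained; a sum-reduction would explain the ~1e6 average losses above):

```python
# Before (assumed): sum-reduction, which scales with batch and sequence length:
#   return -torch.sum(log_sum_exp(log_probs))
# After: mean-reduction, giving a per-element loss on the order of a few nats:
return -torch.mean(log_sum_exp(log_probs))
```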

G-Wang commented 5 years ago

@rishikksh20 Using my own TTS front-end, I've found mu-law 10-bit does well enough for me.
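
For reference, a minimal NumPy sketch of what 10-bit mu-law companding looks like (illustrative only, not this repo's exact implementation):

```python
import numpy as np

def mulaw_encode(x, bits=10):
    """Compand audio in [-1, 1] to 2**bits discrete classes."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)  # classes 0..mu

def mulaw_decode(c, bits=10):
    """Invert mulaw_encode back to audio in [-1, 1]."""
    mu = 2 ** bits - 1
    y = 2 * c.astype(np.float64) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```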

rishikksh20 commented 5 years ago

@G-Wang If you don't mind, could you tell me which Tacotron implementation you are using and how many hours of data work fine for you? In my case, I coded my own Tacotron, but the structure is similar to this Tacotron-2, and I have 36 hours of a male voice. However, I am struggling to train WaveRNN with GTA; even after 1 million steps I still get lots of loud noise. My hparams are the following:

    num_mels = 80,
    rescale = True,
    rescaling_max = 0.999,
    trim_silence = True,

    fft_size = 1024,
    hop_size = 256,
    sample_rate = 22050,
    frame_shift_ms = None,

    signal_normalization = True,
    allow_clipping_in_normalization = True,
    symmetric_mels = True,
    max_abs_value = 4.,

    # Limits
    min_level_db = -100,
    ref_level_db = 20,
    fmin = 125,
    fmax = 7600,

It would be a great pleasure if you could help me a bit.

G-Wang commented 5 years ago

I'm using Tacotron 2 variants, training on audiobook datasets as well as LJSpeech. Have you looked into where the loud noises are coming from? Do you get these loud noises if you invert your TTS linear spectrogram with Griffin-Lim or LWS, or just by inspecting the generated spectrograms? If not, then perhaps you haven't matched up the mel features exactly between the TTS and WaveRNN?

rishikksh20 commented 5 years ago

@G-Wang My Tacotron 2 is trained on input with signal normalization in [0, 1]. By the way, thanks for your help.

G-Wang commented 5 years ago

@rishikksh20 Another thing to look into, if you haven't already, is exactly how preprocessing is done for your setup. Note that in my vocoder repo (not sure if geneing has changed it in his fork) I use lws to compute mel features in audio.py, because I prefer lws over Griffin-Lim as a vocoder. But I see other repos, like NVIDIA's Tacotron 2, use librosa to compute mel features. So if you want the TTS front-end to match the vocoder, make sure both are trained on the same mel/linear spectrograms, generated by either lws or librosa in audio preprocessing. Also note other things like preemphasis, etc., that occur in some audio preprocessing scripts but not in others.
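
A hedged sketch of the kind of mismatch to check for (the file name and parameters are illustrative; the repo's actual audio.py may differ):

```python
import librosa
import lws
import numpy as np

y, sr = librosa.load("sample.wav", sr=22050)

# lws-style STFT, as used where audio.py computes features with lws:
stft_lws = lws.lws(1024, 256, mode="speech").stft(y)

# librosa-style STFT, as used in e.g. NVIDIA's Tacotron 2:
stft_lr = librosa.stft(y, n_fft=1024, hop_length=256).T

# Different windowing/padding conventions mean mel features derived from
# these won't match frame-for-frame; train the TTS front-end and the
# vocoder on features from the SAME pipeline.
print(np.abs(stft_lws).shape, np.abs(stft_lr).shape)
```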