NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

error in the model training on libritts data #76

Closed raymond00000 closed 4 years ago

raymond00000 commented 4 years ago

Hi, I tried to reproduce the model training on the LibriTTS data.

I downloaded the LibriTTS dataset.

I updated the paths in the filelists.

I changed hparams.py:

        training_files='filelists/libritts_train_clean_100_audiopath_text_sid_shorterthan10s_atleast5min_train_filelist.txt',
        validation_files='filelists/libritts_train_clean_100_audiopath_text_sid_atleast5min_val_filelist.txt',
        sampling_rate=24000,

Then I started the training, but I got the weird error below.

RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0

But using soxi -d, I did not find any audio with length <= 0 in the filelist; for example, /8312/279790/8312_279790_000034_000002.wav is 8.830000s.

Did anyone face this error and know how to resolve it? Many thanks for any advice.

Train loss 39 2.412151 Grad Norm 5.906334 3.40s/it
Traceback (most recent call last):
  File "train.py", line 286, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 210, in train
    y_pred = model(x)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/temp/mellotron/model.py", line 604, in forward
    embedded_gst = self.gst(targets, output_lengths)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/temp/mellotron/modules.py", line 158, in forward
    enc_out = self.encoder(inputs, input_lengths=input_lengths)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/temp/mellotron/modules.py", line 75, in forward
    out, input_lengths, batch_first=True, enforce_sorted=False)
  File "/home/anaconda3/lib/python3.7/site-packages/torch/nn/utils/rnn.py", line 244, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0

(I had no problem when using the LJS (LJSpeech) data.)
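
(For reference, a minimal sketch of the kind of duration check described above, assuming the audiopath|text|speaker_id filelist format and plain PCM WAV files; it is not part of the repo:)

    # Sketch: scan a filelist for empty or suspiciously short clips.
    # Assumes lines formatted as "audiopath|text|speaker_id" and standard PCM WAVs.
    import wave

    FILELIST = "filelists/libritts_train_clean_100_audiopath_text_sid_shorterthan10s_atleast5min_train_filelist.txt"

    with open(FILELIST, encoding="utf-8") as f:
        for line in f:
            audiopath = line.strip().split("|")[0]
            with wave.open(audiopath, "rb") as w:
                duration = w.getnframes() / w.getframerate()
            if duration < 1.0:  # nothing here is <= 0, but very short clips get flagged too
                print(f"{audiopath}\t{duration:.3f}s")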

pneumoman commented 4 years ago

I have created a branch in my repo that seems to fix this https://github.com/pneumoman/mellotron/tree/Fix_Short_Input_Lengths

ustraymond commented 4 years ago

I have created a branch in my repo that seems to fix this https://github.com/pneumoman/mellotron/tree/Fix_Short_Input_Lengths

Thanks a lot for the help! I will try testing your fix.

But may I know what the problem is? Is it because the reference encoder requires a minimum mel length of 65 frames? Given roughly 1 mel frame = 12.5 ms, would any audio shorter than 65 x 12.5 ms ≈ 0.8 s trigger this short-input-length error?
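
(Rough arithmetic for that guess, assuming the ~12.5 ms per mel frame figure above rather than a value read from hparams.py:)

    # Back-of-the-envelope: minimum audio duration the reference encoder would need,
    # assuming ~12.5 ms per mel frame (an estimate, not read from hparams.py).
    hop_ms = 12.5
    min_frames = 65
    print(min_frames * hop_ms / 1000.0)  # ~0.81 s; shorter clips would trip the length check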

By the way, should this line:

        torch.LongTensor([len(x[0]) for x in batch]),

change x[0] to x[1], i.e.:

        torch.LongTensor([len(x[1]) for x in batch]),

where len(x[0]) gives the length of the text and len(x[1]) the length of the mel?
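
(For context, here is a rough sketch of what a Tacotron2-style collate step does with those tuples. It assumes each batch item is (text, mel, ...); it is an illustration, not the exact Mellotron code:)

    import torch

    def collate_lengths(batch):
        # Assumes each batch item is (text_ids, mel, ...), with text_ids a 1-D
        # LongTensor and mel shaped [n_mel_channels, n_frames].
        input_lengths = torch.LongTensor([len(x[0]) for x in batch])      # text lengths
        output_lengths = torch.LongTensor([x[1].size(1) for x in batch])  # mel lengths
        return input_lengths, output_lengths

If the master collate follows this pattern, len(x[0]) is indeed the text length, and the mel lengths are taken from x[1] separately.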

Thanks a lot.

pneumoman commented 4 years ago

They are using input_lengths, which I believe is the length of the encoded text. I don't really understand the change that was made in modules.py, but in implementing it there is a scaling step where input_length is divided by two raised to the power of the number of convolutions in the ReferenceEncoder. The result is then rounded, and in doing so some samples become zero. The call to nn.utils.rnn.pack_padded_sequence needs all lengths to be greater than zero, which causes the error. I don't know for sure, but a simpler fix might be to round up instead of rounding down.

Also, I made the limit 65 instead of 64 to ensure the included samples still had some length after scaling. I have a feeling the number might need to be even larger, but as I said, I don't really know what this change is achieving.

I did not change that line; it's the same as what's in master (line 120)?
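
(A minimal sketch of the scaling described above, assuming six stride-2 convolutions in the ReferenceEncoder, which matches the 64-frame boundary; the exact rounding in modules.py may differ:)

    import math

    def downsampled_length(n_frames, n_convs=6):
        # Six stride-2 convolutions shrink the time axis by roughly 2**6 = 64 overall.
        floor_len = n_frames // (2 ** n_convs)           # rounding down can reach 0
        ceil_len = math.ceil(n_frames / (2 ** n_convs))  # rounding up stays >= 1
        return floor_len, ceil_len

    print(downsampled_length(63))  # (0, 1): a zero length is what breaks pack_padded_sequence
    print(downsampled_length(65))  # (1, 2): safely above the 64-frame boundary

Rounding up (or enforcing a 65-frame minimum, as in the branch above) keeps every length positive.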

ustraymond commented 4 years ago

Yes, I am sorry, I made a mistake; you did not change that line, it is the same as in master.

You are right: "input_lengths, which I believe is the length of the encoded text".

Thanks a lot for explaining that the root cause of the error is "some samples become zero" in the code.

As for "I don't really know what this change is achieving": me neither, I am still reading the code and will update if I reach a new understanding.

Thanks so much!

pneumoman commented 4 years ago

@ustraymond Hey, wondering if you've seen this (I'm running a very bastardized version of Mellotron, so I'm not sure it's my fault), but I'm seeing an error in Tacotron2.parse_output.

Traceback (most recent call last):
  File "train.py", line 325, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 220, in train
    y_pred = model(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/mellotron/model.py", line 675, in forward
    output_lengths)
  File "/workspace/mellotron/model.py", line 646, in parse_output
    outputs[0].data.masked_fill_(mask, 0.0)
RuntimeError: The expanded size of the tensor (517) must match the existing size (460) at non-singleton dimension 2. Target sizes: [4, 80, 517]. Tensor sizes: [4, 80, 460]

Curious if you hit this too.
Note that the line numbers are probably wrong for you.
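
(A minimal standalone repro of that mismatch: masked_fill_ broadcasts the mask against the tensor, so a mask built from lengths whose max (460 here) differs from the padded mel width (517) cannot be expanded. Only the shapes are taken from the traceback; the rest is illustrative:)

    import torch

    mels = torch.zeros(4, 80, 517)                    # padded mel outputs, max length 517
    mask = torch.zeros(4, 80, 460, dtype=torch.bool)  # mask built from a different max length
    mels.masked_fill_(mask, 0.0)  # raises a RuntimeError like the one in the traceback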

raymond00000 commented 4 years ago

I tested your code fix:

I pulled the latest master and patched in your change to data_utils.py, using the libritts filelist.

It worked; it ran 3000 more steps without the length error.

In other words, I am sorry, I did not hit the problem you mentioned.

hongyuntw commented 3 years ago

They are using input_lengths, which I believe is the length of the encoded text. I don't really understand the change that was made in modules.py, but in implementing it there is a scaling step where input_length is divided by two raised to the power of the number of convolutions in the ReferenceEncoder. The result is then rounded, and in doing so some samples become zero. The call to nn.utils.rnn.pack_padded_sequence needs all lengths to be greater than zero, which causes the error. I don't know for sure, but a simpler fix might be to round up instead of rounding down.

Also, I made the limit 65 instead of 64 to ensure the included samples still had some length after scaling. I have a feeling the number might need to be even larger, but as I said, I don't really know what this change is achieving.

I did not change that line; it's the same as what's in master (line 120)?

@pneumoman Hello, I hit this error too. How did you fix it? Thanks. I guess the reason is that the batch size is not the same in each batch...?

pneumoman commented 3 years ago

@hongyuntw did you checkout the branch I referenced above?

jinhonglu commented 3 years ago

@ustraymond Hey, wondering if you've seen this (I'm running a very bastardized version of Mellotron, so I'm not sure it's my fault), but I'm seeing an error in Tacotron2.parse_output.

Traceback (most recent call last):
  File "train.py", line 325, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 220, in train
    y_pred = model(x)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/mellotron/model.py", line 675, in forward
    output_lengths)
  File "/workspace/mellotron/model.py", line 646, in parse_output
    outputs[0].data.masked_fill_(mask, 0.0)
RuntimeError: The expanded size of the tensor (517) must match the existing size (460) at non-singleton dimension 2. Target sizes: [4, 80, 517]. Tensor sizes: [4, 80, 460]

Curious if you hit this too. Note that the line numbers are probably wrong for you.

Hi, I have met this error while using the Blizzard2013 dataset. Have you fixed it? Another question: while reading the paper, I noticed the authors filtered out all audio longer than 10 s. Could this be the reason for the above problem?
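
(Not sure about Blizzard2013 specifically, but for reference, a minimal sketch of the kind of duration filtering the paper describes: keeping clips under 10 s and, given the discussion above, above the reference-encoder minimum. The bounds and file names here are illustrative assumptions, not values from the repo:)

    # Sketch: keep only clips whose duration fits between the reference-encoder minimum
    # and the 10 s cap from the paper. Assumes "audiopath|text|speaker_id" lines and
    # PCM WAVs; the filelist names below are placeholders.
    import wave

    MIN_S, MAX_S = 0.82, 10.0  # ~65 frames at 12.5 ms, and the paper's 10 s cap

    def duration_ok(line):
        audiopath = line.strip().split("|")[0]
        with wave.open(audiopath, "rb") as w:
            return MIN_S <= w.getnframes() / w.getframerate() <= MAX_S

    with open("filelists/blizzard_filelist.txt", encoding="utf-8") as src, \
         open("filelists/blizzard_filelist_filtered.txt", "w", encoding="utf-8") as dst:
        dst.writelines(line for line in src if duration_ok(line))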