NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

parse_output error with Blizzard2013 data #104

Open jinhonglu opened 2 years ago

jinhonglu commented 2 years ago

Hi, I am trying to train Mellotron on the Blizzard2013 dataset. I segmented the audio with an alignment tool, and each resulting clip is about 15-25 s long.

However, training fails in parse_output with the following error:

Traceback (most recent call last):
  File "train.py", line 286, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 210, in train
    y_pred = model(x)
  File "Desktop/py3_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "Desktop/py3_env/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "Desktop/py3_env/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "Desktop/py3_env/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
  File "Desktop/py3_env/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 5.
Original Traceback (most recent call last):
  File "Desktop/py3_env/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "Desktop/py3_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "Desktop/PDAEmotion/mellotron/model.py", line 632, in forward
  File "Desktop/PDAEmotion/mellotron/model.py", line 603, in parse_output
    outputs[0].data.masked_fill_(mask, 0.0)
RuntimeError: The expanded size of the tensor (891) must match the existing size (349) at non-singleton dimension 2.  Target sizes: [16, 80, 891].  Tensor sizes: [16, 80, 349]

From the paper, I understand that the original implementation uses audio clips shorter than 10 s. Is this error caused by the length of the clips in my dataset, or by something else?

How should I fix this?

Also, I changed some of the code to support multiple GPUs with DataParallel:

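import torch
from numpy import finfo
from torch.nn import DataParallel
# Tacotron2 is the model class defined in mellotron's model.py

def load_model(hparams):
    # Primary device; DataParallel gathers the replica outputs here
    device = torch.device('cuda:4')
    model = Tacotron2(hparams).to(device)
    if hparams.fp16_run:
        model.decoder.attention_layer.score_mask_value = finfo('float16').min

    # Replicate the model on GPUs 4 and 5 when more than one GPU is visible
    if torch.cuda.device_count() > 1:
        model = DataParallel(model, device_ids=[4, 5])
    return model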
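
Since the error is raised inside a DataParallel replica, I also tried to reproduce the shape mismatch in isolation. The sketch below is only my guess at the mechanism, not the actual Mellotron code: it assumes the padding mask is sized from the longest length in the chunk a replica receives (349 here), while the mel tensor is still padded to the full-batch maximum (891). The helper names `mask_from_lengths` and `pad_mask` and the `max_len` argument are hypothetical.

```python
import torch


def mask_from_lengths(lengths, max_len=None):
    """Boolean mask of shape [B, T] that is True for valid (non-padded) frames."""
    if max_len is None:
        # Assumed default: size the mask from the longest item in this chunk
        max_len = int(lengths.max())
    ids = torch.arange(max_len, device=lengths.device)
    return ids.unsqueeze(0) < lengths.unsqueeze(1)


def pad_mask(lengths, n_mel, max_len=None):
    """Mask of shape [B, n_mel, T] that is True on padded frames."""
    mask = ~mask_from_lengths(lengths, max_len)
    return mask.unsqueeze(1).expand(-1, n_mel, -1)


n_mel = 80
# Hypothetical chunk seen by one replica: mels padded to 891 frames,
# but the longest clip in this chunk has only 349 frames.
mels = torch.randn(16, n_mel, 891)
lengths = torch.randint(100, 350, (16,))
lengths[0] = 349

try:
    # Mask is [16, 80, 349], tensor is [16, 80, 891]: not broadcastable
    mels.clone().masked_fill_(pad_mask(lengths, n_mel), 0.0)
    mismatch = False
except RuntimeError:
    mismatch = True

# Sizing the mask from the padded tensor itself avoids the mismatch
fixed = mels.clone().masked_fill_(
    pad_mask(lengths, n_mel, max_len=mels.size(2)), 0.0
)
```

If this is indeed what happens, the clip length would matter only indirectly (longer clips make the per-chunk and full-batch maxima differ more), and the mask would need to be sized from the padded tensor rather than from the per-replica lengths. I have not confirmed this against model.py.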

Thank you.