NVIDIA / tacotron2

Tacotron 2 - PyTorch implementation with faster-than-realtime inference
BSD 3-Clause "New" or "Revised" License

Argument #4: Padding size should be less than the corresponding input dimension #113

Closed mahdeto closed 5 years ago

mahdeto commented 5 years ago

I am seeing this error:

Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (512, 512) at dimension 3 of input [1, 1, 1, 220]

after trying to train using this command: python train.py --output_directory=outdir --log_directory=logdir

I am using PyTorch 1.0 and Python 3.6 on a single Tesla V100 GPU. I am training on my own dataset, which I processed to be identical to the LJSpeech format, and I changed the filelists accordingly.

The full log is:

FP16 Run: False
Dynamic Loss Scaling: True
Distributed Run: False
cuDNN Enabled: True
cuDNN Benchmark: False
Epoch: 0
Train loss 0 28.056627 Grad Norm 7.857424 5.01s/it
Traceback (most recent call last):
  File "train.py", line 284, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 242, in train
    hparams.distributed_run, rank)
  File "train.py", line 133, in validate
    for i, batch in enumerate(val_loader):
  File "/home/mahdeto/anaconda2/envs/tacotron2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 637, in __next__
    return self._process_next_batch(batch)
  File "/home/mahdeto/anaconda2/envs/tacotron2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/mahdeto/anaconda2/envs/tacotron2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/mahdeto/anaconda2/envs/tacotron2/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/raid1/mahdeto/repo/tacotron2/data_utils.py", line 61, in __getitem__
    return self.get_mel_text_pair(self.audiopaths_and_text[index])
  File "/raid1/mahdeto/repo/tacotron2/data_utils.py", line 34, in get_mel_text_pair
    mel = self.get_mel(audiopath)
  File "/raid1/mahdeto/repo/tacotron2/data_utils.py", line 46, in get_mel
    melspec = self.stft.mel_spectrogram(audio_norm)
  File "/raid1/mahdeto/repo/tacotron2/layers.py", line 76, in mel_spectrogram
    magnitudes, phases = self.stft_fn.transform(y)
  File "/raid1/mahdeto/repo/tacotron2/stft.py", line 88, in transform
    mode='reflect')
  File "/home/mahdeto/anaconda2/envs/tacotron2/lib/python3.6/site-packages/torch/nn/functional.py", line 2685, in pad
    ret = torch._C._nn.reflection_pad2d(input, pad)
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (512, 512) at dimension 3 of input [1, 1, 1, 220]

Please help. Thanks!

mahdeto commented 5 years ago

In case someone runs into this: it turns out I had some corrupted wav files (very short, but containing some speech) that caused this.
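For context on why short clips fail: `stft.py` reflect-pads the waveform by `filter_length // 2` samples on each side (512 with the default `filter_length=1024`), and PyTorch's reflection padding requires the pad to be strictly smaller than the input size, so any clip of 512 samples or fewer (like the 220-sample input in the traceback) raises this error. A minimal pre-flight scan for such clips could look like the sketch below; the helper name and threshold constant are illustrative, not part of this repo.

```python
import wave

# Must exceed filter_length // 2 (512 with the repo's default hparams).
MIN_SAMPLES = 513

def too_short(path, min_samples=MIN_SAMPLES):
    """Return True if the wav file has too few samples for reflect padding."""
    with wave.open(path, "rb") as w:
        return w.getnframes() < min_samples
```

Running this over every path in the training and validation filelists before starting `train.py` would flag the offending files up front.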

ican24 commented 4 years ago

This can also happen when a wav file is stereo rather than mono. Tacotron 2 expects 16-bit mono audio at the sampling rate configured in hparams.
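A quick stdlib-only sanity check for channel count, bit depth, and sample rate could look like this; the function name and the default rate (22050 Hz, this repo's hparams default) are assumptions for illustration, and the expected rate should match whatever `hparams.sampling_rate` is set to.

```python
import wave

def is_training_ready(path, expected_rate=22050):
    """Hypothetical check: True only for 16-bit mono wavs at the expected rate."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1        # mono, not stereo
                and w.getsampwidth() == 2    # 16-bit samples
                and w.getframerate() == expected_rate)
```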

RESDXChgfore9hing commented 3 years ago

I have the same problem now. VLC reports the file as: Codec: PCM S16LE (s16l), Type: Audio, Channels: Mono, Sample rate: 16000 Hz, Bits per sample: 16.

Maybe it is a codec problem? The info above is from VLC. Can anyone share a known-working setup?

Quite confusing; I also edited the hparams file accordingly, but that does not seem to work.

The audio was originally 22 kHz; it is now 16 kHz 16-bit mono for training, but that does not seem to work either. Hmm.
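If the files are stereo, one dependency-free option is to average the two channels into mono while leaving the sample rate untouched, and then set `hparams.sampling_rate` to match the files rather than resampling. The sketch below assumes 16-bit PCM input; the function and file names are hypothetical.

```python
import struct
import wave

def stereo_to_mono(src, dst):
    """Downmix a 16-bit stereo wav to mono by averaging left and right."""
    with wave.open(src, "rb") as r:
        assert r.getnchannels() == 2 and r.getsampwidth() == 2
        n = r.getnframes()
        rate = r.getframerate()
        # Interleaved samples: L0, R0, L1, R1, ...
        data = struct.unpack("<%dh" % (n * 2), r.readframes(n))
    mono = [(data[i] + data[i + 1]) // 2 for i in range(0, len(data), 2)]
    with wave.open(dst, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)  # rate unchanged; adjust hparams to match
        w.writeframes(struct.pack("<%dh" % len(mono), *mono))
```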

RESDXChgfore9hing commented 3 years ago

Oh, thanks. I also suspected those weirdly small files; they are not corrupted as such, but something goes wrong when they are used as input. Solved, thanks!

jazz215 commented 3 years ago

In case someone runs into this. Turns out I had some corrupted wav files (very small but had some speech) that caused this.

I am also getting this error at inference time:

RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (512, 512) at dimension 3 of input 4

What do you mean by corrupted wav files? When I used 16 kHz files I got a sample-rate error, so I had to switch back to 22 kHz. My inference command is:

python inference.py --tacotron2 output/checkpoint_Tacotron2_last.pt --waveglow output/checkpoint_WaveGlow_last.pt --cpu -o output/ -i phrases1/phrase.txt

I can run the same inference with the pretrained models, but not with my own trained models.