deepsound-project / samplernn-pytorch

PyTorch implementation of SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
MIT License

Division by Zero when training #12

Closed: LukeB42 closed this issue 6 years ago

LukeB42 commented 6 years ago
  File "samplernn-pytorch/trainer/__init__.py", line 45, in call_plugins
    getattr(plugin, queue_name)(*args)
  File "/usr/local/lib/python3.6/site-packages/torch/utils/trainer/plugins/monitor.py", line 56, in epoch
    stats['epoch_mean'] = epoch_stats[0] / epoch_stats[1]
ZeroDivisionError: division by zero

This is with PyTorch 0.3.0.post4.

sbl commented 6 years ago

Same behavior here:

  File "train.py", line 337, in <module>
    main(**vars(parser.parse_args()))
  File "train.py", line 235, in main
    trainer.run(params['epoch_limit'])
  File "/home/stephen/src/samplernn-pytorch/trainer/__init__.py", line 57, in run
    self.call_plugins('epoch', self.epochs)
  File "/home/stephen/src/samplernn-pytorch/trainer/__init__.py", line 44, in call_plugins
    getattr(plugin, queue_name)(*args)
  File "/home/stephen/anaconda3/lib/python3.6/site-packages/torch/utils/trainer/plugins/monitor.py", line 56, in epoch
    stats['epoch_mean'] = epoch_stats[0] / epoch_stats[1]
ZeroDivisionError: division by zero

koz4k commented 6 years ago

Duplicate of #10.

The problem is that for validation we discard the last (incomplete) minibatch so it doesn't skew the result: it can be smaller than the rest, and we average the loss over minibatches with equal weights. Specifically, if you only have one minibatch, the code ends up averaging over an empty set, hence the division by zero. This could be handled better, and we're planning to do that in the near future.
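
To make the arithmetic concrete, here is a minimal illustrative sketch of the failure mode (not the repo's actual code; the chunk count and batch size below are made up):

# Illustrative only: why the epoch mean in monitor.py can divide by zero.
n_chunks = 50     # hypothetical number of audio chunks in the split
batch_size = 64   # larger than the number of available chunks

# The last, incomplete minibatch is dropped, so only full batches count.
n_full_batches = n_chunks // batch_size       # == 0 here

epoch_loss_sum = 0.0                          # no batches ever ran
epoch_mean = epoch_loss_sum / n_full_batches  # ZeroDivisionError, as in the traceback

Running this raises the same ZeroDivisionError: with fewer chunks than one full batch, the epoch monitor has nothing to average over.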

LukeB42 commented 6 years ago

@koz4k Thanks for the response, but what do you suggest for fixing this myself in the meantime?

Returning early when args is empty doesn't work, and wrapping the function body in a try/except causes the program to exit after roughly 1,000 exceptions.

koz4k commented 6 years ago

Sorry, I was wrong: this is related to the size of the training set, not the validation set. Either way, the solution is to lower the batch size or use a bigger dataset. I would recommend a bigger dataset, because with such a small one you might not be able to achieve good results anyway.

LukeB42 commented 6 years ago

@koz4k OK, thanks for explaining that.

LukeB42 commented 6 years ago

@koz4k Following your suggestion, I'm running

python train.py --exp TEST --frame_sizes 16 4 --n_rnn 2 --dataset custom --batch_size 64

and getting the following error:

Traceback (most recent call last):
  File "train.py", line 360, in <module>
    main(**vars(parser.parse_args()))
  File "train.py", line 258, in main
    trainer.run(params['epoch_limit'])
  File "pytorch-samplernn/trainer/__init__.py", line 56, in run
    self.train()
  File "pytorch-samplernn/trainer/__init__.py", line 61, in train
    enumerate(self.dataset, self.iterations + 1):
  File "pytorch-samplernn/dataset.py", line 51, in __iter__
    for batch in super().__iter__():
  File "/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 188, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/usr/local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 96, in default_collate
    return torch.stack(batch, 0, out=out)
  File "/usr/local/lib/python3.6/site-packages/torch/functional.py", line 64, in stack
    return torch.cat(inputs, dim)
RuntimeError: inconsistent tensor sizes at /pytorch/torch/lib/TH/generic/THTensorMath.c:2864

What do you suggest I do to fix this for the time being?

comeweber commented 6 years ago

Are you sure that all the .wav files in your dataset directory have the same duration?
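
(For anyone following along, here is a quick way to check that. It is an illustrative sketch, not part of the repo, and assumes uncompressed PCM .wav chunks readable by the standard-library wave module; the directory is taken from the command line.)

# Illustrative check: print the sample count of every .wav chunk in a directory.
# default_collate can only stack the chunks if all counts are identical.
import sys
import wave
from pathlib import Path

counts = {}
for path in sorted(Path(sys.argv[1]).glob("*.wav")):
    with wave.open(str(path), "rb") as f:
        counts[path.name] = f.getnframes()

if len(set(counts.values())) > 1:
    print("unequal chunk lengths:")
    for name, n in sorted(counts.items(), key=lambda kv: kv[1]):
        print(f"  {name}: {n} samples")
else:
    print(f"all {len(counts)} chunks have the same length")

Run it against your dataset directory, e.g. python check_chunks.py path/to/your/dataset (the script name and path are placeholders).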

LukeB42 commented 6 years ago

@comeweber @koz4k Many thanks for your help, both of you. It's now training stably, using .wav files that are 8 seconds long and a --batch_size of 32.

niuqun commented 6 years ago

@LukeB42 Could you share the file structure of your custom dataset folder? I cannot use youtube-dl to generate the training data right now, so I downloaded an audio file myself. Although I have 8-second chunks, training fails with the following error:

Traceback (most recent call last):
  File "train.py", line 360, in <module>
    main(**vars(parser.parse_args()))
  File "train.py", line 258, in main
    trainer.run(params['epoch_limit'])
  File "/root/Documents/samplernn-pytorch-master/trainer/__init__.py", line 56, in run
    self.train()
  File "/root/Documents/samplernn-pytorch-master/trainer/__init__.py", line 61, in train
    enumerate(self.dataset, self.iterations + 1):
  File "/root/Documents/samplernn-pytorch-master/dataset.py", line 51, in __iter__
    for batch in super().__iter__():
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 264, in __next__
    batch = self.collate_fn([self.dataset[i] for i in indices])
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 115, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 353344 and 352320 in dimension 1 at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/TH/generic/THTensorMath.c:3586

koz4k commented 6 years ago

You most likely have chunks that are not exactly equal in length; many tools for chunking audio files do that. You can use ffmpeg, which cuts the files cleanly. See the downloading script for an example.
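
For anyone landing here later, below is a rough sketch of that kind of ffmpeg-based chunking. It is not the repo's actual downloading script: the input file name, output directory, and 8-second chunk length are placeholders, and ffmpeg is invoked through Python only to stay consistent with the snippets above.

# Illustrative sketch: split one long recording into 8-second chunks using
# ffmpeg's segment muxer, then drop the final, usually shorter, chunk.
import subprocess
from pathlib import Path

src = Path("long_recording.wav")   # hypothetical input file
out_dir = Path("chunks")           # adjust to wherever your dataset lives
out_dir.mkdir(parents=True, exist_ok=True)

subprocess.run(
    ["ffmpeg", "-i", str(src),
     "-f", "segment", "-segment_time", "8",
     str(out_dir / "%06d.wav")],
    check=True,
)

# The last segment holds whatever audio was left over, so remove it; then
# verify that the remaining chunks really have identical sample counts
# (e.g. with the wave-module check earlier in this thread).
chunks = sorted(out_dir.glob("*.wav"))
if chunks:
    chunks[-1].unlink()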