as-ideas / ForwardTacotron

⏩ Generating speech in a single forward pass without any attention!
https://as-ideas.github.io/ForwardTacotron/
MIT License

RuntimeError: Expected object of scalar type Float but got scalar type Int for argument #2 'target' When training forward network #14

Closed jmasterx closed 4 years ago

jmasterx commented 4 years ago

Hi

I am using Windows 10, PyTorch 1.2, Python 3.7, and all the other required libraries. I was able to run the preprocessing, fully train the Tacotron model, and generate the GTA features. But now that I have come to train the forward network, it looks like the dur tensor coming out of the train loop (for i, (x, m, ids, lens, dur) in enumerate(session.train_set, 1)) is an int tensor, while F.l1_loss expects a float tensor:

  File "train_forward.py", line 98, in <module>
    trainer.train(model, optimizer)
  File "D:\speech\ForwardTacotron-master\ForwardTacotron-master\trainer\forward_trainer.py", line 37, in train
    self.train_session(model, optimizer, session)
  File "D:\speech\ForwardTacotron-master\ForwardTacotron-master\trainer\forward_trainer.py", line 67, in train_session
    dur_loss = F.l1_loss(dur_hat, dur)
  File "C:\Users\Josh\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\functional.py", line 2165, in l1_loss
    ret = torch._C._nn.l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: Expected object of scalar type Float but got scalar type Int for argument #2 'target'

The dur tensor looks like this:

tensor([[ 0,  6, 14,  ...,  0,  0,  0],
        [ 0,  8,  9,  ...,  0,  0,  0],
        [ 0,  5,  8,  ...,  0,  0,  0],
        ...,
        [ 8,  9, 16,  ...,  0,  0,  0],
        [ 0,  8,  8,  ...,  0,  0,  0],
        [ 0,  6, 12,  ...,  0,  0,  0]], dtype=torch.int32)
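
For reference, the error is reproducible in isolation with a float prediction and an int32 target (a minimal sketch, independent of the trainer code):

    import torch
    import torch.nn.functional as F

    dur_hat = torch.rand(2, 5)                             # predicted durations, float32
    dur = torch.randint(0, 20, (2, 5), dtype=torch.int32)  # target durations, int32 like above

    F.l1_loss(dur_hat, dur)  # raises the dtype RuntimeError above on PyTorch 1.2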

Would you have any ideas what could cause this and how to address it?

Thank you

cschaefer26 commented 4 years ago

Hi, it could be that the older PyTorch version does not automatically cast the values in l1_loss. I just pushed an explicit cast to master that should fix this; could you pull and try again?
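
For reference, the cast amounts to converting the duration target to float before computing the loss, roughly like this (paraphrased, not the exact diff):

    # trainer/forward_trainer.py, train_session (paraphrased)
    dur = dur.float()                    # durations arrive from the dataloader as int32
    dur_loss = F.l1_loss(dur_hat, dur)   # both tensors are now float32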

jmasterx commented 4 years ago

Hi

Thank you for the fast response.

I pulled master and tried but got this:

Traceback (most recent call last):
  File "train_forward.py", line 98, in <module>
    trainer.train(model, optimizer)
  File "D:\speech\ForwardTacotron-master\ForwardTacotron-master\trainer\forward_trainer.py", line 37, in train
    self.train_session(model, optimizer, session)
  File "D:\speech\ForwardTacotron-master\ForwardTacotron-master\trainer\forward_trainer.py", line 71, in train_session
    loss.backward()
  File "C:\Users\Josh\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\Josh\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\autograd\__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: unspecified launch failure

Do I need to re-run preprocessing and retrain the first network, or should it work as-is?

jmasterx commented 4 years ago

Hi

I think I might be running out of VRAM. I have an RTX 2070 with 8 GB, so I may need to lower the batch size.

jmasterx commented 4 years ago

It is working in CPU mode; I will try reducing the batch size. Thank you!

jmasterx commented 4 years ago

A batch size of 4 works; 8, 16, and 32 do not. Does a lower batch size affect quality, or does it just take longer to train?

jmasterx commented 4 years ago

Hmm, it might not be a VRAM issue after all. Even at a batch size of 4 it does not get through an epoch before failing with the same error, and VRAM usage is only around 4 GB.

jmasterx commented 4 years ago

This seems to be a bug in cuDNN. Disabling it is very slow, but it works: https://github.com/pytorch/pytorch/issues/27588
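
For reference, disabling cuDNN is a one-liner near the top of the training script (placement is my own choice, not from the repo):

    import torch

    # global cuDNN kill switch: avoids the crashing kernel at a large speed cost
    torch.backends.cudnn.enabled = False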

jmasterx commented 4 years ago

Adding torch.autograd.set_detect_anomaly(True) works around the issue. It is still a bit slower than normal, but much faster than disabling cuDNN entirely.
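
Concretely, that is a single line before training starts (I put it near the top of train_forward.py):

    import torch

    # anomaly detection adds extra checks to every backward pass (hence the slowdown),
    # and as a side effect it avoids the failing cuDNN launch here
    torch.autograd.set_detect_anomaly(True)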

cschaefer26 commented 4 years ago

I could imagine that upgrading PyTorch/nvcc would help, but I understand that can be quite cumbersome.

jmasterx commented 4 years ago

Upgrading PyTorch to 1.5.1 and installing the latest NVIDIA drivers produced the same result. It does not seem to happen on 2080-series cards, just lower-tier ones like mine. No big deal though; 1.5 steps/sec is much better than the 0.26 I get without cuDNN at all!

cschaefer26 commented 4 years ago

Sorry to hear that. 1.5 steps/s is not bad, although it should be around 3 for a batch size of 32.