Closed jmasterx closed 4 years ago
Hi, it could be that the older PyTorch does not autocast the values in l1_loss. I just pushed an explicit cast to master that should fix this; could you pull and try again?
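A minimal sketch of what such an explicit cast looks like (variable names here are illustrative, not taken from the repo): older PyTorch versions raise a dtype error when the target of l1_loss is an integer tensor, so the target is cast to float before the loss call.

```python
import torch
import torch.nn.functional as F

pred = torch.randn(2, 5)                    # model output (float)
dur_target = torch.randint(0, 10, (2, 5))   # durations arrive as int64

# Explicit cast: without .float(), older PyTorch fails inside l1_loss
loss = F.l1_loss(pred, dur_target.float())
```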
Hi
Thank you for the fast response.
I pulled master and tried but got this:
Traceback (most recent call last):
File "train_forward.py", line 98, in <module>
trainer.train(model, optimizer)
File "D:\speech\ForwardTacotron-master\ForwardTacotron-master\trainer\forward_trainer.py", line 37, in train
self.train_session(model, optimizer, session)
File "D:\speech\ForwardTacotron-master\ForwardTacotron-master\trainer\forward_trainer.py", line 71, in train_session
loss.backward()
File "C:\Users\Josh\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\tensor.py", line 118, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\Josh\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\autograd\__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: unspecified launch failure
Do I need to reprocess and retrain the first network, or should it be working as-is?
Hi
I think I may be running out of VRAM; I have an RTX 2070 with 8 GB, so I might need to lower the batch size.
It works in CPU mode, so I will try reducing the batch size. Thank you!
A batch size of 4 works; 8, 16, and 32 do not. Does a lower batch size affect quality, or does it just take longer to train?
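When a GPU forces small batches, gradient accumulation is a common way to recover the effective batch size; here is a hedged sketch (the model, optimizer, and data are placeholders, not ForwardTacotron code), accumulating 8 micro-batches of size 4 to approximate batch size 32.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters())
accum_steps = 8  # 8 micro-batches of 4 ~ one batch of 32

# Dummy data standing in for the real loader
data = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(16)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data, 1):
    # Scale the loss so the accumulated gradient matches one large batch
    loss = F.l1_loss(model(x), y) / accum_steps
    loss.backward()
    if step % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```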
Hmm, it might not be a VRAM issue after all... even at batch size 4 it does not get through an epoch before giving the same error as before, and VRAM usage is only at 4 GB.
It seems to be a bug in cuDNN; disabling cuDNN is very slow but works: https://github.com/pytorch/pytorch/issues/27588
Adding torch.autograd.set_detect_anomaly(True) also fixes the issue. It is still a bit slower than normal, but much faster than disabling cuDNN.
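For reference, the two workarounds mentioned above are each a one-line switch set before training starts:

```python
import torch

# Workaround 1: disable cuDNN entirely -- reliable but very slow
torch.backends.cudnn.enabled = False

# Workaround 2: keep cuDNN on and enable anomaly detection instead --
# slower than normal, but much faster than running without cuDNN
torch.backends.cudnn.enabled = True
torch.autograd.set_detect_anomaly(True)
```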
I could imagine that upgrading PyTorch/nvcc would help, but I understand that can be quite cumbersome.
Upgrading PyTorch to 1.5.1 and installing the latest NVIDIA drivers produced the same result. It does not seem to happen on 2080-series cards, just lower-tier ones like mine. No big deal though, 1.5 steps/sec is much better than the 0.26 I got without cuDNN at all!
Sorry to hear that. 1.5 steps/s is not bad, although it should be around 3 for batch size 32.
Hi
I am using Windows 10, PyTorch 1.2, Python 3.7, and all other required libs. I was able to generate the preprocessed data, fully train the Tacotron model, and generate the GTAs. But now when I come to train the forward network, it looks like the dur parameter in
for i, (x, m, ids, lens, dur) in enumerate(session.train_set, 1):
is an int tensor while a float tensor is expected. The dur tensor looks like:
Would you have any ideas what could cause this and how to address it?
Thank you