Tesla T4
amp opt level O1
torch 1.7.0
cuda 11.0
batch size 104
--warm_start
Epoch: 6
Train loss 174 0.163115 Grad Norm 0.404827 5.34s/it
Train loss 175 0.189592 Grad Norm 0.491446 4.42s/it
Traceback (most recent call last):
  File "train.py", line 296, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 226, in train
    scaled_loss.backward()
  File "/home/black/.local/lib/python3.6/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/black/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED