I tried a 90-epoch training run as a sanity check on a 3090 GPU with torch 1.13, and hit this exception:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 8, 256]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
So I removed all the `inplace=True` flags in this repo (they are all on Dropout layers, not the ReLU mentioned in the traceback), and the problem went away. It may slow things down slightly, but that's better than a crash. A minimal sketch of what goes wrong is below.
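For context, here is a small illustrative example (not the repo's actual code) of how a `Dropout(inplace=True)` placed right after a ReLU can produce exactly this error, since it overwrites the ReLU output that autograd saved for the backward pass:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8, requires_grad=True)
relu = nn.ReLU()
drop_inplace = nn.Dropout(p=0.5, inplace=True)   # overwrites relu's output -> RuntimeError on backward
drop_safe = nn.Dropout(p=0.5, inplace=False)     # the fix: Dropout returns a new tensor

y = relu(x)            # autograd saves this output for ReluBackward0
out = drop_safe(y)     # swap in drop_inplace here to reproduce the version-mismatch error
out.sum().backward()
print(x.grad.shape)
```

With `drop_inplace`, the saved ReLU output is bumped to version 1 before backward runs, matching the "output 0 of ReluBackward0, is at version 1; expected version 0" message above; with `inplace=False` the backward pass completes normally.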
Leaving a note here in case someone else runs into the same problem.