Closed stanleyshly closed 2 years ago
Hi @stanleyshly
I haven't tested the code with any PyTorch older than 1.5. I'm not sure why this is happening.
@imisra I located the cause of the error above: it was an in-place error in the dropout layer. I created a pull request that "fixes" the issue when running with PyTorch 1.10. Unfortunately, the fix is noticeably slower (1.5x to 2x time per batch). The error appears to come from inplace=True; once I set inplace=False, it works okay.
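For anyone who wants to reproduce this outside the repo, here is a minimal sketch of the pattern described above. The layer sizes and the helper name `backward_succeeds` are made up for illustration and are not taken from the project's code; the only assumption is a ReLU followed directly by a dropout layer. On PyTorch 1.10 or later I would expect the inplace=True variant to fail in backward and the inplace=False variant to pass, but the exact behavior may depend on the PyTorch version.

```python
import torch
import torch.nn as nn


def backward_succeeds(inplace: bool) -> bool:
    """Run one forward/backward pass through Linear -> ReLU -> Dropout.

    Returns True if backward() completes, False if autograd raises.
    The layer sizes are arbitrary placeholders, not the model's real ones.
    """
    torch.manual_seed(0)
    block = nn.Sequential(
        nn.Linear(8, 8),
        nn.ReLU(),                           # out-of-place ReLU
        nn.Dropout(p=0.5, inplace=inplace),  # may overwrite the ReLU output
    )
    block.train()  # dropout only modifies its input in training mode
    x = torch.randn(4, 8)
    try:
        block(x).sum().backward()
    except RuntimeError as err:
        print(f"inplace={inplace}: backward failed: {err}")
        return False
    return True
```

Calling `backward_succeeds(True)` and `backward_succeeds(False)` side by side on a given PyTorch install should show whether the in-place dropout is the culprit there.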
This happens with the latest PyTorch (1.10), not an older version. The exact error (with set_detect_anomaly set to True) is below, with minor changes because I broke apart a chained layer call to isolate which layer was causing the issue:
See some generic discussion about this: https://discuss.pytorch.org/t/solved-pytorch1-5-runtimeerror-one-of-the-variables-needed-for-gradient-computation-has-been-modified-by-an-inplace-operation/90256/17
Similar-ish issue fixed by Hugging Face: https://github.com/huggingface/transformers/pull/13613
I think it's due to this issue/PR in PyTorch 1.10: https://github.com/pytorch/pytorch/pull/63089 and https://github.com/pytorch/pytorch/issues/63027
There are also some trivial changes to pointnet required to get it to compile with the latest PyTorch/CUDA.
I saw @stanleyshly posted in the PyTorch forums and there is some relevant discussion there: https://discuss.pytorch.org/t/inplace-errors-with-dropout-layers-with-pytorch-1-9-but-not-with-pytorch-1-10/137544
With PyTorch 1.9, I get no errors. However, with PyTorch 1.10, I get this error: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 1, 256]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Even though PyTorch 1.10 isn't supported, I was wondering: did any large behavior change happen between PyTorch 1.9 and 1.10? It seems odd that this error wouldn't have been raised with 1.9.
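In case it helps anyone reading the error message: "is at version 1; expected version 0" refers to the version counter that autograd keeps on each tensor. Below is a small sketch of that mechanism using an explicit in-place multiply instead of dropout. It is an illustration only, not code from this repo, and whether the final backward() raises is an assumption that depends on the PyTorch version (I would expect it to raise on 1.10 or later).

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.relu(x)

version_before = y._version  # 0: nothing has written into y yet
y.mul_(2.0)                  # in-place op, bumps y's version counter
version_after = y._version   # 1: one in-place modification recorded

# If ReLU's backward needs the value y had at version 0, autograd
# notices the mismatch here and raises the RuntimeError quoted above.
try:
    y.sum().backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(version_before, version_after, error_message)
```

As the hint in the error says, calling torch.autograd.set_detect_anomaly(True) before the forward pass makes PyTorch also report which forward operation produced the tensor that was later overwritten.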