sef43 opened 1 year ago
Thanks for reporting the issue!
This is a problem with NVFuser. A bug report has been filed at https://github.com/pytorch/pytorch/issues/84510
The minimal reproducible example I extracted from the angular function is the following:
```python
import torch
from torch import Tensor

def angular_terms(Rca: float, ShfZ: Tensor, EtaA: Tensor, Zeta: Tensor,
                  ShfA: Tensor, vectors12: Tensor) -> Tensor:
    # Pairs of displacement vectors, reshaped so the constants broadcast.
    vectors12 = vectors12.view(2, -1, 3, 1, 1, 1, 1)
    # Elementwise product over the pair, summed over xyz: a dot product.
    cos_angles = vectors12.prod(0).sum(1)
    ret = (cos_angles + ShfZ) * Zeta * ShfA * 2
    return ret.flatten(start_dim=1)
```
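A minimal sketch of how this repro might be driven; every shape and value below is an illustrative assumption, chosen only to satisfy the broadcasting, and the loop exists because the profile-guided executor needs a few warm-up calls before NVFuser kicks in:

```python
scripted = torch.jit.script(angular_terms)

vectors12 = torch.randn(2, 8, 3, device='cuda', requires_grad=True)
ShfZ = torch.randn(1, 1, 1, 4, device='cuda')
EtaA = torch.randn(1, device='cuda')           # unused by the minimal repro
Zeta = torch.randn(1, 1, 2, 1, device='cuda')
ShfA = torch.randn(1, 3, 1, 1, device='cuda')

# Run a few times so the profiling executor specializes and fuses the graph;
# the incorrect gradients appear only once the fused kernels are in use.
for _ in range(5):
    out = scripted(5.2, ShfZ, EtaA, Zeta, ShfA, vectors12)
    grad, = torch.autograd.grad(out.sum(), vectors12)
```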
Replacing the `**` operation with `torch.float_power` will not fix the root cause of this problem.
At this moment, I would recommend disabling NVFuser by running the following:

```python
torch._C._jit_set_nvfuser_enabled(False)
```
This will switch to the NNC fuser (https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/OVERVIEW.md#fusers) instead of NVFuser, which I have tested and confirmed works correctly.
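A sketch of applying this in practice; the TorchANI model setup below is an assumption for illustration, and the essential point is that the flag must be set before the scripted model first runs:

```python
import torch
import torchani

torch._C._jit_set_nvfuser_enabled(False)  # fusion groups now compile with NNC

device = torch.device('cuda')
# Any TorchScript-compatible model works here; ANI2x is just an example.
model = torch.jit.script(torchani.models.ANI2x(periodic_table_index=True)).to(device)
```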
Hi, I have found that with PyTorch 1.13 and 2.0 (but not with PyTorch <= 1.12), the torch.jit.script profile-guided optimizations (which are on by default) cause significant errors in the position gradients calculated via backpropagation through aev_computer when using a CUDA device. This is demonstrated in https://github.com/openmm/openmm-ml/issues/50.
An example is shown below; manually turning off the JIT optimizations gives accurate forces. On an RTX 3090, the forces computed with the default optimizations deviate significantly from the unoptimized ones.
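A sketch of the kind of comparison described, with the caveats that the molecule is illustrative and that the private `torch._C._jit_set_profiling_*` flags are an assumption for one way of turning the profile-guided optimizations off:

```python
import torch
import torchani

# Uncomment to disable the profile-guided JIT optimizations (private
# PyTorch APIs; behavior may differ across versions):
# torch._C._jit_set_profiling_executor(False)
# torch._C._jit_set_profiling_mode(False)

device = torch.device('cuda')
model = torch.jit.script(torchani.models.ANI2x(periodic_table_index=True)).to(device)

# Illustrative water molecule: atomic numbers plus rough coordinates.
species = torch.tensor([[8, 1, 1]], device=device)
coordinates = torch.tensor([[[0.00, 0.00, 0.00],
                             [0.00, 0.00, 0.96],
                             [0.93, 0.00, -0.24]]],
                           device=device, requires_grad=True)

# Run several times: the profiling executor fuses only after warm-up calls,
# so any gradient error shows up on the later iterations.
for i in range(5):
    energy = model((species, coordinates)).energies.sum()
    grad, = torch.autograd.grad(energy, coordinates)
    print(i, (-grad).flatten()[:3])  # first few force components
```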
I have found that a workaround which removes the errors is to replace a `**` operation with `torch.float_power`: https://github.com/aiqm/torchani/commit/172b6fe85d3ab2acd3faa7a025b5aded22f2537c
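The shape of that replacement, sketched on an illustrative expression rather than the exact line changed in the commit; note that torch.float_power always computes in double precision, so the result is cast back to the working dtype:

```python
import torch

x = torch.randn(10, device='cuda')

y_before = x ** 2.0                               # original form
y_after = torch.float_power(x, 2.0).to(x.dtype)   # workaround form

assert torch.allclose(y_before, y_after)
```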