Oh wow, that is quite different, and not even one of the GPU results looks good. I tried to reproduce this misbehavior with the same torch and deeptime versions, but to no avail; the results look okay to me. But then again, I don't have access to an A5000 right now. :slightly_smiling_face: Is it possible for you to try on different hardware? Also, have you tried rebooting the system? (Sounds silly, I know, but sometimes the GPU cache / driver gets a bit confused.)
I tested both a GTX 980 Ti and an RTX 2080 with the same cuda/conda environment and they worked perfectly fine :) As I don't have the privilege to restart the system, I tried different nodes equipped with A5000s in our cluster, and they failed consistently. I also tried torch.cuda.empty_cache(), but it didn't help either.
I will see if it is a compatibility issue with PyTorch by testing other deep learning benchmarks, and I will get back to you if I find out what's wrong.
BTW: the other VAMPNet notebook (https://github.com/deeptime-ml/deeptime-notebooks/blob/master/vampnets.ipynb) has no issue running on the GPU.
Hmmm. My best guess is a driver version that is not (yet) fully compatible with A5000s. But really, I am not sure. Please do keep me posted! Thanks
Found the solution! The culprit is the TF32 tensor cores on the new Ampere devices. I have to manually set torch.backends.cuda.matmul.allow_tf32 = False to increase the accuracy of the eigensolver for the Koopman matrix (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices).
Related issues: https://github.com/pytorch/pytorch/labels/module%3A%20tf32
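For reference, a minimal sketch of this workaround using the documented PyTorch TF32 flags (the cuDNN flag is included only for completeness and may not be needed for this particular issue):

```python
import torch

# Disable TF32 for CUDA matmuls on Ampere GPUs so that the covariance /
# eigensolver computations in the VAMP score run at full float32 precision.
torch.backends.cuda.matmul.allow_tf32 = False

# Optionally also disable TF32 inside cuDNN convolutions (not strictly
# required for this issue, added for completeness).
torch.backends.cudnn.allow_tf32 = False
```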
That is interesting. I should put up a warning in the vampnets documentation that this might be required. Good detective work!
Does it make sense to set it to False by default in the codebase, given that it is obviously not only ala2 that fails? Or add it as a context decorator only around the vamp_score, in case people gain performance elsewhere with this setting? I can take a stab at it.
I was thinking about that, too. I don't think setting it as a global default in the library is a good way to go, as it might affect the performance of other parts of a larger program (and silently so). I like the context manager idea! There is an issue with multi-threaded applications, but I don't think that is a big concern here. Looking forward to a PR!
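A minimal sketch of what such a context manager could look like, assuming only the matmul flag needs toggling; the name tf32_disabled and the commented usage around vamp_score are illustrative, not existing deeptime API:

```python
import contextlib
import torch

@contextlib.contextmanager
def tf32_disabled():
    """Temporarily disable TF32 matmuls, restoring the previous setting on exit."""
    previous = torch.backends.cuda.matmul.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = False
    try:
        yield
    finally:
        torch.backends.cuda.matmul.allow_tf32 = previous

# Hypothetical usage around the score computation (argument names illustrative):
# with tf32_disabled():
#     score = vamp_score(chi_t, chi_tau, method='VAMP2')
```

Restoring the previous value in a finally block keeps the setting from leaking out of the scored region, which is the point of scoping it instead of flipping the global default.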
It turns out TF32 needs to be disabled throughout training (and validation).
Results after fixing.
This might be a general problem for applications where precision does matter, for example SO(3)-equivariant nets etc. I am curious to see how things evolve.
In case you missed it, this TF32 setting is now False by default as of PyTorch 1.12.
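A quick way to check what a given installation does, reading only the documented flag (nothing deeptime-specific):

```python
import torch

# Print the PyTorch version and the current TF32 matmul setting;
# on PyTorch >= 1.12 the flag should default to False.
print(torch.__version__, torch.backends.cuda.matmul.allow_tf32)
```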
Thanks @davidgilbertson, I missed this indeed!
Describe the bug
I tried to run the ala2 notebook (https://github.com/deeptime-ml/deeptime-notebooks/blob/master/examples/ala2-example.ipynb) but ended up with quite different results for GPU vs. CPU training. The CPU runs had a much higher success rate and flat training curves compared to the GPU runs. I am wondering whether this is something common or whether I have made a mistake.
Results
I tested 10 individual runs with the same parameters as in the tutorial notebook.
CPU
GPU
System
CPU: AMD EPYC 7551
GPU: RTX A5000
System: Ubuntu 20.04.1
Python 3.9
torch 1.11.0+cu113
deeptime 0.4.1+8.g38b0158.dirty (main branch)