Oh wow, that is quite different, and not even one of the GPU results looks good. I tried to reproduce this misbehavior with the same torch and deeptime versions, but to no avail; the results look okay to me. But then again, I don't have access to an A5000 right now. :slightly_smiling_face: Is it possible for you to try on different hardware? Also, have you tried rebooting the system? (Sounds silly, I know, but sometimes the GPU cache / driver gets a bit confused.)
I tested both a GTX 980 Ti and an RTX 2080 with the same cuda/conda environment and they worked perfectly fine :) As I don't have the privilege to restart the system, I tried different nodes equipped with A5000s in our cluster, and they failed consistently. I also tried torch.cuda.empty_cache(), but it didn't help either.
I will see if it is a compatibility issue with PyTorch by testing other deep learning benchmarks, and I will get back to you if I find out what's wrong.
BTW: the other VAMPNet notebook (https://github.com/deeptime-ml/deeptime-notebooks/blob/master/vampnets.ipynb) has no issue running on the GPU.
Hmmm. My best guess is a driver version that is not (yet) fully compatible with A5000s. But really, I am not sure. Please do keep me posted! Thanks
Found the solution! The culprit is the TF32 tensor cores on the new Ampere devices. I have to manually set torch.backends.cuda.matmul.allow_tf32 = False to increase the accuracy of the eigensolver for the Koopman matrix (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices).
Related issues: https://github.com/pytorch/pytorch/labels/module%3A%20tf32
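For reference, a minimal sketch of this workaround using the documented PyTorch TF32 flags (the cuDNN flag is included only for completeness and may not be needed for this particular issue):

```python
import torch

# Disable TF32 for CUDA matmuls on Ampere GPUs so that the covariance /
# eigensolver computations in the VAMP score run at full float32 precision.
torch.backends.cuda.matmul.allow_tf32 = False

# Optionally also disable TF32 inside cuDNN convolutions (not strictly
# required for this issue, added for completeness).
torch.backends.cudnn.allow_tf32 = False
```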
That is interesting. I should put up a warning in the vampnets documentation that this might be required. Good detective work!
Does it make sense to set it to False by default in the codebase, given that it is obviously not only ala2 that fails? Or add it as a context decorator only around the vamp_score, in case people gain performance elsewhere with this setting? I can take a stab at it.
I was thinking about that, too. I don't think setting it as a global default in the library is a good way to go, as it might affect the performance of other parts of a larger program (and silently so). I like the context manager idea! There is an issue with multi-threaded applications, but I don't think that is a big concern here. Looking forward to a PR!
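A minimal sketch of what such a context manager could look like, assuming only the matmul flag needs toggling; the name tf32_disabled and the commented usage around vamp_score are illustrative, not existing deeptime API:

```python
import contextlib
import torch

@contextlib.contextmanager
def tf32_disabled():
    """Temporarily disable TF32 matmuls, restoring the previous setting on exit."""
    previous = torch.backends.cuda.matmul.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = False
    try:
        yield
    finally:
        torch.backends.cuda.matmul.allow_tf32 = previous

# Hypothetical usage around the score computation (argument names illustrative):
# with tf32_disabled():
#     score = vamp_score(chi_t, chi_tau, method='VAMP2')
```

Restoring the previous value in a finally block keeps the setting from leaking out of the scored region, which is the point of scoping it instead of flipping the global default.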
It turns out TF32 needs to be disabled throughout training (and validation).
Results after fixing.
This might be a general problem for applications where precision does matter, for example SO(3)-equivariant nets etc. I am curious to see how things evolve.
In case you missed it, this TF32 setting is now False by default as of PyTorch 1.12.
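A quick way to check what a given installation does, reading only the documented flag (nothing deeptime-specific):

```python
import torch

# Print the PyTorch version and the current TF32 matmul setting;
# on PyTorch >= 1.12 the flag should default to False.
print(torch.__version__, torch.backends.cuda.matmul.allow_tf32)
```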
Thanks @davidgilbertson, I missed this indeed!
Describe the bug
I tried to run the ala2 notebook (https://github.com/deeptime-ml/deeptime-notebooks/blob/master/examples/ala2-example.ipynb) but ended up with quite different results for GPU vs. CPU training. The CPU runs had a much higher success rate and flat training curves compared to the GPU runs. I am wondering whether this is something common or whether I have made a mistake.
Results
I tested 10 individual runs with the same parameters as in the tutorial notebook.
CPU
GPU
System
CPU: AMD EPYC 7551
GPU: RTX A5000
System: Ubuntu 20.04.1
Python 3.9
torch 1.11.0+cu113
deeptime 0.4.1+8.g38b0158.dirty (main branch)