sbonner0 closed this issue 2 years ago
Hi! This looks like a compilation error in the CUDA header files, not in TorchDrug.
I just skimmed through cuComplex.h, and I think the cause is that the type float2 (a.k.a. cuFloatComplex) is not recognized correctly by the compiler. I guess the reason is that either your GPU hardware doesn't support complex numbers or we are missing some compilation flags that turn the feature on.
As we don't need to compile spmm/rspmm for complex tensors, maybe you can try turning off compilation for complex tensors in the PyTorch JIT? I guess it is controlled by some C++ macro, but I'm not sure exactly which one.
Hey @KiddoZhu, thanks so much for the prompt response!
The GPU is just a V100, so it isn't anything exotic. I will try experimenting with the JIT and let you know how it goes. Is the PyTorch 1.8.2 dependency required, or could I also try a newer version?
We use V100 too, so it sounds weird to me. The code is mainly developed and tested on V100 + PyTorch 1.8.1 + CUDA 10.2. We also know it is good on A100 + PyTorch 1.8.2 LTS + CUDA 11.1.
If you run this code with PyTorch 1.10 or newer, it will consume slightly more memory for the 0-th GPU, and the default batch size will cause OOM for a 32GB V100 on some datasets. Other than this, we don't see any problem for newer PyTorch versions.
Hey @KiddoZhu, I managed to solve this. It seems to have been a mismatch between the conda and native CUDA libraries installed on the HPC system. I now have it running using the Python provided on the HPC system, and everything is working.
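For anyone hitting a similar mismatch, here is a minimal diagnostic sketch, not something from the TorchDrug codebase. It assumes that JIT-compiled extensions are built with whichever `nvcc` is found on the system, while the PyTorch binary reports the CUDA version it was built against in `torch.version.cuda`. The helper names (`parse_nvcc_release`, `system_nvcc_release`, `versions_match`) are hypothetical and only for illustration.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return the 'major.minor' release found in `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def system_nvcc_release():
    """Run the first `nvcc` on PATH and return its release, or None if nvcc is absent."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def versions_match(torch_cuda, nvcc_release):
    """Compare major.minor of the two versions; an unknown version counts as a mismatch."""
    if torch_cuda is None or nvcc_release is None:
        return False
    return torch_cuda.split(".")[:2] == nvcc_release.split(".")[:2]
```

To use it, pass `torch.version.cuda` and `system_nvcc_release()` to `versions_match` inside the environment that fails. A `False` result suggests the conda environment and the system toolkit disagree, which may be worth ruling out before touching any compilation flags.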
Thanks for your quick reply! I will now close the issue, but I had a quick question about model checkpointing: are checkpoints only saved at the end of each epoch? If so, can this be changed to save after n update steps?
With the current TorchDrug interface, it is hard to dump a checkpoint in the middle of an epoch. But you can override the length of an epoch with the argument batch_per_epoch in solver.train. That achieves a similar effect.
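The suggestion above can be sketched roughly as follows. This is a minimal sketch, assuming `solver` is a TorchDrug `core.Engine` (or any object) exposing `train(num_epoch=..., batch_per_epoch=...)` and `save(path)`. The helper name `train_with_step_checkpoints` and the file name pattern are hypothetical.

```python
def train_with_step_checkpoints(solver, num_checkpoints, steps_per_checkpoint,
                                path_template="checkpoint_step_{step}.pth"):
    """Train in short 'epochs' of a fixed number of batches, saving after each one.

    `solver` is assumed to provide train(num_epoch, batch_per_epoch) and save(path).
    Returns the list of checkpoint paths that were written.
    """
    saved_paths = []
    for i in range(1, num_checkpoints + 1):
        # One short "epoch" that stops after `steps_per_checkpoint` batches.
        solver.train(num_epoch=1, batch_per_epoch=steps_per_checkpoint)
        # Name the checkpoint after the cumulative number of update steps.
        path = path_template.format(step=i * steps_per_checkpoint)
        solver.save(path)
        saved_paths.append(path)
    return saved_paths
```

For example, `train_with_step_checkpoints(solver, num_checkpoints=10, steps_per_checkpoint=500)` would write a checkpoint every 500 batches. Each short epoch may start a fresh pass over the data rather than continuing the previous one, so this only approximates mid-epoch checkpointing.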
Hey,
Most likely this is an error with TorchDrug itself; however, when I try to run any of the examples from the README, the code crashes with the following error:
This only occurs on a GPU Linux machine, which is using CUDA 11.1 and GCC 10.3.
The conda env is as follows:
Any ideas how to get this to run?
Many thanks!