Closed mhillebrand closed 3 years ago
Hmm. I was using DGL 0.5.3 above (for its CUDA 11 support), but this project requires DGL 0.4? I just tried upgrading to DGL 0.6.0, and I'm now presented with a new error message:
```
Traceback (most recent call last):
  File "train.py", line 381, in <module>
    main()
  File "train.py", line 377, in main
    setup_training(model, train_loader, valid_loader, valid_dataset, hps)
  File "train.py", line 71, in setup_training
    run_training(model, train_loader, valid_loader, valset, hps, train_dir)
  File "train.py", line 114, in run_training
    outputs = model.forward(G)  # [n_snodes, 2]
  File "/home/matt/HeterSumGraph/HiGraph.py", line 201, in forward
    doc_feature, snid2dnid = self.set_dnfeature(graph)
  File "/home/matt/HeterSumGraph/HiGraph.py", line 239, in set_dnfeature
    assert not torch.any(torch.isnan(doc_feature)), "doc_feature_element"
AssertionError: doc_feature_element
```
To get things working on the GPU, I had to change several lines of code in this project from `G.to(device)` to `G = G.to(device)`. Now I have a data-type assertion error, and when I debug, I see data types in my "cuda" graph like `torch.float32` and `torch.int64`. I'm new to PyTorch; could it be that these data types need to be `torch.cuda.xxx` instead of `torch.xxx`?
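A minimal sketch of the two points above, assuming only stock PyTorch: the dtype and the device of a tensor are independent attributes (there is no separate `torch.cuda.float32` dtype), and `.to()` is not in-place, which is why `G.to(device)` has to be rewritten as `G = G.to(device)`:

```python
import torch

# dtype and device are independent: moving data to CUDA does not
# change torch.float32 into some "torch.cuda" dtype.
x = torch.zeros(2, 3)  # CPU tensor, default dtype
print(x.dtype)         # torch.float32

# .to() returns a NEW tensor (or graph, in DGL's case) rather than
# mutating the original -- the result must be reassigned.
y = x.to("cuda") if torch.cuda.is_available() else x
print(y.dtype)         # still torch.float32, regardless of device
```

So seeing `torch.float32` and `torch.int64` on a CUDA graph is expected; those dtype names do not need to change.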
BTW, training on the CPU with DGL 0.6.0 does indeed work fine...but it's really slow (of course).
A new version of DGL was released today, 0.6.0post1, with CUDA 11.1 support, which appears to have solved my problems!
Whether I try train.py or evaluation.py with supplied checkpoints, I get the same error message:
```
Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal
```
Here's my `nvidia-smi` output:
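An "invalid device ordinal" error usually means the code asks for a GPU index that exceeds the number of devices visible to the process. A hedged, stdlib-only sketch of one common workaround (the `requested` index is a hypothetical stand-in for a hard-coded `cuda:N` in a script or config):

```python
import os

# Restrict which physical GPUs the process can see; the visible GPUs
# are renumbered starting from 0. Set this BEFORE importing torch/DGL.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

requested = 3  # e.g. a hard-coded `cuda:3` left over in the repo
visible = len(os.environ["CUDA_VISIBLE_DEVICES"].split(","))
if requested >= visible:
    requested = 0  # fall back to the first visible device
device = f"cuda:{requested}"
print(device)
```

With only one GPU visible, any hard-coded ordinal above 0 would otherwise trigger exactly this CUDA check failure.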