dqwang122 / HeterSumGraph

Code for ACL2020 paper "Heterogeneous Graph Neural Networks for Extractive Document Summarization"

Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal #18

Closed: mhillebrand closed this issue 3 years ago

mhillebrand commented 3 years ago

Whether I run train.py or evaluation.py (with the supplied checkpoints), I get the same error message: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal

$ python train.py --cuda --gpu 0 --data_dir ./datasets/multinews --cache_dir ./cache/MultiNews --embedding_path /opt/mr/embeddings/glove.840B.300d.txt --model HDSG --save_root ./save --log_root ./log --lr_descent --grad_clip -m 3

Using backend: pytorch
2021-03-07 17:54:52,953 INFO    : Pytorch 1.8.0+cu111
2021-03-07 17:54:52,953 INFO    : [INFO] Create Vocab, vocab path is ./cache/MultiNews/vocab
2021-03-07 17:54:52,986 INFO    : [INFO] max_size of vocab was specified as 50000; we now have 50000 words. Stopping reading.
2021-03-07 17:54:52,986 INFO    : [INFO] Finished constructing vocabulary of 50000 total words. Last word added: medicated
2021-03-07 17:54:53,077 INFO    : [INFO] Loading external word embedding...
2021-03-07 17:55:29,241 INFO    : [INFO] External Word Embedding iov count: 46079, oov count: 3921
2021-03-07 17:55:29,357 INFO    : Namespace(atten_dropout_prob=0.1, batch_size=32, bidirectional=True, cache_dir='./cache/MultiNews', cuda=True, data_dir='./datasets/multinews', doc_max_timesteps=50, embed_train=False, embedding_path='/opt/mr/embeddings/glove.840B.300d.txt', feat_embed_size=50, ffn_dropout_prob=0.1, ffn_inner_hidden_size=512, gpu='0', grad_clip=True, hidden_size=64, log_root='./log', lr=0.0005, lr_descent=True, lstm_hidden_state=128, lstm_layers=2, m=3, max_grad_norm=1.0, model='HDSG', n_epochs=20, n_feature_size=128, n_head=8, n_iter=1, n_layers=1, recurrent_dropout_prob=0.1, restore_model='None', save_root='./save', sent_max_len=100, use_orthnormal_init=True, vocab_size=50000, word_emb_dim=300, word_embedding=True)
2021-03-07 17:55:29,463 INFO    : [MODEL] HeterDocSumGraph 
2021-03-07 17:55:29,463 INFO    : [INFO] Start reading MultiExampleSet
2021-03-07 17:55:30,740 INFO    : [INFO] Finish reading MultiExampleSet. Total time is 1.277061, Total size is 44972
2021-03-07 17:55:30,740 INFO    : [INFO] Loading filter word File ./cache/MultiNews/filter_word.txt
2021-03-07 17:55:30,808 INFO    : [INFO] Loading word2sent TFIDF file from ./cache/MultiNews/train.w2s.tfidf.jsonl!
2021-03-07 17:55:37,931 INFO    : [INFO] Loading word2doc TFIDF file from ./cache/MultiNews/train.w2d.tfidf.jsonl!
2021-03-07 17:55:42,741 INFO    : [INFO] Start reading MultiExampleSet
2021-03-07 17:55:42,838 INFO    : [INFO] Finish reading MultiExampleSet. Total time is 0.097269, Total size is 5622
2021-03-07 17:55:42,839 INFO    : [INFO] Loading filter word File ./cache/MultiNews/filter_word.txt
2021-03-07 17:55:42,909 INFO    : [INFO] Loading word2sent TFIDF file from ./cache/MultiNews/val.w2s.tfidf.jsonl!
2021-03-07 17:55:43,825 INFO    : [INFO] Loading word2doc TFIDF file from ./cache/MultiNews/val.w2d.tfidf.jsonl!
2021-03-07 17:55:46,275 INFO    : [INFO] Use cuda
2021-03-07 17:55:46,275 INFO    : [INFO] Create new model for training...
2021-03-07 17:55:46,275 INFO    : [INFO] Starting run_training
Traceback (most recent call last):
  File "train.py", line 381, in <module>
    main()
  File "train.py", line 377, in main
    setup_training(model, train_loader, valid_loader, valid_dataset, hps)
  File "train.py", line 71, in setup_training
    run_training(model, train_loader, valid_loader, valset, hps, train_dir)
  File "train.py", line 114, in run_training
    outputs = model.forward(G)  # [n_snodes, 2]
  File "/home/matt/HeterSumGraph/HiGraph.py", line 201, in forward
    doc_feature, snid2dnid = self.set_dnfeature(graph)
  File "/home/matt/HeterSumGraph/HiGraph.py", line 237, in set_dnfeature
    snodes = [nid for nid in graph.predecessors(dnode) if graph.nodes[nid].data["dtype"]==1]
  File "/home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/heterograph.py", line 2647, in predecessors
    return self._graph.predecessors(self.get_etype_id(etype), v)
  File "/home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/heterograph_index.py", line 370, in predecessors
    self, int(etype), int(v)))
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 222, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 211, in dgl._ffi._cy3.core.FuncCall3
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [17:55:54] /opt/dgl/src/array/cuda/utils.cu:19: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal
Stack trace:
  [bt] (0) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f59f26abc8f]
  [bt] (1) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::cuda::AllTrue(signed char*, long, DLContext const&)+0x10f) [0x7f59f32f81ef]
  [bt] (2) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(std::pair<bool, bool> dgl::aten::impl::COOIsSorted<(DLDeviceType)2, long>(dgl::aten::COOMatrix)+0x9d) [0x7f59f2efaeed]
  [bt] (3) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::aten::COOIsSorted(dgl::aten::COOMatrix)+0x1e3) [0x7f59f2690893]
  [bt] (4) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::aten::CSRMatrix dgl::aten::impl::COOToCSR<(DLDeviceType)2, long>(dgl::aten::COOMatrix)+0x4c8) [0x7f59f2ef9378]
  [bt] (5) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::aten::COOToCSR(dgl::aten::COOMatrix)+0x3f3) [0x7f59f268f553]
  [bt] (6) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::UnitGraph::GetInCSR(bool) const+0x300) [0x7f59f2e9d2e0]
  [bt] (7) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::UnitGraph::GetFormat(dgl::SparseFormat) const+0x4d) [0x7f59f2e9e25d]
  [bt] (8) /home/matt/anaconda3/envs/nlp/lib/python3.7/site-packages/dgl/libdgl.so(dgl::UnitGraph::Predecessors(unsigned long, unsigned long) const+0x34) [0x7f59f2e9e784]

Here's my nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:21:00.0  On |                  N/A |
| 30%   35C    P8    32W / 350W |    589MiB / 24265MiB |     24%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3090    Off  | 00000000:4A:00.0 Off |                  N/A |
|  0%   38C    P8    26W / 350W |      6MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1960      G   /usr/lib/xorg/Xorg                448MiB |
|    0   N/A  N/A      2917      G   cinnamon                           44MiB |
|    0   N/A  N/A      4333      G   ...AAAAAAAA== --shared-files       76MiB |
|    0   N/A  N/A     12990      G   ...oken=16001321251127579134       15MiB |
|    0   N/A  N/A     13177      G   /usr/bin/nvidia-settings            0MiB |
|    0   N/A  N/A     33039      G   /usr/bin/nvidia-settings            0MiB |
|    1   N/A  N/A      1960      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
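
For what it's worth, here is a minimal sanity check (not part of the run above; expected values in the comments are just my guesses based on the nvidia-smi output) to confirm what PyTorch and DGL actually see. If device_count() comes back as 2 here, the bad ordinal presumably comes from inside DGL rather than the driver:

import torch
import dgl

print(torch.__version__, torch.version.cuda)   # expect 1.8.0+cu111 / 11.1
print(torch.cuda.device_count())               # expect 2, matching nvidia-smi above
print(torch.cuda.get_device_name(0))           # expect GeForce RTX 3090
print(dgl.__version__)                         # the installed DGL build
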
mhillebrand commented 3 years ago

Hmm. I was using DGL 0.5.3 above (for its CUDA 11 support), but this project requires DGL 0.4? I just tried upgrading to DGL 0.6.0, and I'm now presented with a new error message:

Traceback (most recent call last):
  File "train.py", line 381, in <module>
    main()
  File "train.py", line 377, in main
    setup_training(model, train_loader, valid_loader, valid_dataset, hps)
  File "train.py", line 71, in setup_training
    run_training(model, train_loader, valid_loader, valset, hps, train_dir)
  File "train.py", line 114, in run_training
    outputs = model.forward(G)  # [n_snodes, 2]
  File "/home/matt/HeterSumGraph/HiGraph.py", line 201, in forward
    doc_feature, snid2dnid = self.set_dnfeature(graph)
  File "/home/matt/HeterSumGraph/HiGraph.py", line 239, in set_dnfeature
    assert not torch.any(torch.isnan(doc_feature)), "doc_feature_element"
AssertionError: doc_feature_element
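
In case it helps narrow this down, here is a small debugging sketch (hypothetical, not in the repo) that could be dropped in just before the failing assert in HiGraph.py's set_dnfeature, where torch and doc_feature are already in scope, to show which document nodes come out NaN (assuming doc_feature is a 2-D [n_dnodes, feat_dim] tensor):

nan_mask = torch.isnan(doc_feature)            # same check the assert performs
if nan_mask.any():
    # collect the row indices (document nodes) that contain at least one NaN
    bad_rows = torch.unique(nan_mask.nonzero(as_tuple=True)[0])
    print("NaN document nodes:", bad_rows.tolist())
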
mhillebrand commented 3 years ago

To get things working on the GPU, I had to change several lines of code in this project from G.to(device) to G = G.to(device). Now I have a datatype assertion error, and when I debug, I see data types in my "cuda" graph like torch.float32 and torch.int64. I'm new to PyTorch; could it be that these data types need to be torch.cuda.xxx instead of torch.xxx?
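
For the record, the change I made was roughly this (a sketch, not the exact diff; as far as I can tell, in recent DGL versions DGLGraph.to() returns a new graph rather than moving the existing one in place):

# before: the graph returned by .to() was discarded, so G stayed on the CPU
# G.to(torch.device("cuda:0"))

# after: keep the graph that .to() returns
G = G.to(torch.device("cuda:0"))
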

(debugger screenshot: the graph's node data, showing tensors with dtypes torch.float32 and torch.int64)

mhillebrand commented 3 years ago

BTW, training on the CPU with DGL 0.6.0 does indeed work fine...but it's really slow (of course).

mhillebrand commented 3 years ago

A new version of DGL was released today, 0.6.0post1, with CUDA 11.1 support, which appears to have solved my problems!