INK-USC / RE-Net

Recurrent Event Network: Autoregressive Structure Inference over Temporal Knowledge Graphs (EMNLP 2020)
http://inklab.usc.edu/renet/

Problems with RuntimeError: #30

Closed · nareto closed this issue 4 years ago

nareto commented 4 years ago

Hello, I'm having trouble running the training on an NVIDIA GPU. I always get the same error about the tensors not all being on the same device (CPU and GPU). I saw there were already similar issues, so I pulled in the latest changes two hours ago and retried, but I still hit the same problem when running train.py (on the YAGO dataset; same result with WIKI).

Python is 3.6; with PyTorch 1.4.0 (CUDA 10.1) I get:

Traceback (most recent call last):
  File "train.py", line 256, in <module>
    train(args)
  File "train.py", line 184, in train
    ranks, loss = model.evaluate_filter(batch_data, (s_hist, s_hist_t), (o_hist, o_hist_t), global_model, total_data)
  File "/output/re-net/model.py", line 387, in evaluate_filter
    loss, sub_pred, ob_pred = self.predict(triplet, s_hist, o_hist, global_model)
  File "/output/re-net/model.py", line 239, in predict
    probs = prob_s * self.pred_r_rank2(ss, rr, subject=True)
  File "/output/re-net/model.py", line 193, in pred_r_rank2
    reverse=reverse)
  File "/output/re-net/Aggregator.py", line 178, in predict_batch
    s_len_non_zero, s_tem, r_tem, g, node_ids_graph, global_emb_list = get_s_r_embed_rgcn(s_hist, s, r, ent_embeds, graph_dict, global_emb)
  File "/output/re-net/utils.py", line 273, in get_s_r_embed_rgcn
    batched_graph = dgl.batch(g_list)
  File "/usr/local/lib/python3.6/dist-packages/dgl/graph.py", line 4189, in batch
    for key in node_attrs}
  File "/usr/local/lib/python3.6/dist-packages/dgl/graph.py", line 4189, in <dictcomp>
    for key in node_attrs}
  File "/usr/local/lib/python3.6/dist-packages/dgl/backend/pytorch/tensor.py", line 141, in cat
    return th.cat(seq, dim=dim)
RuntimeError: Expected object of backend CUDA but got backend CPU for sequence element 0 in sequence argument at position #1 'tensors'

With PyTorch 1.5.1 (CUDA 10.1) I get, at the exact same line in utils.py:

RuntimeError: All input tensors must be on the same device. Received cpu and cuda:0
CunchaoZ commented 4 years ago

I have the same problem.

nareto commented 4 years ago

I think I may have found the crux of the issue: it's the lines in utils.py that call dgl.batch (in two places, around lines 238 and 278). The graphs loaded from the cached graph dictionary stay on the CPU, while the rest of the model's tensors are on the GPU.

Right before each of those calls I added:

# move every graph to the GPU so dgl.batch sees tensors on a single device
g_list = [g.to(torch.device('cuda:0')) for g in g_list]

In a few hours I'll know if it worked.
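For anyone else hitting this: the failure mode is generic to PyTorch, not specific to RE-Net. dgl.batch ends up calling torch.cat on the graphs' node features, and torch.cat refuses to mix CPU and CUDA tensors. A minimal sketch of the same error and the same style of fix, using plain torch tensors (no DGL needed) and falling back to CPU when no GPU is present, so it is only an illustration of the device-alignment principle, not the project's actual code:

```python
import torch

# Pick the training device; fall back to CPU so this sketch runs anywhere.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Tensors created without an explicit device land on the CPU -- this is
# what happens to the cached graphs' features before dgl.batch is called.
cpu_feats = [torch.zeros(3, 4), torch.ones(2, 4)]

# torch.cat raises the "same device" RuntimeError when inputs are mixed
# across devices; moving everything first avoids it.
moved = [t.to(device) for t in cpu_feats]  # analogous to g.to(...) per graph
batched = torch.cat(moved, dim=0)          # analogous to dgl.batch(g_list)

assert all(t.device == batched.device for t in moved)
print(batched.shape)
```

Using `torch.cuda.is_available()` (or the device of an existing model parameter) instead of hard-coding 'cuda:0' also keeps the fix from breaking CPU-only runs.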

CunchaoZ commented 4 years ago

Thanks a lot! It works!