INK-USC / RE-Net

Recurrent Event Network: Autoregressive Structure Inference over Temporal Knowledge Graphs (EMNLP 2020)
http://inklab.usc.edu/renet/

Running on Google Colab: "dgl._ffi.base.DGLError: Cannot assign node feature..." #50

Closed: davidshumway closed this issue 2 years ago

davidshumway commented 2 years ago

Running on Google Colab, the following issue occurs in pretrain.py. Perhaps it is an installation issue?

Traceback (most recent call last):
  File "pretrain.py", line 141, in <module>
    train(args)
  File "pretrain.py", line 85, in train
    loss = model(batch_data, true_s, true_o, graph_dict)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/gdrive/MyDrive/test/RE-Net/global_model.py", line 47, in forward
    packed_input = self.aggregator(sorted_t, self.ent_embeds, graph_dict, reverse=reverse)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/content/gdrive/MyDrive/test/RE-Net/Aggregator.py", line 54, in forward
    batched_graph.ndata['h'] = ent_embeds[batched_graph.ndata['id']].view(-1, ent_embeds.shape[1])
  File "/usr/local/lib/python3.7/dist-packages/dgl/view.py", line 81, in __setitem__
    self._graph._set_n_repr(self._ntid, self._nodes, {key : val})
  File "/usr/local/lib/python3.7/dist-packages/dgl/heterograph.py", line 3997, in _set_n_repr
    ' same device.'.format(key, F.context(val), self.device))
dgl._ffi.base.DGLError: Cannot assign node feature "h" on device cuda:0 to a graph on device cpu. Call DGLGraph.to() to copy the graph to the same device.
zjwu0522 commented 2 years ago

Hi, I encountered the same problem. Have you solved it?

zjwu0522 commented 2 years ago

Hi,

I added graph_dict = {key: graph_dict[key].to('cuda:5') for key in graph_dict} right after graph_dict is loaded (line 56 in pretrain.py), and the problem is solved.

However, with the batch size set to 512, GPU memory usage is already more than 30 GB. Considering that the default batch size is 1024 and the author used a GTX1080P for training, I am not sure whether there are any other problems.
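
For clarity, here is the workaround as a small sketch (only the last line is the actual change; the comments describe why it helps, assuming a DGL release newer than 0.4.x):

# pretrain.py, right after graph_dict is unpickled (around line 56).
# Newer DGL releases require a graph and its node/edge features to live on the
# same device, so copy every cached graph to the GPU before the aggregator
# assigns CUDA embeddings via batched_graph.ndata['h'].
# 'cuda:5' is just the device index on my machine; use the one you train on.
graph_dict = {key: graph_dict[key].to('cuda:5') for key in graph_dict}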

davidshumway commented 2 years ago

Interesting, @zjwu0522!

But now, after updating pretrain.py, another issue appears:

Using backend: pytorch
Namespace(batch_size=1024, dataset='YAGO', dropout=0.5, gpu=0, grad_norm=1.0, lr=0.001, max_epochs=20, maxpool=1, model=3, n_hidden=200, num_k=10, rnn_layers=1, seq_len=10)
start training...
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: invalid device ordinal
Exception raised from exchangeDevice at /pytorch/c10/cuda/impl/CUDAGuardImpl.h:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f4d61aad1e2 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xfb52 (0x7f4d61cf4b52 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0xf723c0 (0x7f4d62e883c0 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xf512f3 (0x7f4d62e672f3 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xf6b157 (0x7f4d62e81157 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x10e9c7d (0x7f4d99bebc7d in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x10e9f97 (0x7f4d99bebf97 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #7: at::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xfa (0x7f4d99cf6a1a in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>)::{lambda()#1}::operator()() const + 0x78 (0x7f4d3c1567c0 in /usr/local/lib/python3.7/dist-packages/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_1.6.0.so)
frame #9: torch::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x51 (0x7f4d3c156844 in /usr/local/lib/python3.7/dist-packages/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_1.6.0.so)
frame #10: TAempty + 0x129 (0x7f4d3c152f86 in /usr/local/lib/python3.7/dist-packages/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_1.6.0.so)
frame #11: dgl::runtime::NDArray::Empty(std::vector<long, std::allocator<long> >, DLDataType, DLContext) + 0xb6 (0x7f4d4defa0c6 in /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so)
frame #12: dgl::runtime::NDArray::CopyTo(DLContext const&) const + 0xc0 (0x7f4d4df31560 in /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so)
frame #13: dgl::aten::COOMatrix::CopyTo(DLContext const&) const + 0x7d (0x7f4d4e020ddd in /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so)
frame #14: dgl::UnitGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DLContext const&) + 0x292 (0x7f4d4e011562 in /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so)
frame #15: dgl::HeteroGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DLContext const&) + 0xf5 (0x7f4d4df42785 in /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so)
frame #16: <unknown function> + 0xcc081b (0x7f4d4df4f81b in /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so)
frame #17: DGLFuncCall + 0x48 (0x7f4d4dede228 in /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so)
frame #18: <unknown function> + 0x16623 (0x7f4d38946623 in /usr/local/lib/python3.7/dist-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so)
frame #19: <unknown function> + 0x1694b (0x7f4d3894694b in /usr/local/lib/python3.7/dist-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so)
<omitting python frames>
frame #40: __libc_start_main + 0xe7 (0x7f4dbab3abf7 in /lib/x86_64-linux-gnu/libc.so.6)
davidshumway commented 2 years ago

However, after changing "cuda:5" to "cuda:0" it now appears to train as expected. (Colab exposes a single GPU, so device ordinal 5 does not exist there, which is what the "invalid device ordinal" error is complaining about.)
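
A more portable variant of the same fix, as a sketch: it reuses the --gpu flag that pretrain.py already parses into args (the Namespace printed above shows gpu=0) and falls back to CPU when CUDA is unavailable.

import torch

# Derive the target device from the existing --gpu argument instead of
# hard-coding an ordinal, so the same line works on Colab (one GPU) and on
# multi-GPU servers alike.
device = torch.device(f'cuda:{args.gpu}' if torch.cuda.is_available() else 'cpu')
graph_dict = {key: graph_dict[key].to(device) for key in graph_dict}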

zhangjinyu19980915 commented 2 years ago

Hi, I encountered a similar problem in train.py. Have you solved it?

Traceback (most recent call last):
  File "/home/zjy/myprograms/RE-Net-master/train.py", line 243, in <module>
    train(args)
  File "/home/zjy/myprograms/RE-Net-master/train.py", line 173, in train
    ranks, loss = model.evaluate_filter(batch_data, (s_hist, s_hist_t), (o_hist, o_hist_t), global_model, total_data)
  File "/home/zjy/myprograms/RE-Net-master/model.py", line 388, in evaluate_filter
    loss, sub_pred, ob_pred = self.predict(triplet, s_hist, o_hist, global_model)
  File "/home/zjy/myprograms/RE-Net-master/model.py", line 223, in predict
    _, sub, prob_sub = global_model.predict(self.latest_time, self.graph_dict, subject=True)
  File "/home/zjy/myprograms/RE-Net-master/global_model.py", line 88, in predict
    rnn_inp = self.aggregator.predict(t, self.ent_embeds, graph_dict, reverse=reverse)
  File "/home/zjy/myprograms/RE-Net-master/Aggregator.py", line 93, in predict
    move_dgl_to_cuda(graph_dict[tim.item()])
  File "/home/zjy/myprograms/RE-Net-master/utils.py", line 141, in move_dgl_to_cuda
    g.ndata.update({k: cuda(g.ndata[k]) for k in g.ndata})
  File "/home/zjy/miniconda3/envs/dgl/lib/python3.9/_collections_abc.py", line 941, in update
    self[key] = other[key]
  File "/home/zjy/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/view.py", line 81, in __setitem__
    self._graph._set_n_repr(self._ntid, self._nodes, {key : val})
  File "/home/zjy/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/heterograph.py", line 4113, in _set_n_repr
    raise DGLError('Cannot assign node feature "{}" on device {} to a graph on'
dgl._ffi.base.DGLError: Cannot assign node feature "id" on device cuda:0 to a graph on device cpu. Call DGLGraph.to() to copy the graph to the same device.

davidshumway commented 2 years ago

This works: https://github.com/INK-USC/RE-Net/issues/50#issuecomment-964130940

zhangjinyu19980915 commented 2 years ago

Thank you for your reply. However, after I solved this problem in "pretrain.py" with your help, another problem came up in "train.py". The node feature is "id", not "h". How can I solve this problem? I look forward to your reply.

Traceback (most recent call last):
  File "/home/zjy/myprograms/RE-Net/train.py", line 241, in <module>
    train(args)
  File "/home/zjy/myprograms/RE-Net/train.py", line 172, in train
    ranks, loss = model.evaluate_filter(batch_data, (s_hist, s_hist_t), (o_hist, o_hist_t), global_model, total_data)
  File "/home/zjy/myprograms/RE-Net/model.py", line 388, in evaluate_filter
    loss, sub_pred, ob_pred = self.predict(triplet, s_hist, o_hist, global_model)
  File "/home/zjy/myprograms/RE-Net/model.py", line 223, in predict
    _, sub, prob_sub = global_model.predict(self.latest_time, self.graph_dict, subject=True)
  File "/home/zjy/myprograms/RE-Net/global_model.py", line 88, in predict
    rnn_inp = self.aggregator.predict(t, self.ent_embeds, graph_dict, reverse=reverse)
  File "/home/zjy/myprograms/RE-Net/Aggregator.py", line 93, in predict
    move_dgl_to_cuda(graph_dict[tim.item()])
  File "/home/zjy/myprograms/RE-Net/utils.py", line 141, in move_dgl_to_cuda
    g.ndata.update({k: cuda(g.ndata[k]) for k in g.ndata})
  File "/home/zjy/miniconda3/envs/dgl/lib/python3.9/_collections_abc.py", line 941, in update
    self[key] = other[key]
  File "/home/zjy/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/view.py", line 81, in __setitem__
    self._graph._set_n_repr(self._ntid, self._nodes, {key : val})
  File "/home/zjy/miniconda3/envs/dgl/lib/python3.9/site-packages/dgl/heterograph.py", line 4113, in _set_n_repr
    raise DGLError('Cannot assign node feature "{}" on device {} to a graph on'
dgl._ffi.base.DGLError: Cannot assign node feature "id" on device cuda:0 to a graph on device cpu. Call DGLGraph.to() to copy the graph to the same device.
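
The trace points at move_dgl_to_cuda in utils.py, which copies node/edge features to the GPU while the graph structure itself is still on the CPU. A hedged sketch of the same workaround applied to train.py (variable names follow the trace above; where exactly graph_dict is loaded may differ in your copy):

import torch

# After train.py loads graph_dict (the cached per-timestamp DGL graphs), move
# every graph onto the training GPU so that move_dgl_to_cuda later finds the
# graph and its features on the same device. 'cuda:0' matches the single
# Colab GPU; adjust the index for other machines.
if torch.cuda.is_available():
    graph_dict = {t: g.to(torch.device('cuda:0')) for t, g in graph_dict.items()}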

woojeongjin commented 2 years ago

Hi! This might be a DGL version issue. Which version are you using? Can you try version 0.4.3post2?
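
For reference, a quick way to check which DGL build is installed before retrying (a small sketch; the comment reflects my reading of the errors above):

import dgl
import torch

# The "Cannot assign node feature ... to a graph on device cpu" message is
# raised by DGL 0.5+, which enforces that a graph and its features share a
# device; RE-Net targets the older 0.4.x API, hence the 0.4.3post2 suggestion.
print('dgl', dgl.__version__, '| torch', torch.__version__)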