WUyinwei-hah opened this issue 1 year ago
This seems to be related to `blocks[i] = dgl.add_self_loop(blocks[i])`. @BarclayII Is there a suggested practice for adding self-loops to DGLBlocks?
Is `add_self_loop` a part of the model code? If so, I don't think there is a substitute for adding self-loops on blocks. One needs to manually add the edge pairs and call `create_block` directly. Or we could repurpose `add_self_loop` to support DGLBlock as well.
Thanks for your attention and reply! `GMMConv` does not allow zero-in-degree nodes in the input graph, which is why I call `add_self_loop`. But when I remove the `add_self_loop` and set `allow_zero_in_degree=True` in `GMMConv` as shown in test_monet1.py, I get a new "an illegal memory access" error:
```
Traceback (most recent call last):
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 107, in
```
Is this a problem related to CUDA? I only encounter the error when using `GMMConv`.
Thank you, I tried `add_self_loop` before feeding the graph into `dgl.dataloading.DataLoader`, as shown in the third version of my code. The "Invalid vertex type: 1" error is gone, but a "CUDA error: an illegal memory access was encountered" error occurs:
```
Traceback (most recent call last):
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 108, in <module>
    output = model(blocks, batch_inputs)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 47, in forward
    h = self.layers[i](
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/nn/pytorch/conv/gmmconv.py", line 220, in forward
    if (graph.in_degrees() == 0).any():
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
Did you try `CUDA_LAUNCH_BLOCKING=1`?
Thanks for the reminder. The error changed to:
```
Traceback (most recent call last):
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 110, in <module>
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 45, in forward
    us, vs = blocks[i].edges(order='eid')
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/heterograph.py", line 3435, in in_degrees
    deg = self._graph.in_degrees(etid, v_tensor)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/heterograph_index.py", line 669, in in_degrees
    _CAPI_DGLHeteroInDegrees(self, int(etype), F.to_dgl_nd(v))
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [19:26:06] /opt/dgl/src/array/cuda/spmat_op_impl_csr.cu:163: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA kernel launch error: an illegal memory access was encountered
Stack trace:
  [bt] (0) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fd85ef1b30f]
  [bt] (1) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray dgl::aten::impl::CSRGetRowNNZ<(DGLDeviceType)2, long>(dgl::aten::CSRMatrix, dgl::runtime::NDArray)+0x1f2) [0x7fd85fd4c122]
  [bt] (2) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::aten::CSRGetRowNNZ(dgl::aten::CSRMatrix, dgl::runtime::NDArray)+0x368) [0x7fd85ef001e8]
  [bt] (3) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::UnitGraph::CSR::OutDegrees(unsigned long, dgl::runtime::NDArray) const+0xb4) [0x7fd85f3a9a84]
  [bt] (4) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::UnitGraph::InDegrees(unsigned long, dgl::runtime::NDArray) const+0xe2) [0x7fd85f3a23a2]
  [bt] (5) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::HeteroGraph::InDegrees(unsigned long, dgl::runtime::NDArray) const+0x46) [0x7fd85f29f246]
  [bt] (6) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(+0x735309) [0x7fd85f2a9309]
  [bt] (7) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fd85f22ee58]
  [bt] (8) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/_ffi/_cy3/core.cpython-38-x86_64-linux-gnu.so(+0x163fc) [0x7fd97ba043fc]
```
I have no clues for now. See if other people have any ideas.
Could you try running on CPU and see what is the error? CUDA error messages could be quite obscure sometimes and switching to CPU could give better error messages.
When I switched to CPU execution, the error disappeared.
This probably means that it's a more subtle bug in CUDA. We will check out your code and see if it's reproducible. Also, when switching to CPU, did you notice any anomaly in training (NaN etc.)?
Sorry for ignoring that. When switching to CPU, the `GMMConv` layer output is NaN.
I am very sorry to disturb you again. Are we sure this is a bug now? Or is it just a mistake on my part?
Any thoughts? @BarclayII
@WUyinwei-hah not sure if this is a bug on our side or your side. Could you try detecting where the NaN is thrown? You could do so with `torch.autograd.detect_anomaly`.
It seems that the NaN is thrown by `LogSoftmaxBackward0`:
```
/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py:104: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with autograd.detect_anomaly():
/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/dataloading/dataloader.py:869: DGLWarning: Dataloader CPU affinity opt is not enabled, consider switching it on (see enable_cpu_affinity() or CPU best practices for DGL [https://docs.dgl.ai/tutorials/cpu/cpu_best_practises.html])
  dgl_warning(f'Dataloader CPU affinity opt is not enabled, consider switching it on '
/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in LogSoftmaxBackward0. Traceback of forward call that caused the error:
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 117, in <module>
    loss = F.cross_entropy(output, batch_labels)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:102.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 119, in <module>
    loss.backward()
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
```
Hmm, does `output` have infinity or NaN values? And does `batch_labels` look normal (e.g. the values are all non-negative)?
The `output` values are mostly `nan`. The `batch_labels` and `batch_inputs` look quite normal:
I see, then NaN is occurring in the forward propagation. Could you locate which computation does not have NaN as input but outputs NaN? I'm not sure if PyTorch has a tool detecting NaNs in the forward pass, so you might need to print `torch.isnan(...).any()` here and there.
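One way to automate that search, purely a sketch using standard PyTorch forward hooks (nothing DGL-specific; the helper name is made up), is to register a hook on every submodule and raise as soon as an output contains NaN:

```python
import torch
import torch.nn as nn

def register_nan_checks(model):
    # Debugging aid: raise at the first module whose forward output
    # contains a NaN, naming the offending layer class.
    def check(module, inputs, output):
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            raise RuntimeError(f"NaN in output of {type(module).__name__}")
    for m in model.modules():
        m.register_forward_hook(check)

# Demo on a stand-in model (not the MoNet from the issue).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
register_nan_checks(model)
model(torch.randn(3, 4))  # clean input passes through silently
```

Feeding a batch whose features already contain NaN would make the first `Linear` hook fire, pinpointing where the NaN enters.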
@BarclayII I ran the following code:

```python
us, vs = blocks[i].edges(order='eid')
udeg, vdeg = 1 / torch.sqrt(blocks[i].in_degrees(us).float()), 1 / torch.sqrt(blocks[i].in_degrees(vs).float())
```

and found `blocks[i].in_degrees(us).float()` produced many huge numbers, including negative ones, but in contrast, `vdeg` looks normal.
I see the problem. `block` is a directed bipartite graph, and the source nodes (i.e. `us`) only have outgoing edges. So you should call `blocks[i].out_degrees(us)` and `blocks[i].in_degrees(vs)`.

I guess since there is no incoming edge type for the source nodes and we did not have a sanity check for that, it returned arbitrary content. We should add a sanity check for `in_degrees` and `out_degrees`.
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
🐛 Bug
When I try to train MoNet with the built-in `GMMConv` module, I get the errors below:
```
Traceback (most recent call last):
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 116, in <module>
    output = model(blocks, batch_inputs)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 56, in forward
    h = self.layers[i](
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/nn/pytorch/conv/gmmconv.py", line 220, in forward
    if (graph.in_degrees() == 0).any():
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/heterograph.py", line 3433, in in_degrees
    v = self.dstnodes(dsttype)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/view.py", line 51, in __call__
    self._graph._graph.number_of_nodes(ntid),
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/heterograph_index.py", line 376, in number_of_nodes
    return _CAPI_DGLHeteroNumVertices(self, int(ntype))
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [10:56:50] /opt/dgl/src/graph/./heterograph.h:67: Check failed: meta_graph_->HasVertex(vtype): Invalid vertex type: 1
Stack trace:
  [bt] (0) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fd7dd3ab30f]
  [bt] (1) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::HeteroGraph::NumVertices(unsigned long) const+0xa2) [0x7fd7dd72e162]
  [bt] (2) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(+0x733aed) [0x7fd7dd737aed]
  [bt] (3) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fd7dd6bee58]
  [bt] (4) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/_ffi/_cy3/core.cpython-38-x86_64-linux-gnu.so(+0x163fc) [0x7fd8f9e943fc]
  [bt] (5) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/_ffi/_cy3/core.cpython-38-x86_64-linux-gnu.so(+0x1692b) [0x7fd8f9e9492b]
  [bt] (6) /home/Wuyinwei/anaconda3/envs/ldgl/bin/python(_PyObject_MakeTpCall+0x3eb) [0x4d14db]
  [bt] (7) /home/Wuyinwei/anaconda3/envs/ldgl/bin/python(_PyEval_EvalFrameDefault+0x4f48) [0x4cc578]
  [bt] (8) /home/Wuyinwei/anaconda3/envs/ldgl/bin/python() [0x4e8af7]
```
I think it is a problem related to `MultiLayerFullNeighborSampler` or `DataLoader`, since I got no errors when training without neighbor sampling.

To Reproduce
Steps to reproduce the behavior:
The code I ran is at: test_monet.py
Environment
Additional context