WUyinwei-hah opened this issue 1 year ago
This seems to be related to `blocks[i] = dgl.add_self_loop(blocks[i])`. @BarclayII Is there a suggested practice for adding self-loops to DGLBlocks?
Is `add_self_loop` a part of the model code? If so, I don't think there is a substitute for adding self-loops on blocks. One needs to manually add the edge pairs and call `create_block` directly. Or we could repurpose `add_self_loop` to support DGLBlock as well.
Thanks for your attention and reply! `GMMConv` does not allow zero-in-degree nodes in the input graph, which is why I call `add_self_loop`. But when I remove the `add_self_loop` and set `allow_zero_in_degree=True` in `GMMConv` as shown in test_monet1.py, I get a new "an illegal memory access" error:
```
Traceback (most recent call last):
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 107, in
```
Is this a problem related to CUDA? I only encounter the error when using `GMMConv`.
Thank you, I tried `add_self_loop` before feeding the graph into `dgl.dataloading.DataLoader`, as shown in the third version of my code. The "Invalid vertex type: 1" error is gone, but a "CUDA error: an illegal memory access was encountered" error occurs:
```
Traceback (most recent call last):
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 108, in <module>
    output = model(blocks, batch_inputs)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 47, in forward
    h = self.layers[i](
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/nn/pytorch/conv/gmmconv.py", line 220, in forward
    if (graph.in_degrees() == 0).any():
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
Did you try `CUDA_LAUNCH_BLOCKING=1`?
Thanks for the reminder. The error changed to:
```
Traceback (most recent call last):
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 110, in <module>
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 45, in forward
    us, vs = blocks[i].edges(order='eid')
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/heterograph.py", line 3435, in in_degrees
    deg = self._graph.in_degrees(etid, v_tensor)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/heterograph_index.py", line 669, in in_degrees
    _CAPI_DGLHeteroInDegrees(self, int(etype), F.to_dgl_nd(v))
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [19:26:06] /opt/dgl/src/array/cuda/spmat_op_impl_csr.cu:163: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA kernel launch error: an illegal memory access was encountered
Stack trace:
  [bt] (0) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fd85ef1b30f]
  [bt] (1) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray dgl::aten::impl::CSRGetRowNNZ<(DGLDeviceType)2, long>(dgl::aten::CSRMatrix, dgl::runtime::NDArray)+0x1f2) [0x7fd85fd4c122]
  [bt] (2) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::aten::CSRGetRowNNZ(dgl::aten::CSRMatrix, dgl::runtime::NDArray)+0x368) [0x7fd85ef001e8]
  [bt] (3) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::UnitGraph::CSR::OutDegrees(unsigned long, dgl::runtime::NDArray) const+0xb4) [0x7fd85f3a9a84]
  [bt] (4) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::UnitGraph::InDegrees(unsigned long, dgl::runtime::NDArray) const+0xe2) [0x7fd85f3a23a2]
  [bt] (5) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::HeteroGraph::InDegrees(unsigned long, dgl::runtime::NDArray) const+0x46) [0x7fd85f29f246]
  [bt] (6) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(+0x735309) [0x7fd85f2a9309]
  [bt] (7) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fd85f22ee58]
  [bt] (8) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/_ffi/_cy3/core.cpython-38-x86_64-linux-gnu.so(+0x163fc) [0x7fd97ba043fc]
```
I have no clues for now. See if other people have any ideas.
Could you try running on CPU and see what is the error? CUDA error messages could be quite obscure sometimes and switching to CPU could give better error messages.
When I switched to CPU execution, the error disappeared.
This probably means that it's a more subtle bug in CUDA. We will check out your code and see if it's reproducible. Also, when switching to CPU, did you notice any anomaly in training (NaN etc.)?
Sorry for ignoring that. When switching to CPU, the `GMMConv` layer output is NaN.
I am very sorry to disturb you again. Are we sure this is a bug now? Or is it just a mistake on my part?
Any thoughts? @BarclayII
@WUyinwei-hah not sure if this is a bug on our side or your side. Could you try detecting where the NaN is thrown? You could do so with `torch.autograd.detect_anomaly`.
It seems that the NaN is thrown by `LogSoftmaxBackward0`:
```
/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py:104: UserWarning: Anomaly Detection has been enabled. This mode will increase the runtime and should only be enabled for debugging.
  with autograd.detect_anomaly():
/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/dataloading/dataloader.py:869: DGLWarning: Dataloader CPU affinity opt is not enabled, consider switching it on (see enable_cpu_affinity() or CPU best practices for DGL [https://docs.dgl.ai/tutorials/cpu/cpu_best_practises.html])
  dgl_warning(f'Dataloader CPU affinity opt is not enabled, consider switching it on '
/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in LogSoftmaxBackward0. Traceback of forward call that caused the error:
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 117, in <module>
    loss = F.cross_entropy(output, batch_labels)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/functional.py", line 3014, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:102.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 119, in <module>
    loss.backward()
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
```
Hmm, does `output` have infinity or NaN values? And does `batch_labels` look normal (e.g. the values are all non-negative)?
The `output` values are mostly `nan`. The `batch_labels` and `batch_inputs` look quite normal:
I see, then NaN is occurring in the forward propagation. Could you locate which computation does not have NaN as input but outputs NaN? I'm not sure if PyTorch has a tool detecting NaNs in the forward pass, so you might need to print `torch.isnan(...).any()` here and there.
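One way to automate that search, purely a sketch using standard PyTorch forward hooks (nothing DGL-specific; the helper name is made up), is to register a hook on every submodule and raise as soon as an output contains NaN:

```python
import torch
import torch.nn as nn

def register_nan_checks(model):
    # Debugging aid: raise at the first module whose forward output
    # contains a NaN, naming the offending layer class.
    def check(module, inputs, output):
        if isinstance(output, torch.Tensor) and torch.isnan(output).any():
            raise RuntimeError(f"NaN in output of {type(module).__name__}")
    for m in model.modules():
        m.register_forward_hook(check)

# Demo on a stand-in model (not the MoNet from the issue).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
register_nan_checks(model)
model(torch.randn(3, 4))  # clean input passes through silently
```

Feeding a batch whose features already contain NaN would make the first `Linear` hook fire, pinpointing where the NaN enters.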
@BarclayII I ran the following code:

```python
us, vs = blocks[i].edges(order='eid')
udeg, vdeg = 1 / torch.sqrt(blocks[i].in_degrees(us).float()), 1 / torch.sqrt(blocks[i].in_degrees(vs).float())
```

and found `blocks[i].in_degrees(us).float()` produced many huge numbers, including negative ones, but in contrast, `vdeg` looks normal.
I see the problem. `block` is a directed bipartite graph, and the source nodes (i.e. `us`) only have outgoing edges. So you should call `blocks[i].out_degrees(us)` and `blocks[i].in_degrees(vs)`.

I guess since there is no incoming edge type for the source nodes and we did not have a sanity check for that, it returned arbitrary content. We should add a sanity check for `in_degrees` and `out_degrees`.
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
🐛 Bug
When I try to train MoNet with the built-in `GMMConv` module, I get the errors below:
```
Traceback (most recent call last):
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 116, in <module>
    output = model(blocks, batch_inputs)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Wuyinwei/Desktop/Distill_GCN_Experiment/test_monet.py", line 56, in forward
    h = self.layers[i](
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/nn/pytorch/conv/gmmconv.py", line 220, in forward
    if (graph.in_degrees() == 0).any():
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/heterograph.py", line 3433, in in_degrees
    v = self.dstnodes(dsttype)
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/view.py", line 51, in __call__
    self._graph._graph.number_of_nodes(ntid),
  File "/home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/heterograph_index.py", line 376, in number_of_nodes
    return _CAPI_DGLHeteroNumVertices(self, int(ntype))
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [10:56:50] /opt/dgl/src/graph/./heterograph.h:67: Check failed: meta_graph_->HasVertex(vtype): Invalid vertex type: 1
Stack trace:
  [bt] (0) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fd7dd3ab30f]
  [bt] (1) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::HeteroGraph::NumVertices(unsigned long) const+0xa2) [0x7fd7dd72e162]
  [bt] (2) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(+0x733aed) [0x7fd7dd737aed]
  [bt] (3) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fd7dd6bee58]
  [bt] (4) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/_ffi/_cy3/core.cpython-38-x86_64-linux-gnu.so(+0x163fc) [0x7fd8f9e943fc]
  [bt] (5) /home/Wuyinwei/anaconda3/envs/ldgl/lib/python3.8/site-packages/dgl/_ffi/_cy3/core.cpython-38-x86_64-linux-gnu.so(+0x1692b) [0x7fd8f9e9492b]
  [bt] (6) /home/Wuyinwei/anaconda3/envs/ldgl/bin/python(_PyObject_MakeTpCall+0x3eb) [0x4d14db]
  [bt] (7) /home/Wuyinwei/anaconda3/envs/ldgl/bin/python(_PyEval_EvalFrameDefault+0x4f48) [0x4cc578]
  [bt] (8) /home/Wuyinwei/anaconda3/envs/ldgl/bin/python() [0x4e8af7]
```
I think it is a problem related to `MultiLayerFullNeighborSampler` or `DataLoader`, since I got no errors when training without neighbor sampling.

To Reproduce
Steps to reproduce the behavior:
The code I ran is at: test_monet.py
Environment
Additional context