Closed: VoVAllen closed this issue 2 years ago.
Reproduction:
Run in directory examples/pytorch/correct_and_smooth:
python main.py --dataset ogbn-products --model linear --dropout 0.5 --epochs 1000 --lr 0.1 --gpu -1
python main.py --dataset ogbn-products --model linear --pretrain --correction-alpha 1. --smoothing-alpha 0.9 --gpu -1
I wasn't able to reproduce it on my p2.8x (Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz), but that's a newer CPU.
@sanchit-misra Could you please confirm if @VoVAllen's hypothesis is indeed the case?
@BarclayII will take a look.
I was not able to reproduce this, but I don't have access to such an old system. :-) According to the libxsmm developers (who are my colleagues), although libxsmm did not support any architecture without at least AVX2, it did not explicitly check whether the underlying architecture was supported, so it would not throw an error on an unsupported architecture. I am therefore not sure where this error came from.
Having said that, libxsmm now explicitly checks whether the architecture is supported, and if it is not, it returns a nullptr kernel. I check for this and fall back to the naive kernel.
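Since the failure hinges on AVX2 support, here is a quick way to check whether your CPU advertises it (a Linux-only sketch that reads /proc/cpuinfo directly; not part of DGL):

# Linux-only sketch: check whether the CPU advertises AVX2,
# the minimum ISA libxsmm targets according to the comment above.
with open("/proc/cpuinfo") as f:
    flags_line = next(line for line in f if line.startswith("flags"))
print("avx2 supported:", "avx2" in flags_line.split())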
I'm hitting this on an older Xeon as well:
$ cat /proc/cpuinfo
...
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2470 v2 @ 2.40GHz
stepping : 4
microcode : 0x42d
cpu MHz : 2399.897
cache size : 25600 KB
...
There's a GeForce RTX 2060 present in the system. I'm running a Python 3.9 virtualenv with stable DGL installed via pip install dgl dglgo -f https://data.dgl.ai/wheels/repo.html, all on Ubuntu 20.04.4 LTS Server. The backend is PyTorch. Happy to provide more info if you tell me what you need...
(Edit: just tried on my gaming PC, which has a Core i5-3570. Same error message. Unfortunately the above Xeon is the newest CPU I have access to anywhere, so it would be really cool if libxsmm were a little more accommodating to people who don't have the newest hardware. The Linux distro there is Manjaro with all the latest updates applied, so quite a different beast from the work server's Ubuntu ... the GPU is only a GTX 960, but I believe this is about the CPU, not the GPU.)
Hi, I've got the same issue as well:
File "HGCNN_2_caller.py", line 387, in <module>
output = model(dynamic_graphs, timestamps)
File "/home/choudhuri/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/nfs/vinci.1/home/choudhuri/temporal-gcn/Graph_GCN_V2.py", line 170, in forward
h_dict = self.new_layer_1_base(current_graph, train_embeds)
File "/home/choudhuri/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/nfs/vinci.1/home/choudhuri/temporal-gcn/Graph_GCN_V2.py", line 60, in forward
G.multi_update_all(funcs, "stack")
File "/home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/heterograph.py", line 5023, in multi_update_all
all_out[dtid].append(core.message_passing(g, mfunc, rfunc, afunc))
File "/home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/core.py", line 357, in message_passing
ndata = invoke_gspmm(g, mfunc, rfunc)
File "/home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/core.py", line 332, in invoke_gspmm
z = op(graph, x)
File "/home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/ops/spmm.py", line 189, in func
return gspmm(g, 'copy_lhs', reduce_op, x, None)
File "/home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/ops/spmm.py", line 75, in gspmm
ret = gspmm_internal(g._graph, op,
File "/home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/backend/pytorch/sparse.py", line 757, in gspmm
return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
File "/home/choudhuri/anaconda3/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 118, in decorate_fwd
return fwd(*args, **kwargs)
File "/home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/backend/pytorch/sparse.py", line 126, in forward
out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
File "/home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/sparse.py", line 228, in _gspmm
_CAPI_DGLKernelSpMM(gidx, op, reduce_op,
File "dgl/_ffi/_cython/./function.pxi", line 293, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 239, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [13:06:10] /opt/dgl/src/array/cpu/./spmm_blocking_libxsmm.h:267: Failed to generate libxsmm kernel for the SpMM operation!
Stack trace:
[bt] (0) /home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fcf9d6c72ef]
[bt] (1) /home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/libdgl.so(void dgl::aten::cpu::SpMMRedopCsrOpt<long, float, dgl::aten::cpu::op::CopyLhs<float>, dgl::aten::cpu::op::Add<float> >(dgl::BcastOff const&, dgl::aten::CSRMatrix const&, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray)+0x3d4) [0x7fcf9d90c304]
[bt] (2) /home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/libdgl.so(void dgl::aten::cpu::SpMMSumCsrLibxsmm<long, float, dgl::aten::cpu::op::CopyLhs<float> >(dgl::BcastOff const&, dgl::aten::CSRMatrix const&, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray)+0x73) [0x7fcf9d90c3b3]
[bt] (3) /home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/libdgl.so(void dgl::aten::cpu::SpMMSumCsr<long, float, dgl::aten::cpu::op::CopyLhs<float> >(dgl::BcastOff const&, dgl::aten::CSRMatrix const&, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray)+0x12f) [0x7fcf9d9279bf]
[bt] (4) /home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/libdgl.so(void dgl::aten::SpMMCsr<1, long, 32>(std::string const&, std::string const&, dgl::BcastOff const&, dgl::aten::CSRMatrix const&, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >)+0xcd3) [0x7fcf9d93dd13]
[bt] (5) /home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/libdgl.so(dgl::aten::SpMM(std::string const&, std::string const&, std::shared_ptr<dgl::BaseHeteroGraph>, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >)+0x13d5) [0x7fcf9d96ff65]
[bt] (6) /home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/libdgl.so(+0x4703e8) [0x7fcf9d9843e8]
[bt] (7) /home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/libdgl.so(+0x470981) [0x7fcf9d984981]
[bt] (8) /home/choudhuri/anaconda3/lib/python3.8/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fcf9d9d62d8]
I am using a system with Intel(R) Xeon(R) CPU E5645 @ 2.40GHz - 1.77/2.40GHz.
I seem to be getting the issue while using multi_update_all.
Hello, I've got the same issue as well:
Traceback (most recent call last):
File "/home/coder/project/project/GraphProject/zgraph-lite/test_gcn.py", line 25, in test_gcn_ogb
emb, predicted, model_bs = model.train(g.ndata['feat'].to(device),
File "/home/coder/project/project/GraphProject/zgraph-lite/zgraph/alg/embedding/gcn/gcn.py", line 68, in train
loss = model.get_loss(blocks, feat_in, label_in)
File "/home/coder/project/project/GraphProject/zgraph-lite/zgraph/alg/embedding/gcn/gcn.py", line 119, in get_loss
logits = self.forward(blocks, feat_in)
File "/home/coder/project/project/GraphProject/zgraph-lite/zgraph/alg/embedding/gcn/gcn.py", line 115, in forward
h = layer(block, h)
File "/home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 423, in forward
graph.update_all(aggregate_fn, fn.sum(msg='m', out='h'))
File "/home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/dgl/heterograph.py", line 4895, in update_all
ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
File "/home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/dgl/core.py", line 357, in message_passing
ndata = invoke_gspmm(g, mfunc, rfunc)
File "/home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/dgl/core.py", line 332, in invoke_gspmm
z = op(graph, x)
File "/home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/dgl/ops/spmm.py", line 189, in func
return gspmm(g, 'copy_lhs', reduce_op, x, None)
File "/home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/dgl/ops/spmm.py", line 75, in gspmm
ret = gspmm_internal(g._graph, op,
File "/home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/dgl/backend/pytorch/sparse.py", line 724, in gspmm
return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
File "/home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 118, in decorate_fwd
return fwd(*args, **kwargs)
File "/home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/dgl/backend/pytorch/sparse.py", line 106, in forward
out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
File "/home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/dgl/sparse.py", line 228, in _gspmm
_CAPI_DGLKernelSpMM(gidx, op, reduce_op,
File "dgl/_ffi/_cython/./function.pxi", line 293, in dgl._ffi._cy3.core.FunctionBase.call
File "dgl/_ffi/_cython/./function.pxi", line 239, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [16:15:23] /opt/dgl/src/array/cpu/./spmm_blocking_libxsmm.h:267: Failed to generate libxsmm kernel for the SpMM operation!
Stack trace:
[bt] (0) /home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f7e50ea5f5f]
[bt] (1) /home/coder/bin/anaconda3/envs/test_dgl/lib/python3.9/site-packages/dgl/libdgl.so(void dgl::aten::cpu::SpMMRedopCsrOpt<long, float, dgl::aten::cpu::op::CopyLhs
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
Here's some activity. Everything about what constitutes the issue has been said, hasn't it? How is closing it automatically by means of a bot going to accomplish anything other than the digital version of sweeping it under the rug?
Hi @sixtyfive, there is already a PR fixing this issue: #4455
The reason for this error is that libxsmm is not supported on some older CPUs. As it is not easy to pre-check whether the library is supported on the current CPU, we provide an API to disable it at runtime:
dgl.use_libxsmm(bool)
A message is printed the first time this library fails, and you can then use the API to disable it for subsequent runs.
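For example (a minimal sketch; call it once, before any message passing):

import dgl

# Disable the libxsmm SpMM kernels for this process so DGL falls back
# to its naive CPU kernel on architectures libxsmm cannot generate code for.
dgl.use_libxsmm(False)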
Feel free to reopen if you have any further questions.
@peizhou001
Sorry for reopening this old thread, but I am not a programmer and can't figure out how to use this API call. For compatibility reasons I've installed DGL 1.0.2.
Where, and how, do I use dgl.use_libxsmm(flag)?
Thanks.
🐛 Bug
A user reported errors raised at https://github.com/dmlc/dgl/blob/983a4fdd1981a6eaa4a3343ec4116739e9f97dfa/src/array/cpu/spmm_blocking_libxsmm.h#L267
We should not raise the error but instead fall back to the naive kernel.
User's CPU model: Xeon(R) CPU E5-2695 v2 @ 2.40GHz. This is an old CPU, produced in 2013, which might not be supported by LibXSMM now. https://ark.intel.com/content/www/us/en/ark/products/75281/intel-xeon-processor-e52695-v2-30m-cache-2-40-ghz.html
Possible Solution
Catch the error at https://github.com/dmlc/dgl/blob/983a4fdd1981a6eaa4a3343ec4116739e9f97dfa/src/array/cpu/spmm.h#L144 and run the naive kernel if an error is detected.
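Until such a fallback lands in the C++ kernels, the same idea can be approximated at the Python level with the dgl.use_libxsmm toggle mentioned above (a sketch; copy_u_sum stands in for whichever SpMM-backed op fails, and g and feat are a hypothetical graph and feature tensor):

import dgl
import dgl.ops
from dgl._ffi.base import DGLError

def copy_u_sum_with_fallback(g, feat):
    # Try the default (possibly libxsmm-accelerated) SpMM first; if kernel
    # generation fails as reported above, disable libxsmm and retry so the
    # naive CPU kernel is used instead.
    try:
        return dgl.ops.copy_u_sum(g, feat)
    except DGLError:
        dgl.use_libxsmm(False)
        return dgl.ops.copy_u_sum(g, feat)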