SJTU-IPADS / wukong

A graph-based distributed in-memory store that leverages efficient graph exploration to provide highly concurrent and low-latency queries over big linked data
http://ipads.se.sjtu.edu.cn/projects/wukong
Apache License 2.0

got bad completion with status: 0xc, vendor syndrome: 0x81 #16

Closed: wukuser closed this issue 3 years ago

wukuser commented 3 years ago

Dear all,

I'm trying to run Wukong with RDMA enabled, but I'm getting this error at loading time:

got bad completion with status: 0xc, vendor syndrome: 0x81, with error transport retry counter exceeded, qp n:0 t:1
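For readers decoding this message: the status field is a libibverbs work-completion status (enum ibv_wc_status), and 0xc (12) is IBV_WC_RETRY_EXC_ERR, which ibv_wc_status_str() renders as "transport retry counter exceeded". This usually means the sender got no ACK from the peer QP within the configured retry count, for example because the remote QP was not yet ready or was given a wrong address or QP number. The Python sketch here uses a hypothetical helper (decode_bad_completion is not part of Wukong) to pull the fields out of such a log line. The vendor syndrome is adapter-specific, so the sketch only extracts it and does not interpret it.

```python
import re
from typing import Optional, Tuple

# Numeric values of the first 14 entries of libibverbs' enum ibv_wc_status
# (<infiniband/verbs.h>); in C/C++, ibv_wc_status_str() gives readable text.
IBV_WC_STATUS = {
    0x0: "IBV_WC_SUCCESS",
    0x1: "IBV_WC_LOC_LEN_ERR",
    0x2: "IBV_WC_LOC_QP_OP_ERR",
    0x3: "IBV_WC_LOC_EEC_OP_ERR",
    0x4: "IBV_WC_LOC_PROT_ERR",
    0x5: "IBV_WC_WR_FLUSH_ERR",
    0x6: "IBV_WC_MW_BIND_ERR",
    0x7: "IBV_WC_BAD_RESP_ERR",
    0x8: "IBV_WC_LOC_ACCESS_ERR",
    0x9: "IBV_WC_REM_INV_REQ_ERR",
    0xA: "IBV_WC_REM_ACCESS_ERR",
    0xB: "IBV_WC_REM_OP_ERR",
    0xC: "IBV_WC_RETRY_EXC_ERR",
    0xD: "IBV_WC_RNR_RETRY_EXC_ERR",
}


def decode_bad_completion(line: str) -> Tuple[int, str, Optional[int]]:
    """Hypothetical helper: return (status, status name, vendor syndrome)."""
    m = re.search(r"status:\s*0x([0-9a-fA-F]+)", line)
    if m is None:
        raise ValueError("no 'status: 0x..' field in line")
    status = int(m.group(1), 16)
    name = IBV_WC_STATUS.get(status, "unknown status")
    # The vendor syndrome is vendor-specific; report the raw value only.
    v = re.search(r"vendor syndrome:\s*0x([0-9a-fA-F]+)", line)
    syndrome = int(v.group(1), 16) if v else None
    return status, name, syndrome


if __name__ == "__main__":
    msg = ("got bad completion with status: 0xc, vendor syndrome: 0x81, "
           "with error transport retry counter exceeded, qp n:0 t:1")
    print(decode_bad_completion(msg))  # (12, 'IBV_WC_RETRY_EXC_ERR', 129)
```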

It seems that others have had this issue before (#5 and #15).

Answers in issue #5 mention using a new version of libRDMA, but I can't access it. Answers in issue #15 suggest changing ports or adding SO_REUSEPORT to setsockopt(); I tried both and still get the same error.
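For reference, the #15 workaround amounts to setting SO_REUSEPORT (commonly alongside SO_REUSEADDR) on a TCP socket before bind(). The thread does not say which socket in Wukong this applies to, so this Python sketch uses a generic, hypothetical open_listener() helper; in C/C++ the equivalent call is setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)). Note that this only helps when a port cannot be bound. It does not touch the RDMA data path, which may explain why it made no difference to a transport retry error.

```python
import socket


def open_listener(port: int, host: str = "0.0.0.0") -> socket.socket:
    """Hypothetical helper: open a TCP listener that tolerates quick rebinds."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow rebinding while an old connection on this port is in TIME_WAIT.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # SO_REUSEPORT lets several sockets bind the same port; guard it
    # because not every platform exposes the constant.
    if hasattr(socket, "SO_REUSEPORT"):
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen()
    return s
```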

Any ideas?

wxdwfc commented 3 years ago

Hi,

Can you give some more information about your hardware configuration? For example, which NIC are you using, and what is the link type (RoCE or InfiniBand (IB))? Are the two nodes correctly connected over RDMA? As mentioned in #5, Wukong has so far only been tested on IB, not RoCE.
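One way to answer the link-type question on Linux is to read the per-port link_layer attribute under /sys/class/infiniband/<device>/ports/<port>/, which reports InfiniBand or Ethernet (the latter for RoCE); ibv_devinfo prints the same field. This sketch (a hypothetical helper, assuming the standard sysfs layout) lists every local port with its link layer and state, where the state file reads something like "4: ACTIVE". It only shows the local view; to confirm that the two nodes can actually talk over RDMA, a point-to-point test between them such as ibv_rc_pingpong or the perftest tools (ib_send_bw, ib_write_bw) is more conclusive.

```python
import os
from typing import Dict, List


def _read_attr(path: str) -> str:
    """Return a sysfs attribute's stripped contents, or 'unknown' if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "unknown"


def rdma_port_summary(root: str = "/sys/class/infiniband") -> List[Dict[str, str]]:
    """List each RDMA port with its link layer (InfiniBand/Ethernet) and state."""
    summary: List[Dict[str, str]] = []
    if not os.path.isdir(root):
        return summary  # no RDMA devices registered on this host
    for dev in sorted(os.listdir(root)):
        ports_dir = os.path.join(root, dev, "ports")
        if not os.path.isdir(ports_dir):
            continue
        # Port directories are named by port number ("1", "2", ...).
        for port in sorted((p for p in os.listdir(ports_dir) if p.isdigit()), key=int):
            port_dir = os.path.join(ports_dir, port)
            summary.append({
                "device": dev,
                "port": port,
                "link_layer": _read_attr(os.path.join(port_dir, "link_layer")),
                "state": _read_attr(os.path.join(port_dir, "state")),
            })
    return summary


if __name__ == "__main__":
    for p in rdma_port_summary():
        print(f"{p['device']} port {p['port']}: {p['link_layer']}, {p['state']}")
```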

Thanks!

wukuser commented 3 years ago

Hi,

Thank you for your reply.

The links are indeed InfiniBand, and the machines use Mellanox ConnectX-4 adapters. Note that to avoid an error at startup, I had to add the following entry to $WUKONG_ROOT/deps/openmpi-1.6.5-install/share/openmpi/mca-btl-openib-device-params.ini:

[Mellanox ConnectX4]
vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708,0x03ba,0x15b3,0x119f
vendor_part_id = 4115
use_eager_rdma = 1
mtu = 4096
max_inline_data = 256
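Why an entry like this silences the startup error: Open MPI's openib BTL looks up the adapter in mca-btl-openib-device-params.ini by its vendor_id and vendor_part_id. Mellanox's vendor ID 0x15b3 is in the list above, and a ConnectX-4 reports part ID 4115 (0x1013), which ibv_devinfo shows as vendor_part_id. This is a hypothetical check (entry_matches is not an Open MPI function, and it is a simplified lookup rather than Open MPI's own parser) that reads such a section with Python's configparser and tests whether a given device would match it.

```python
import configparser


def entry_matches(ini_text: str, section: str, vendor_id: int, part_id: int) -> bool:
    """Hypothetical check: does (vendor_id, part_id) match the given INI section?"""
    # interpolation=None: treat '%' literally; strict=False: tolerate duplicates.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        return False
    sec = parser[section]
    # Values are comma-separated; base 0 accepts both hex ("0x15b3") and decimal ("4115").
    vendors = {int(v, 0) for v in sec.get("vendor_id", "").split(",") if v.strip()}
    parts = {int(p, 0) for p in sec.get("vendor_part_id", "").split(",") if p.strip()}
    return vendor_id in vendors and part_id in parts
```

Usage: entry_matches(text, "Mellanox ConnectX4", 0x15b3, 0x1013) is true for the section above, since 0x1013 equals 4115.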