WukLab / LITE

LITE Kernel RDMA Support for Datacenter Applications. SOSP 2017.

Failed to join cluster with two or more client nodes #9

Closed (alvinkwok1 closed this issue 6 years ago)

alvinkwok1 commented 6 years ago

Joining the cluster fails when there are two or more client nodes.

error:

    [root@localhost cluster-manager]# ./mgmt-server
    Initialize Server
    Hostname: localhost.localdomain
    IB-port: 1
    Eth-port: 18500
    Option: 0
    max qp 163776
    UD qpn 96
    local address: LID 0x0000, QPN 0x000061, PSN 0x1d3b53, GID fe80::6eb3:11ff:fe4d:ca8
    local address: LID 0x0000, QPN 0x000062, PSN 0xb455a1, GID fe80::6eb3:11ff:fe4d:ca8
    loopback create successfully
    Do a post-receive with 2048
    IB Preparation for the incoming 1 connection from 172.16.3.105
    send NODE_ID 1
    Get Connection from 172.16.3.105: 0001:0000:00004b:50f524:fe800000000000006eb311fffe4d0b88
    server_keep_server_alive: UD message from 1 with qpn 74 and lid 0: 0x7f32990888c0
    IB Preparation for the incoming 2 connection from 172.16.3.104
    send NODE_ID 2
    Get Connection from 172.16.3.104: 0002:0000:000069:4bd5c0:00000000000000000000000000000000
    server_keep_server_alive: UD message from 2 with qpn 104 and lid 0: (nil)
    Segmentation fault (core dumped)

master-system:

    [root@localhost cluster-manager]# uname -r
    3.10.108-lite-kernel

The kernel is 3.10.108 with the LITE patch applied. If I do not apply the patch, I can't install the official IB OFED libraries.

client-system: same as master-system. The libraries were installed through yum.

shinyehtsai commented 6 years ago

Are you using RoCE? The LID is zero here.

alvinkwok1 commented 6 years ago

Yes, I am using RoCE.

shinyehtsai commented 6 years ago

Did you enable LITE_ROCE in lite.h and client.h?

alvinkwok1 commented 6 years ago

Yes, I modified those files (lite.h in core and client.h in cluster_manager) and enabled #define LITE_ROCE.

shinyehtsai commented 6 years ago

That's weird. From the log, the server doesn't receive correct information from node 2:

    Get Connection from 172.16.3.105: 0001:0000:00004b:50f524:fe800000000000006eb311fffe4d0b88
    Get Connection from 172.16.3.104: 0002:0000:000069:4bd5c0:00000000000000000000000000000000

The last field for node 2 should not be all zeros. Did you also configure LITE_ROCE on node 2?
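The check described in this comment can be scripted against a saved mgmt-server log. The following is a minimal sketch, not part of LITE: the function name `find_zero_gid_nodes` is hypothetical, and the field layout (node ID, LID, QPN, PSN, GID, all in hexadecimal) is an assumption inferred from the two "Get Connection" lines quoted in this thread.

```python
import re

# Assumed layout of a "Get Connection" line, inferred from the log in this
# thread: <ip>: <node id>:<lid>:<qpn>:<psn>:<gid>, all fields in hex.
CONN_RE = re.compile(
    r"Get Connection from (\d+\.\d+\.\d+\.\d+): "
    r"([0-9a-f]{4}):([0-9a-f]{4}):([0-9a-f]{6}):([0-9a-f]{6}):([0-9a-f]{32})"
)


def find_zero_gid_nodes(log_text):
    """Return (ip, node_id) pairs whose last field (assumed GID) is all zeros."""
    suspects = []
    for ip, node_id, _lid, _qpn, _psn, gid in CONN_RE.findall(log_text):
        if int(gid, 16) == 0:
            suspects.append((ip, int(node_id, 16)))
    return suspects


# The two connection lines from this issue; the zero field is built with
# "0" * 32 so its length is explicit.
SAMPLE_LOG = (
    "Get Connection from 172.16.3.105: "
    "0001:0000:00004b:50f524:fe800000000000006eb311fffe4d0b88\n"
    "Get Connection from 172.16.3.104: "
    "0002:0000:000069:4bd5c0:" + "0" * 32 + "\n"
)

if __name__ == "__main__":
    for ip, node_id in find_zero_gid_nodes(SAMPLE_LOG):
        print(f"node {node_id} ({ip}) reported an all-zero last field")
```

Run against the sample, this flags only node 2 at 172.16.3.104, which matches the node singled out in the comment.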

alvinkwok1 commented 6 years ago

Thanks, I solved the problem. I had forgotten to configure LITE_ROCE on node 2.
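Since the root cause was one node built without LITE_ROCE, it may help to verify the headers on every node before building. The sketch below is an illustration only: `has_active_define` is a hypothetical helper, it only looks for an uncommented `#define LITE_ROCE` line, and it does not evaluate `#ifdef` or `#undef` logic. The two header names come from this thread; their exact paths in your checkout are an assumption you should adjust.

```python
import re


def has_active_define(header_text, macro="LITE_ROCE"):
    """Return True if header_text contains an uncommented '#define <macro>'."""
    # Drop /* ... */ block comments first so defines inside them are ignored.
    text = re.sub(r"/\*.*?\*/", "", header_text, flags=re.DOTALL)
    # A line starting with '//' cannot match, because only whitespace is
    # allowed before '#'. The trailing \b rejects longer macro names.
    pattern = re.compile(
        r"^\s*#\s*define\s+" + re.escape(macro) + r"\b", re.MULTILINE
    )
    return pattern.search(text) is not None


if __name__ == "__main__":
    import sys

    # Usage (paths are placeholders): python check_roce.py lite.h client.h
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            state = "enabled" if has_active_define(fh.read()) else "NOT enabled"
        print(f"{path}: LITE_ROCE {state}")
```

Running it on each node against lite.h and client.h should report the macro as enabled everywhere before the cluster is started.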

shinyehtsai commented 6 years ago

Cool. Just let me know if you run into other issues.