Open vsag96 opened 4 years ago
Hi. Thanks for reporting this issue.
Can you confirm whether ib_read_bw works over RoCE?
On the server I started with ib_read_bw and on the client with ib_read_bw with
The server side trace
RDMA_Read BW Test
Dual-port : OFF
Device : mlx5_0
Number of qps : 1
Transport type : IB
Connection type : RC
Using SRQ : OFF
PCIe relax order: ON
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0000 QPN 0x02c9 PSN 0x2df5d7 OUT 0x10 RKey 0x003572 VAddr 0x007f65b20e9000 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:04
remote address: LID 0000 QPN 0x0239 PSN 0x2d5540 OUT 0x10 RKey 0x003582 VAddr 0x007f1a3666e000 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:03
65536 1000 10222.53 6831.89 0.109310
On the client.
RDMA_Read BW Test
Dual-port : OFF
Device : mlx5_0
Number of qps : 1
Transport type : IB
Connection type : RC
Using SRQ : OFF
PCIe relax order: ON
TX depth : 128
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0000 QPN 0x0239 PSN 0x2d5540 OUT 0x10 RKey 0x003582 VAddr 0x007f1a3666e000 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:03
remote address: LID 0000 QPN 0x02c9 PSN 0x2df5d7 OUT 0x10 RKey 0x003572 VAddr 0x007f65b20e9000 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:04
Conflicting CPU frequency values detected: 1178.116000 != 800.186000. CPU Frequency is not max.
65536 1000 10222.53 6831.89 0.109310
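The GID fields in the ib_read_bw traces above can be turned into ordinary IP addresses, which makes it easier to see which GID table entry perftest picked. The following is a minimal sketch, assuming each colon-separated field is one byte printed in decimal (the 255:255:192:168 run suggests this). The helper name perftest_gid_to_ip is hypothetical and not part of perftest or eRPC.

```python
import ipaddress


def perftest_gid_to_ip(gid_text: str) -> ipaddress.IPv6Address:
    """Convert a GID printed as 16 colon-separated decimal bytes to an IPv6 address."""
    fields = gid_text.strip().split(":")
    if len(fields) != 16:
        raise ValueError(f"expected 16 byte fields, got {len(fields)}")
    # Assumption: every field is a single byte in decimal; bytes() rejects values > 255.
    raw = bytes(int(field, 10) for field in fields)
    return ipaddress.IPv6Address(raw)


server_gid = perftest_gid_to_ip("00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:04")
client_gid = perftest_gid_to_ip("00:00:00:00:00:00:00:00:00:00:255:255:192:168:01:03")

# ipv4_mapped is None unless the GID is an IPv4-mapped address (::ffff:a.b.c.d).
print(server_gid.ipv4_mapped)  # 192.168.1.4
print(client_gid.ipv4_mapped)  # 192.168.1.3
```

Under that assumption, both GIDs at index 3 are IPv4-mapped addresses, matching the 192.168.1.3 and 192.168.1.4 hosts that appear in the eRPC messages in this thread.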
Thanks for the details.
Could you please test if eRPC works on your setup with older Mellanox drivers (e.g., Mellanox OFED 4.4)? There have been a lot of recent NIC driver changes and I've not kept the code up to date.
I am aware that eRPC doesn't build anymore with the Raw transport with new Mellanox OFED versions (or rdma_core) because the ibverbs API has changed. I plan to fix this eventually but I'm not sure when I'll have the time.
Hi Dr. Kalia, I encountered the same issue. I have one 4-node cluster and one 2-node cluster. The former is equipped with CX5 NICs and the latter with CX4 NICs. eRPC runs well within each cluster but not between them.
I use the 4-node cluster as servers and the 2-node cluster as clients. On the client side I get
96:964338 WARNG: Rpc 0: Received connect response from [H: 10.0.0.40:31851, R: 0, S: XX] for session 0. Issue: Error [Routing resolution failure].
At first I thought it was due to invalid LIDs (ibv_devinfo shows that all ports' LIDs are 0, which is invalid), but since eRPC worked within each cluster, the zero LIDs were probably fine. Then I checked eRPC's source code and noticed that eRPC seemed unable to create the AH in IBTransport::create_ah. So I thought maybe the two clusters couldn't communicate using UD, but ib_send_bw -c UD and ib_read_bw both worked.
Could you give any advice for further troubleshooting?
Hi! The verbs address handle creation process is a bit complex, so it's likely I missed something in my implementation of create_ah. The implementation is different for RoCE and InfiniBand (see https://github.com/erpc-io/eRPC/blob/75e3015d17fa4693427487dbc783dc01249c36df/src/transport_impl/infiniband/ib_transport.cc#L74), so I assume you're passing -DROCE=on if you're using RoCE.
My suggestion to fix this would be to see how the perftest package implements address handle resolution, and use that information to try fixing eRPC's create_ah.
Hi Dr. Kalia,
Sorry for the late reply; I was on holiday and then spent some time tracking down the create_ah issue.
Thanks to your precise analysis, I was able to find that the resolution failure error is caused by mismatched GIDs. In https://github.com/erpc-io/eRPC/blob/d35a86dcf92757b77ff187f15f7bf67a4ebc0221/src/transport_impl/infiniband/ib_transport.cc#L18, kDefaultGIDIndex works most of the time, but unfortunately my two clusters have different NIC configurations. The default value therefore picks the wrong GID in one cluster when RoCE is enabled, so the two clusters fail to communicate.
ib_send_bw -c UD works because it requires users to supply a GID index and device ID, so it always gets the correct GID. Perhaps it would also be a good idea for eRPC to let users supply a valid GID index as an optional parameter?
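To choose a GID index by hand on each cluster, the GID table can be read from sysfs. The following is a rough sketch, not eRPC code. It assumes the usual Linux RDMA sysfs layout (gids/N and gid_attrs/types/N under /sys/class/infiniband/DEVICE/ports/PORT) and the type strings "IB/RoCE v1" and "RoCE v2". The function names list_gids and pick_gid_index are hypothetical.

```python
from pathlib import Path
import ipaddress


def list_gids(device, port=1, sysfs_root="/sys/class/infiniband"):
    """Return (index, address, type) for every populated GID table entry."""
    port_dir = Path(sysfs_root) / device / "ports" / str(port)
    gids_dir = port_dir / "gids"
    if not gids_dir.is_dir():
        return []  # device or port not present on this host
    entries = []
    gid_files = [p for p in gids_dir.iterdir() if p.name.isdigit()]
    for gid_file in sorted(gid_files, key=lambda p: int(p.name)):
        try:
            addr = ipaddress.IPv6Address(gid_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed slot
        if int(addr) == 0:
            continue  # all-zero GID means the slot is unused
        try:
            type_path = port_dir / "gid_attrs" / "types" / gid_file.name
            gid_type = type_path.read_text().strip()
        except OSError:
            gid_type = "unknown"
        entries.append((int(gid_file.name), addr, gid_type))
    return entries


def pick_gid_index(entries, want_type="RoCE v2", want_ipv4=True):
    """Return the first index with the wanted type and address family, or None."""
    for index, addr, gid_type in entries:
        is_ipv4 = addr.ipv4_mapped is not None
        if gid_type == want_type and is_ipv4 == want_ipv4:
            return index
    return None


if __name__ == "__main__":
    # mlx5_0 is the device name from the perftest traces in this thread.
    for index, addr, gid_type in list_gids("mlx5_0"):
        print(index, addr, gid_type)
```

Running this on one node of each cluster and comparing the output should show whether the same index refers to the same kind of GID on both sides. If it does not, a single compiled-in default cannot be right for both clusters.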
Hi,
OFED version: 5.0.2
NIC: Mellanox ConnectX-5
OS: Ubuntu 18
A Tofino switch does simple packet forwarding for us.
If I build with -DTRANSPORT=infiniband and -DROCE=on, the build succeeds, but when I run the hello world app I get the following error on the client side:
Received connect response from [H: 192.168.1.4:31850, R: 0, S: XX] for session 0. Issue: Error [Routing resolution failure]
The server receives the initial connect packet from the client. The client then segfaults, and the server prints the statement below in a loop. The error on the server is as follows.
Received connect request from [H: 192.168.1.3:31850, R: 0, S: 0]. Issue: Unable to resolve routing info [LID: 0, QPN: 449, GID interface ID 16601820732604482458, GID subnet prefix 33022]. Sending response.
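The two GID numbers in the server message above can be decoded into an address. The following is a minimal sketch, assuming the message prints the two 64-bit halves of the GID as native integers on a little-endian host while the GID itself is stored in network byte order. The helper name erpc_logged_gid_to_ip is hypothetical.

```python
import ipaddress


def erpc_logged_gid_to_ip(subnet_prefix: int, interface_id: int) -> ipaddress.IPv6Address:
    """Rebuild the 16-byte GID from the two integers printed in the log message."""
    # Assumption: each half is a big-endian field that was read as a
    # little-endian integer, so writing it back little-endian restores
    # the original byte order.
    raw = subnet_prefix.to_bytes(8, "little") + interface_id.to_bytes(8, "little")
    return ipaddress.IPv6Address(raw)


gid = erpc_logged_gid_to_ip(33022, 16601820732604482458)
print(gid)                # fe80::9a03:9bff:fe83:65e6
print(gid.is_link_local)  # True
print(gid.ipv4_mapped)    # None
```

If that assumption holds, the GID in the message is a link-local address rather than the IPv4-mapped 192.168.1.x GID that ib_read_bw reported at GID index 3, which would mean eRPC and perftest were using different GID table entries.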
In the README, you mention using -DTRANSPORT=raw for Mellanox NICs. I was not able to build with that flag (Error Trace).
We want to use eRPC over RoCEv2 + DCQCN. We are okay with IB, unless you tell us otherwise.
The RDMA devices are on according to rdma link and ibdev2netdev.