erpc-io / eRPC

Efficient RPCs for datacenter networks

RoCE support #10

Closed vsbenas closed 6 years ago

vsbenas commented 6 years ago

I see that ConnectX-3 NICs are supported for RoCE, with support for other NICs coming soon. How do I select RoCE?

I am trying to set up softRoCE to work with eRPC, and while I do realize it's still under development, I've had some success with minimal changes: https://github.com/vsbenas/eRPC/commit/2aa7d3c6c1b3c3db003910fdf3afd0b9b2343ee9

I launched the hello_world app, and the networking seems to work, but there is an error on the client side once both are launched:

42:025353 WARNG: Rpc 0: Received connect response from [H: 192.168.122.103:31850, R: 0, S: XX] for session 0. Issue: Error [Invalid remote Rpc ID].
Segmentation fault

anujkaliaiitd commented 6 years ago

The answer is a bit complex.

RoCE isn't "officially" supported as a top-level transport backend in eRPC, meaning that manual tweaking is needed to make it work. I now see that SoftRoCE allows users to try eRPC without expensive NICs, so adding RoCE support is on my to-do list.

I tested that eRPC works with SoftRoCE on a machine with Ubuntu 18.04 and Mellanox OFED 4.4.1, after applying the patch below. This machine does not have a RoCE-capable hardware NIC. eRPC also works over RoCE on a different machine with a ConnectX-4 Ethernet NIC.

diff --git a/src/transport_impl/infiniband/ib_transport.cc b/src/transport_impl/infiniband/ib_transport.cc
index 2a6ac79..270f522 100644
--- a/src/transport_impl/infiniband/ib_transport.cc
+++ b/src/transport_impl/infiniband/ib_transport.cc
@@ -138,7 +138,7 @@ void IBTransport::init_verbs_structs() {

   create_attr.cap.max_send_wr = kSQDepth;
   create_attr.cap.max_recv_wr = kRQDepth;
-  create_attr.cap.max_send_sge = 1;  // XXX: WHY DOES THIS WORK!!
+  create_attr.cap.max_send_sge = 2;
   create_attr.cap.max_recv_sge = 1;
   create_attr.cap.max_inline_data = kMaxInline;

diff --git a/src/transport_impl/infiniband/ib_transport.h b/src/transport_impl/infiniband/ib_transport.h
index 7de6433..cb83dc1 100644
--- a/src/transport_impl/infiniband/ib_transport.h
+++ b/src/transport_impl/infiniband/ib_transport.h
@@ -15,8 +15,8 @@ namespace erpc {
 class IBTransport : public Transport {
  public:
   // Transport-specific constants
-  static constexpr TransportType kTransportType = TransportType::kInfiniBand;
-  static constexpr size_t kMTU = 3840;  ///< Make (kRecvSize / 64) prime
+  static constexpr TransportType kTransportType = TransportType::kRoCE;
+  static constexpr size_t kMTU = 1024;  ///< Make (kRecvSize / 64) prime
   static constexpr size_t kRecvSize = (kMTU + 64);  ///< RECV size (with GRH)
   static constexpr size_t kRQDepth = kNumRxRingEntries;  ///< RECV queue depth
   static constexpr size_t kSQDepth = 128;                ///< Send queue depth

Steps to compile eRPC for RoCE after applying the patch:

  1. cmake . -DTRANSPORT=infiniband
  2. Change kHeadroom in src/config.h to 40 (see the sketch after this list)
  3. make
  4. sudo ctest
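
For step 2, a minimal sketch of the src/config.h edit; the exact declaration, type, and surrounding comments in the real file may differ:

// Hypothetical sketch of the src/config.h edit from step 2; eRPC's real
// declaration may look different.
static constexpr size_t kHeadroom = 40;  // 40 bytes of packet headroom for RoCE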

One test (destroy_session_test) fails on my machine. This happens because the SoftRoCE device that ships with Mellanox OFED has max_ah = 100, which is unreasonably tiny. On Mellanox hardware NICs, max_ah is over a billion.
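
If you want to check the limit on your own device, max_ah can be read with a plain verbs query. This is only an illustrative sketch (device selection and error handling kept minimal), not part of eRPC:

// Build with: g++ check_max_ah.cpp -libverbs
#include <infiniband/verbs.h>
#include <cstdio>

int main() {
  int num_devices = 0;
  ibv_device **dev_list = ibv_get_device_list(&num_devices);
  if (dev_list == nullptr || num_devices == 0) {
    std::fprintf(stderr, "No RDMA devices found\n");
    return 1;
  }

  // Open the first device (e.g., rxe0 for SoftRoCE)
  ibv_context *ctx = ibv_open_device(dev_list[0]);
  ibv_device_attr attr;
  if (ctx != nullptr && ibv_query_device(ctx, &attr) == 0) {
    // Roughly 100 for the OFED SoftRoCE driver; much larger on Mellanox hardware NICs
    std::printf("%s: max_ah = %d\n", ibv_get_device_name(dev_list[0]), attr.max_ah);
  }

  if (ctx != nullptr) ibv_close_device(ctx);
  ibv_free_device_list(dev_list);
  return 0;
}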

vsbenas commented 6 years ago

Thank you very much! I implemented your suggestions and tried the hello_world app. The client now exits when the server is online, so it's still not fully functional.

I tried the tests too: destroy_session_test freezes for me, and if I remove destroy_session_test, all the tests after req_in_cont_func_test fail. After running sudo ctest, all my huge pages seem to get used up:

HugePages_Total: 512
HugePages_Free: 0

Maybe a memory leak?

anujkaliaiitd commented 6 years ago

The huge pages get leaked only if the code (in this case a test) crashes. scripts/utils.sh has a drop_shm command that frees up the huge pages. LMK if any test (except destroy_session_test) fails even when starting with 512 huge pages.

Does the hello_world client print "hello" before exiting?

vsbenas commented 6 years ago

Two tests fail for me: large_msg_test and req_in_req_func_test. All the other tests run fine. Here is the verbose output of the failing tests: https://pastebin.com/BTQLgvQa

As for hello_world, sometimes the client just exits, and sometimes it outputs an error:

b@b-Standard-PC-i440FX-PIIX-1996:~/erpc/hello_world$ sudo ./client
84:145824 WARNG: eRPC Nexus: Testing enabled. Perf will be low.
84:208422 WARNG: Modded driver unavailable. Performance will be low.
b@b-Standard-PC-i440FX-PIIX-1996:~/erpc/hello_world$ sudo ./client
88:192761 WARNG: eRPC Nexus: Testing enabled. Perf will be low.
88:195690 WARNG: Modded driver unavailable. Performance will be low.
91:180961 WARNG: Rpc 0: Received connect response from [H: 192.168.122.14:31850, R: 0, S: XX] for session 0. Issue: Error [Invalid remote Rpc ID].
Segmentation fault

There is no output on the server side besides the warnings. I'm using Ubuntu 18.04 on a KVM virtual machine with SoftRoCE installed.

b@b-Standard-PC-i440FX-PIIX-1996:~/erpc/hello_world$ rxe_cfg
  Name  Link  Driver  Speed  NMTU  IPv4_addr  RDEV  RMTU          
  ens3  yes   e1000                           rxe0  1024  (3)

anujkaliaiitd commented 6 years ago

Thanks for the detailed trace. I have some fixes below that I wish to implement, but it might take a while due to upcoming deadlines.

vsbenas commented 6 years ago

There is no indication when the server starts; I think both programs hang on the new erpc::Rpc line.

server:

a@a-Standard-PC-i440FX-PIIX-1996:~/erpc/hello_world$ sudo ./server
96:016500 WARNG: eRPC Nexus: Testing enabled. Perf will be low.
96:016537 INFOR: eRPC Nexus: Launching 0 background threads.
96:016548 INFOR: eRPC Nexus: Launching session management thread on core 1.
96:016710 INFOR: eRPC Nexus: Created with management UDP port 31850, hostname 192.168.122.14.
96:041738 INFOR: Port 0 resolved to device rxe0, port 1. Speed = 2.50 Gbps.
96:042316 WARNG: Modded driver unavailable. Performance will be low.
96:042354 INFOR: IBTransport created for ID 0. Device rxe0, port 1.
96:049030 INFOR: Registered 6 MB (lkey = 2564)
96:078183 INFOR: Registered 16 MB (lkey = 2824)
96:078303 INFOR: Rpc 0 created. eRPC TID = 0.

client:

b@b-Standard-PC-i440FX-PIIX-1996:~/erpc/hello_world$ sudo ./client
2:746192 WARNG: eRPC Nexus: Testing enabled. Perf will be low.
2:746211 INFOR: eRPC Nexus: Launching 0 background threads.
2:746217 INFOR: eRPC Nexus: Launching session management thread on core 1.
2:746315 INFOR: eRPC Nexus: Created with management UDP port 31850, hostname 192.168.122.103.
2:747562 INFOR: Port 0 resolved to device rxe0, port 1. Speed = 2.50 Gbps.
2:747901 WARNG: Modded driver unavailable. Performance will be low.
2:747914 INFOR: IBTransport created for ID 0. Device rxe0, port 1.
2:748601 INFOR: Registered 6 MB (lkey = 3467)
2:753193 INFOR: Registered 16 MB (lkey = 3606)
2:753293 INFOR: Rpc 0 created. eRPC TID = 0.

The trace files for both of them are empty. Edit: after a minute or so the server prints out this too:

81:319790 INFOR: eRPC Nexus: Destroying Nexus.
81:326806 INFOR: eRPC Nexus: Session management thread exiting.
81:327023 WARNG: Rpc: Deleting Nexus, but a worker is still registered
server: /home/a/erpc/src/nexus_impl/nexus.cc:93: erpc::Nexus::~Nexus(): Assertion `false' failed.
Aborted

anujkaliaiitd commented 6 years ago

If the "Rpc 0 created" line is printed but the constructor does not return, I suspect that the wheel->catchup() call in rpc.cc gets stuck. Are you using the lastest eRPC code? I pushed a fix for an issue like this a few days ago.

The issue above is unlikely since most of the tests are passing. The constructors do return in the tests, so it's weird they aren't returning in the app. Can you check if the tests pass on both the client and the server VMs?

The timing wheel is part of eRPC's congestion control subsystem. Can you check if disabling congestion control helps? This can be done by setting kEnableCc and kEnableCcOpts to false in tweakme.h.
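
For reference, a minimal sketch of what that tweakme.h change might look like; the exact definitions and surrounding code in eRPC may differ:

// Hypothetical sketch of the tweakme.h flags mentioned above; the real
// definitions in eRPC may use different types or comments.
static constexpr bool kEnableCc = false;      // Disable congestion control
static constexpr bool kEnableCcOpts = false;  // Disable congestion control optimizations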

vsbenas commented 6 years ago

Okay, I've updated to the newest version, and the constructor definitely returns now. The tests run on both machines. The current output is below.

server:

a@a-Standard-PC-i440FX-PIIX-1996:~/erpc/hello_world$ sudo ./server
70:021850 WARNG: eRPC Nexus: Testing enabled. Perf will be low.
70:021875 INFOR: eRPC Nexus: Launching 0 background threads.
70:021881 INFOR: eRPC Nexus: Launching session management thread on core 1.
70:022009 INFOR: eRPC Nexus: Created with management UDP port 31850, hostname 192.168.122.14.
70:023571 INFOR: Port 0 resolved to device rxe0, port 1. Speed = 2.50 Gbps.
70:023904 WARNG: Modded driver unavailable. Performance will be low.
70:023915 INFOR: IBTransport created for ID 0. Device rxe0, port 1.
70:025981 INFOR: Registered 6 MB (lkey = 46944)
70:036545 INFOR: Registered 16 MB (lkey = 47296)
70:036667 INFOR: Rpc 0 created. eRPC TID = 0.
74:414762 INFOR: eRPC Nexus: Received SM packet [Connect request], [No error], client: [H: 192.168.122.103:31850, R: 0, S: 0], server: [H: 192.168.122.14:31850, R: 0, S: XX]
74:415048 INFOR: Rpc 0: Received connect request from [H: 192.168.122.103:31850, R: 0, S: 0]. Issue: None. Sending response.
74:415059 INFOR: Rpc 0: Sending packet [Connect response], [No error], client: [H: 192.168.122.103:31850, R: 0, S: 0], server: [H: 192.168.122.14:31850, R: 0, S: 0].

Again, the client does not output "hello"; I tried increasing the event loop duration, but no luck.

client:

b@b-Standard-PC-i440FX-PIIX-1996:~/erpc/hello_world$ sudo ./client
74:429561 WARNG: eRPC Nexus: Testing enabled. Perf will be low.
74:429583 INFOR: eRPC Nexus: Launching 0 background threads.
74:429588 INFOR: eRPC Nexus: Launching session management thread on core 1.
74:429676 INFOR: eRPC Nexus: Created with management UDP port 31850, hostname 192.168.122.103.
74:431395 INFOR: Port 0 resolved to device rxe0, port 1. Speed = 2.50 Gbps.
74:431896 WARNG: Modded driver unavailable. Performance will be low.
74:431928 INFOR: IBTransport created for ID 0. Device rxe0, port 1.
74:433506 INFOR: Registered 6 MB (lkey = 3372)
74:437721 INFOR: Registered 16 MB (lkey = 3672)
74:437819 INFOR: Rpc 0 created. eRPC TID = 0.
74:438043 INFOR: Rpc 0: Sending packet [Connect request], [No error], client: [H: 192.168.122.103:31850, R: 0, S: 0], server: [H: 192.168.122.14:31850, R: 0, S: XX].
74:439026 INFOR: eRPC Nexus: Received SM packet [Connect response], [No error], client: [H: 192.168.122.103:31850, R: 0, S: 0], server: [H: 192.168.122.14:31850, R: 0, S: 0]
74:439167 INFOR: Rpc 0: Received connect response from [H: 192.168.122.14:31850, R: 0, S: 0] for session 0. Issue: None. Session connected.
74:539190 INFOR: Destroying Rpc 0.
74:539236 INFOR: Deregistered 6 MB (lkey = 3372)
74:539372 INFOR: Deregistered 16 MB (lkey = 3672)
74:539400 INFOR: Destroying transport for ID 0
74:539526 INFOR: eRPC Nexus: Deregistering Rpc 0.
74:539559 INFOR: eRPC Nexus: Destroying Nexus.
74:563323 INFOR: eRPC Nexus: Session management thread exiting.

client trace file:

74:439189 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:444344 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:444387 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:450575 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:450638 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:455563 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:455609 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:460561 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:460594 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:465673 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:465718 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:472233 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:475399 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:477544 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:477705 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:482545 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:482594 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:487546 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:487594 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:492552 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:492616 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:497542 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:497576 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:502533 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:502551 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:507534 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:507554 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:512548 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:512588 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:517535 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:517553 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:522536 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:522554 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:527537 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:527555 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:532947 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:532984 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].
74:538388 REORD: Rpc 0, lsn 0 (192.168.122.14): Pkt loss suspected for req 8 ([num_tx 1, num_rx 0]). Action: Retransmitting requests.
74:538476 TRACE: Rpc 0, lsn 0 (192.168.122.14): TX [type REQ, dsn 0, reqn 8, pktn 0, msz 16]. Slot [num_tx 1, num_rx 0].

anujkaliaiitd commented 6 years ago

It seems that packets from the client are not reaching the server. I assume that the server's trace file is empty.

Do RDMA tests like ib_read_bw work between the client and the server?

vsbenas commented 6 years ago

Yes, the server's trace is empty. rping and ib_read_bw work both ways between the machines.

Have you tested out hello_world on virtual machines?

anujkaliaiitd commented 6 years ago

Thanks. I was able to replicate this issue with VMs.

The problem seems to be in how I'm using ibv_query_gid. perftest does some convoluted magic around ibv_query_gid, which I will need to port. This fix will take time (my ETA is end-of-October).

Here's a link to how perftest wraps around ibv_query_gid in case you want to try a fix: link.
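
This is not eRPC's eventual fix, but to illustrate the kind of GID handling perftest does: instead of assuming a fixed GID index, one can scan the port's GID table with ibv_query_gid and pick a non-empty entry, then use that index when creating address handles. The function and variable names below are illustrative:

#include <infiniband/verbs.h>

// Return the index of the first non-empty GID on the given port (writing the
// GID to out_gid), or -1 on failure. A fuller implementation, like perftest's,
// also filters entries by RoCE version and address family; that is omitted here.
static int find_usable_gid(ibv_context *ctx, uint8_t port, ibv_gid *out_gid) {
  ibv_port_attr port_attr;
  if (ibv_query_port(ctx, port, &port_attr) != 0) return -1;

  for (int i = 0; i < port_attr.gid_tbl_len; i++) {
    ibv_gid gid;
    if (ibv_query_gid(ctx, port, i, &gid) != 0) continue;

    // Skip empty GID table entries
    if (gid.global.subnet_prefix == 0 && gid.global.interface_id == 0) continue;

    *out_gid = gid;
    return i;
  }
  return -1;
}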

vsbenas commented 6 years ago

Good to hear! I'm quite new to RDMA, so I'd rather leave the job to the experts :)

anujkaliaiitd commented 6 years ago

I have pushed fixes for this issue. In the latest version, hello_world works between two VMs on AWS.

To try it out, build the latest eRPC with cmake . -DPERF=OFF -DTRANSPORT=infiniband -DROCE=on. The patch listed above is no longer needed.

vsbenas commented 6 years ago

Can confirm it works! Great job.