erpc-io / eRPC

Efficient RPCs for datacenter networks

Low performance #75

Closed: ljishen closed this issue 2 years ago

ljishen commented 2 years ago

I ran the server_rate app over RoCE between two machines connected with 100 Gbps NICs (BlueField-2 DPUs).

I compiled the program with cmake .. -DPERF=ON -DTRANSPORT=infiniband -DROCE=ON. The app parameters for the server side are:

10.10.2.1 31850 1
--test_ms 200000
--sm_verbose 1
--num_server_threads 1
--num_client_threads 1
--window_size 32
--req_size 32768
--resp_size 32
--num_processes 2
--process_id 0
--numa_node 1
--numa_0_ports 0
--numa_1_ports 0

[screenshot: server_rate output showing roughly 70,000 requests per second]

Is that the performance we can expect with a single core? Thanks!

anujkaliaiitd commented 2 years ago

This looks a bit low. It's around 18 Gbps (70000 requests per second, 32 KB per request), which is close to what we've measured with just one request outstanding (Figure 6 in the NSDI paper). But your test has a window size of 32.
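
For context, here is the back-of-the-envelope arithmetic behind that 18 Gbps figure as a standalone C++ snippet (the 70,000 requests/second value is taken from the screenshot above; this is not eRPC code):

#include <cstdio>

int main() {
  // Goodput = requests/second * request payload (bytes) * 8 bits/byte.
  const double requests_per_sec = 70000.0;  // observed server_rate rate
  const double req_size_bytes   = 32768.0;  // matches --req_size 32768
  const double goodput_gbps = requests_per_sec * req_size_bytes * 8.0 / 1e9;
  std::printf("Goodput ~= %.1f Gbps\n", goodput_gbps);  // prints ~18.3 Gbps
  return 0;
}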

I'd suggest using the large_rpc_tput application instead for this test, since I've tested it more thoroughly for bulk transfers. I use server_rate only for rate measurements of small RPCs.

ljishen commented 2 years ago

Thanks for your response. I'll try to run large_rpc_tput and let you know the results.

ljishen commented 2 years ago

With the following configuration for large_rpc_tput, I can only get ~12.5 Gbps out of the 100 Gbps NIC.

10.10.1.2 31850 1
--test_ms 30000
--req_size 32768
--resp_size 32
--num_processes 2
--num_proc_0_threads 1
--num_proc_other_threads 1
--concurrency 1
--drop_prob 0.0
--profile incast
--throttle 0
--throttle_fraction 0.9
--numa_0_ports 0
--numa_1_ports 0
--numa_node 1
--process_id 0
large_rpc_tput: Thread 0: Creating 1 session to proc 0, thread 0.
large_rpc_tput: Thread 0: All sessions connected.
large_rpc_tput: Thread 0: Tput {RX 0.01 (47536), TX 12.46 (47537)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {21.0 50th, 23.2 99th, 25.1 99.9th}. Timely rate 100.0 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.01 (47541), TX 12.46 (47541)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {21.0 50th, 23.1 99th, 25.0 99.9th}. Timely rate 100.0 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.01 (47578), TX 12.47 (47578)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {21.0 50th, 23.0 99th, 24.8 99.9th}. Timely rate 100.0 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.01 (47793), TX 12.53 (47793)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {20.9 50th, 22.8 99th, 25.3 99.9th}. Timely rate 100.0 Gbps. Credits 32 (best = 32).

What do you suggest changing or investigating to improve the performance? Is the code not optimized for the ConnectX-6?

Thanks.

anujkaliaiitd commented 2 years ago

Two suggestions:

1. Increase --concurrency so that more than one request is outstanding per thread.
2. Use a larger MTU for the RoCE transport:

-static constexpr size_t kMTU = kIsRoCE ? 1024 : 3840;
+static constexpr size_t kMTU = kIsRoCE ? 3840 : 3840;
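
As a sanity check before raising kMTU, the port's active MTU can be read with the standard libibverbs API. A minimal sketch, assuming the first device and port 1 (plain verbs code, not part of eRPC):

#include <infiniband/verbs.h>
#include <cstdio>

int main() {
  int num_devices = 0;
  ibv_device **dev_list = ibv_get_device_list(&num_devices);
  if (dev_list == nullptr || num_devices == 0) return 1;

  ibv_context *ctx = ibv_open_device(dev_list[0]);  // first device, for illustration
  if (ctx == nullptr) return 1;

  ibv_port_attr port_attr;
  if (ibv_query_port(ctx, 1, &port_attr) != 0) return 1;  // port number 1

  // active_mtu is an enum: IBV_MTU_1024 = 3, ..., IBV_MTU_4096 = 5
  std::printf("active_mtu enum value = %d\n", static_cast<int>(port_attr.active_mtu));

  ibv_close_device(ctx);
  ibv_free_device_list(dev_list);
  return 0;
}

(Equivalently, the ibv_devinfo command-line tool reports active_mtu.)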
ljishen commented 2 years ago

I increased the concurrency to 32 and the MTU to 4096 (the same as the active_mtu), and got the following results with one core sending and one core receiving:

[screenshot: throughput results for r7525-r7525 and r7525-local_bf2]

r7525 is the name of a host server, and local_bf2 is the BlueField-2 card attached to that host. r7525-r7525 is the throughput between two hosts, and r7525-local_bf2 is the throughput between a host and its local BlueField-2. I think the peak number is still lower than what I expected, given that these are fairly powerful hosts.
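
For anyone reproducing this, the two changes described above were presumably (a) raising --concurrency from 1 to 32 in the large_rpc_tput flags, and (b) a kMTU edit along the lines of the earlier diff but with 4096, e.g.:

-static constexpr size_t kMTU = kIsRoCE ? 1024 : 3840;
+static constexpr size_t kMTU = kIsRoCE ? 4096 : 3840;

(This is a reconstruction from the thread, not the actual patch.)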

anujkaliaiitd commented 2 years ago

Thanks for sharing the results. I think this throughput is close to what I expect eRPC to achieve with one core.

Achieving even higher per-core throughput (e.g., 100-200 Gbps) is an interesting research topic. Techniques like segmentation/receive offloads, faster memcpy (e.g., with the upcoming DSA accelerators, or clever page remapping), and faster packet I/O (e.g., with faster buses than PCIe) could help.

ljishen commented 2 years ago

I see. Thanks for your interesting feedback.