ljishen closed this issue 2 years ago.
This looks a bit low. It's around 18 Gbps (70000 requests per second, 32 KB per request), which is close to what we've measured with just one request outstanding (Figure 6 in the NSDI paper). But your test has a window size of 32.
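For reference, that estimate works out to 70,000 requests/s × 32,768 B/request × 8 bit/B ≈ 1.8 × 10^10 bit/s ≈ 18.3 Gbps.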
I'd suggest using the large_rpc_tput application instead for this test, since I've tested that one more for bulk transfers. I use server_rate only for rate measurements with small RPCs.
Thanks for your response. I'll try to run large_rpc_tput and let you know the results.
With the following configuration for large_rpc_tput, I can only get ~12.5 Gbps out of the 100 Gbps NIC.
10.10.1.2 31850 1
--test_ms 30000
--req_size 32768
--resp_size 32
--num_processes 2
--num_proc_0_threads 1
--num_proc_other_threads 1
--concurrency 1
--drop_prob 0.0
--profile incast
--throttle 0
--throttle_fraction 0.9
--numa_0_ports 0
--numa_1_ports 0
--numa_node 1
--process_id 0
large_rpc_tput: Thread 0: Creating 1 session to proc 0, thread 0.
large_rpc_tput: Thread 0: All sessions connected.
large_rpc_tput: Thread 0: Tput {RX 0.01 (47536), TX 12.46 (47537)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {21.0 50th, 23.2 99th, 25.1 99.9th}. Timely rate 100.0 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.01 (47541), TX 12.46 (47541)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {21.0 50th, 23.1 99th, 25.0 99.9th}. Timely rate 100.0 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.01 (47578), TX 12.47 (47578)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {21.0 50th, 23.0 99th, 24.8 99.9th}. Timely rate 100.0 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.01 (47793), TX 12.53 (47793)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {20.9 50th, 22.8 99th, 25.3 99.9th}. Timely rate 100.0 Gbps. Credits 32 (best = 32).
What would you suggest changing or investigating to improve the performance? Is the code not optimized for ConnectX-6?
Thanks.
Two suggestions:

1. Increase --concurrency, else the benchmark keeps only one RPC in flight, which makes bandwidth latency-bound.
2. Increase the RoCE MTU in ib_transport.h:

-static constexpr size_t kMTU = kIsRoCE ? 1024 : 3840;
+static constexpr size_t kMTU = kIsRoCE ? 3840 : 3840;
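For intuition on why --concurrency 1 lands near the ~12.5 Gbps you saw: with one RPC in flight, at most one 32 KB request completes per round trip. Here is a rough back-of-envelope sketch (my own arithmetic, not eRPC code; the 21 us figure is the median RPC latency printed in your log above):

```cpp
// Sketch: latency-bound throughput with a single outstanding RPC.
// Assumes one 32 KB request completes per ~21 us round trip.
#include <cstdio>

int main() {
  const double req_bytes = 32768.0;  // --req_size from the config above
  const double rtt_us = 21.0;        // median RPC latency from the log above
  const double gbps = req_bytes * 8.0 / (rtt_us * 1e-6) / 1e9;
  std::printf("Latency-bound throughput: %.1f Gbps\n", gbps);  // ~12.5 Gbps
  // Raising --concurrency to k lets up to k RPCs overlap, lifting this bound
  // roughly k-fold until the NIC, PCIe, or the single CPU core saturates.
  return 0;
}
```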
I increased the concurrency to 32, and the MTU to 4096 (same as the active_mtu), and got the following results with one core for sending and one core for receiving:
r7525 is the name of a host server, and local_bf2 is the BlueField-2 card locally attached to that host. r7525-r7525 is the throughput between two hosts, and r7525-local_bf2 is the throughput between the host and its local BlueField-2. I think the peak number is still lower than what I expected, given fairly powerful hosts.
Thanks for sharing the results. I think this throughput is close to what I expect eRPC to achieve with one core.
Achieving even higher per-core throughput (e.g., 100-200 Gbps) is an interesting research topic. Techniques like segmentation/receive offloads, faster memcpy (e.g., with the upcoming DSA accelerators, or clever page remapping), and faster packet I/O (e.g., with faster buses than PCIe) could help.
I see. Thanks for your interesting feedback.
I ran the server_rate app over RoCE between two machines connected using 100 Gbps NICs (BlueField-2 DPUs). I compiled the program with cmake .. -DPERF=ON -DTRANSPORT=infiniband -DROCE=ON. The app parameters for the server side are [...]. Is that the performance we can expect with a single core? Thanks!