Hi @ethercflow

The error suggests that `ibverbs-providers` and the other packages that install libmlx5 haven't been installed.

Both the RoCE and DPDK implementations can get good performance, though we need ~2 cores to reach 100 Gbps. For high bandwidth, increasing the MTU in eRPC's source will improve performance significantly, e.g., https://github.com/erpc-io/eRPC/blob/094c17c3cd9b48bcfbed63f455cc85b9976bd43f/src/transport_impl/dpdk/dpdk_transport.h#L191 and https://github.com/erpc-io/eRPC/blob/094c17c3cd9b48bcfbed63f455cc85b9976bd43f/src/transport_impl/dpdk/dpdk_transport.h#L210. For the latter line, we can use `kMTU + 1024` instead of the hard-coded `2048`.
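For concreteness, a minimal sketch of what those two constants in `src/transport_impl/dpdk/dpdk_transport.h` might look like after the change; the declaration forms, the `8192` value, and the cast type below are assumptions based on the linked lines, not verbatim source:

```cpp
#include <cstddef>     // size_t
#include <cstdint>     // uint16_t
#include <rte_mbuf.h>  // struct rte_mbuf, RTE_PKTMBUF_HEADROOM

// ~L191: a larger packet MTU for high-bandwidth links (sketch value).
static constexpr size_t kMTU = 8192;

// ~L210: size the mbuf data room from kMTU plus some spare bytes,
// instead of the hard-coded 2048.
static constexpr size_t kMbufSize =
    static_cast<uint16_t>(sizeof(struct rte_mbuf)) + RTE_PKTMBUF_HEADROOM + kMTU + 1024;
```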
Thanks for your reply, it's very helpful to me. :)
Hi @anujkaliaiitd, sorry to bother you. I have another question: our IDC's NICs and switches use an MTU of 9000. What is the relationship between the value of `kMTU` and 9000? What is the maximum value that `kMTU` can be set to?
Hi. You could try:

- `kMTU = 8192`
- `kMbufSize = static_cast<uint16_t>(sizeof(struct rte_mbuf)) + RTE_PKTMBUF_HEADROOM + 8192;`

You could tweak the 8192 value a bit, e.g., 8900 could work. I don't recall the exact packet structure or padding needed, but 1000/100 spare bytes should be more than enough.
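As a rough sanity check on why ~1000 or even ~100 spare bytes are plenty under a 9000-byte link MTU, here is a sketch of the frame budget; the 16-byte eRPC per-packet header size below is an assumption, not taken from the source:

```cpp
#include <cstddef>

// Frame budget for kMTU = 8192 on a 9000-byte NIC/switch MTU (sketch).
constexpr std::size_t kLinkMtu      = 9000;         // NIC/switch MTU from the question
constexpr std::size_t kEthIpUdpHdrs = 14 + 20 + 8;  // Ethernet + IPv4 + UDP headers
constexpr std::size_t kErpcHdrGuess = 16;           // assumed eRPC per-packet header size

// 8192 + 42 + 16 = 8250 <= 9000, and even 8900 + 42 + 16 = 8958 still fits.
static_assert(8192 + kEthIpUdpHdrs + kErpcHdrGuess <= kLinkMtu,
              "kMTU = 8192 leaves ample header room in a 9000-byte frame");
```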
@ankalia Thanks a lot. I used 8192 with `large_rpc_tput` to benchmark throughput with these parameters:
--test-ms 3000000
--req-size 1048576
--resp-size 32
--num-processes 2
--num-proc-0-threads 1
--num-proc-other-threads 1
--concurrency 1
--drop-prob 0.0
--profile incast
--throttle 0
--throttle-fraction 0.9
--numa-0-ports 2
--numa-1-ports 3
Got:
large_rpc_tput: Thread 0: Tput {RX 0.00 (4857), TX 40.75 (4857)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {196.1 50th, 389.0 99th, 697.6 99.9th}. Timely rate 71.8 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4758), TX 39.92 (4758)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {196.8 50th, 451.8 99th, 783.1 99.9th}. Timely rate 16.0 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4925), TX 41.32 (4925)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {195.2 50th, 346.3 99th, 750.1 99.9th}. Timely rate 53.1 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4871), TX 40.86 (4871)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {196.4 50th, 379.5 99th, 757.2 99.9th}. Timely rate 34.3 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4934), TX 41.39 (4934)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {192.1 50th, 417.9 99th, 791.3 99.9th}. Timely rate 87.3 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (5007), TX 42.01 (5007)} Gbps (IOPS). Retransmissions 0. Packet RTTs: {-1.0, -1.0} us. RPC latency {192.1 50th, 340.1 99th, 710.7 99.9th}. Timely rate 90.7 Gbps. Credits 32 (best = 32).
If I increase `--concurrency` to 2, retransmissions happen:
large_rpc_tput: Thread 0: Tput {RX 0.00 (4927), TX 41.33 (4927)} Gbps (IOPS). Retransmissions 199. Packet RTTs: {-1.0, -1.0} us. RPC latency {193.0 50th, 402.9 99th, 660.2 99.9th}. Timely rate 78.2 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4438), TX 37.23 (4438)} Gbps (IOPS). Retransmissions 198. Packet RTTs: {-1.0, -1.0} us. RPC latency {203.8 50th, 569.4 99th, 894.3 99.9th}. Timely rate 42.6 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4510), TX 37.84 (4510)} Gbps (IOPS). Retransmissions 199. Packet RTTs: {-1.0, -1.0} us. RPC latency {198.0 50th, 583.6 99th, 892.4 99.9th}. Timely rate 41.9 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4519), TX 37.91 (4519)} Gbps (IOPS). Retransmissions 199. Packet RTTs: {-1.0, -1.0} us. RPC latency {202.0 50th, 531.9 99th, 881.6 99.9th}. Timely rate 59.0 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4639), TX 38.92 (4639)} Gbps (IOPS). Retransmissions 199. Packet RTTs: {-1.0, -1.0} us. RPC latency {196.4 50th, 498.1 99th, 793.3 99.9th}. Timely rate 61.7 Gbps. Credits 32 (best = 32).
Increasing `--concurrency` to 16, retransmissions become more serious:
large_rpc_tput: Thread 0: Tput {RX 0.00 (4503), TX 37.78 (4503)} Gbps (IOPS). Retransmissions 1391. Packet RTTs: {-1.0, -1.0} us. RPC latency {1931.0 50th, 3071.7 99th, 3455.7 99.9th}. Timely rate 81.8 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4414), TX 37.03 (4414)} Gbps (IOPS). Retransmissions 1389. Packet RTTs: {-1.0, -1.0} us. RPC latency {1948.4 50th, 3182.2 99th, 3711.9 99.9th}. Timely rate 35.6 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4473), TX 37.53 (4473)} Gbps (IOPS). Retransmissions 1390. Packet RTTs: {-1.0, -1.0} us. RPC latency {1943.4 50th, 3070.6 99th, 3344.7 99.9th}. Timely rate 48.8 Gbps. Credits 32 (best = 32).
large_rpc_tput: Thread 0: Tput {RX 0.00 (4570), TX 38.34 (4570)} Gbps (IOPS). Retransmissions 1388. Packet RTTs: {-1.0, -1.0} us. RPC latency {1902.9 50th, 3003.2 99th, 4543.0 99.9th}. Timely rate 80.3 Gbps. Credits 32 (best = 32).
Judging from these results, mine are not as good as the test results in the post I linked. What suggestions do you have here? Did I miss something?
Besides that, I want to create more sessions on one Rpc object, and I have two questions about this:
Thanks a lot for your help! cc @anujkaliaiitd
Hi, thanks for sharing the numbers. It's great to know that the code works through KVM and on CentOS :).
To tune single-flow tput, you can allow more credits per flow. That's a static limit on the number of outstanding packets per flow, which can limit perf. https://github.com/erpc-io/eRPC/blob/094c17c3cd9b48bcfbed63f455cc85b9976bd43f/src/sm_types.h#L11
You can also try disabling congestion control by setting this to false: https://github.com/erpc-io/eRPC/blob/094c17c3cd9b48bcfbed63f455cc85b9976bd43f/src/tweakme.h#L16
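If it helps, here is a rough sketch of the two compile-time knobs those links point at; the exact names and defaults in `src/sm_types.h` and `src/tweakme.h` may differ, and `kEnableCongestionControl` in particular is a placeholder name:

```cpp
#include <cstddef>

// src/sm_types.h (around the linked line): per-session credit limit, i.e. the
// static cap on outstanding packets per flow. The benchmark output above
// reports "Credits 32", so 32 is the current value; raising it can help
// single-flow throughput.
static constexpr size_t kSessionCredits = 32;  // e.g., try 64 or 128

// src/tweakme.h (around the linked line): compile-time congestion-control
// switch. Placeholder name; set the real flag to false to disable CC.
static constexpr bool kEnableCongestionControl = false;
```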
I don't know why you're seeing packet loss, which shouldn't happen with a single flow. You could try building eRPC with `DLOG_LEVEL=trace` and sifting through the timeout logs output in `/tmp`.
The post you've linked to seems to use bare-metal NICs, whereas you're using KVM. I assume this can affect performance.
You can create lots of sessions with one Rpc object without changing `kNumRxRingEntries`. If `create_session` starts to fail, you can try bumping the RX ring entries. Unused sessions don't affect performance.
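A minimal client-side sketch of that pattern, loosely modeled on eRPC's hello_world example (the URIs, port, RPC IDs, and session count here are placeholders):

```cpp
#include <vector>
#include "rpc.h"  // erpc::Nexus, erpc::Rpc, erpc::CTransport

// Empty session-management handler, as in the hello_world client.
void sm_handler(int, erpc::SmEventType, erpc::SmErrType, void *) {}

int main() {
  erpc::Nexus nexus("client-host:31850", 0 /* numa node */, 0 /* bg threads */);
  erpc::Rpc<erpc::CTransport> rpc(&nexus, nullptr /* context */,
                                  0 /* rpc_id */, sm_handler);

  // Many sessions from a single Rpc object; kNumRxRingEntries stays untouched.
  std::vector<int> sessions;
  for (int i = 0; i < 64; i++) {
    int sn = rpc.create_session("server-host:31850", 0 /* remote rpc_id */);
    if (sn < 0) break;  // if this starts failing, consider bumping the RX ring entries
    sessions.push_back(sn);
  }

  // Drive the event loop until every session is connected.
  for (int sn : sessions) {
    while (!rpc.is_connected(sn)) rpc.run_event_loop_once();
  }
  return 0;
}
```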
My environment:
- SR-IOV; one is 100Gb/s, the other one is 25Gb/s
- rdma_core version and DPDK version
- CentOS Linux release 8.5.2111 / 4.18.0-348.7.1.el8_5.x86_64
I ran into two problems. First, I chose the DPDK path with `cmake . -DPERF=OFF -DTRANSPORT=dpdk`. Compilation went fine, but I have problems running `hello_server`.

Then, when I use the RoCE pattern with `cmake . -DPERF=off -DTRANSPORT=infiniband -DROCE=on`, hello_client & hello_server can be run normally, but I get the warning: `Modded driver unavailable. Performance will be low.`
My goal is to hit the full 100Gb bandwidth in the production environment. So my two questions are:

1. How to resolve `/lib64/libmlx5.so.1: version 'MLX5_1.21' not found`;
2. How to fix `Modded driver unavailable. Performance will be low` to get the best performance.

@anujkaliaiitd Looking forward to your reply, thank you very much!