erpc-io / eRPC

Efficient RPCs for datacenter networks

Getting lower throughput than expected #32

Closed: remajin closed this issue 4 years ago

remajin commented 5 years ago

Hi, I'm running the provided small_rpc_tput app, but I'm not getting the result that I expected based on the paper. Below is a screenshot from a run on two machines with 11 threads each. The batch size is 3 and the concurrency is 60.

[screenshot: result]
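For reference, the run above corresponds roughly to the following settings; the flag names are inferred from this thread and may not match the actual small_rpc_tput config exactly:

```sh
# Hypothetical sketch of the small_rpc_tput settings used in this run
# (flag names inferred from the discussion; check the app's config file for the real ones)
--num_threads 11    # worker threads per machine
--batch_size 3      # requests issued per batch
--concurrency 60    # outstanding requests
```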

Each thread is serving about half a million requests per second, which is much lower than the expected 5 million. Can you please share any insight into why I'm getting such low throughput?

I'm running this experiment on Ubuntu 18.04 (kernel 4.15.0) on machines with Intel(R) Xeon(R) Gold 5120 CPUs @ 2.2 GHz and 93 GB of memory. Both machines have 56 cores and MT27800 Family [ConnectX-5] NICs running the latest Mellanox OFED drivers from their website.

anujkaliaiitd commented 5 years ago

Thanks for raising the issue.

It seems that for some reason eRPC is detecting congestion in the network, so sessions are getting rate-limited. The output shows that several sessions are limited to 0.12 Gbps.

Can you try re-running the experiment after recompiling eRPC with kEnableCc = false in tweakme.h? That will disable eRPC's congestion control.
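A minimal sketch of that change and rebuild, assuming kEnableCc is a compile-time boolean in tweakme.h (the exact declaration and build steps may differ by version):

```sh
# In eRPC's tweakme.h, set the congestion-control constant to false
# (assumed declaration form; check the header in your version):
#     static constexpr bool kEnableCc = false;
# Then rebuild and rerun the benchmark as before:
cmake .
make
```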

remajin commented 5 years ago

Thanks for responding to this.

This is what I get after re-running the experiment.

[screenshot: result2]

remajin commented 5 years ago

@anujkaliaiitd Even after re-running, I am not getting the expected throughput.

anujkaliaiitd commented 5 years ago

Hi. What's the link speed of your InfiniBand network?
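For reference, one way to check this (the interface name below is a placeholder):

```sh
# For RDMA-capable NICs: port state, speed/width, and link layer (InfiniBand vs Ethernet)
ibv_devinfo
# For NICs in Ethernet mode, the negotiated link speed of an interface:
ethtool ens3f0 | grep -i speed    # replace ens3f0 with your interface name
```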

remajin commented 5 years ago

Both machines have 100 GbE NICs, directly connected by a cable.

anujkaliaiitd commented 5 years ago

Thanks! I actually have 100 Gbps InfiniBand in our lab. I will try this out in the coming days and let you know.

Please note that the small RPC throughput experiment in the paper used 11 machines (instead of two as in this issue). We typically get lower throughput with fewer machines (see, for example, https://github.com/erpc-io/eRPC/issues/35).

The throughput in your experiment setup is a fair bit lower than my expectation, so I am going to look into it.

remajin commented 5 years ago

Thanks.

anujkaliaiitd commented 5 years ago

I ran small_rpc_tput on our cluster on two machines with 14-core CPUs, connected through 100 Gbps InfiniBand via a switch. I changed num_threads in the app config to 14. Congestion control is enabled. Output sample:

Process 0, thread 10: 2.608 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2607K resps, 2712K reqs. Resps/batch: min 43K, max 43K. Latency: N/A. Rate = [28.24, 59.92, 100.00, 100.00 Gbps].
Process 0, thread 7: 2.535 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2535K resps, 2712K reqs. Resps/batch: min 42K, max 42K. Latency: N/A. Rate = [30.00, 61.16, 100.00, 100.00 Gbps].
Process 0, thread 9: 2.607 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2607K resps, 2712K reqs. Resps/batch: min 43K, max 43K. Latency: N/A. Rate = [19.27, 35.32, 100.00, 100.00 Gbps].
Process 0, thread 6: 2.572 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2572K resps, 2709K reqs. Resps/batch: min 42K, max 42K. Latency: N/A. Rate = [20.05, 20.13, 100.00, 100.00 Gbps].
Process 0, thread 5: 2.678 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2677K resps, 2709K reqs. Resps/batch: min 44K, max 44K. Latency: N/A. Rate = [28.81, 37.99, 100.00, 100.00 Gbps].
Process 0, thread 4: 2.548 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2548K resps, 2690K reqs. Resps/batch: min 42K, max 42K. Latency: N/A. Rate = [100.00, 100.00, 100.00, 100.00 Gbps].
Process 0, thread 3: 2.554 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2553K resps, 2691K reqs. Resps/batch: min 42K, max 42K. Latency: N/A. Rate = [34.20, 100.00, 100.00, 100.00 Gbps].
Process 0, thread 0: 2.375 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2375K resps, 2699K reqs. Resps/batch: min 39K, max 39K. Latency: N/A. Rate = [50.35, 100.00, 100.00, 100.00 Gbps].
Process 0, thread 13: 2.511 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2510K resps, 2690K reqs. Resps/batch: min 41K, max 41K. Latency: N/A. Rate = [25.96, 100.00, 100.00, 100.00 Gbps].
Process 0, thread 8: 2.554 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2554K resps, 2694K reqs. Resps/batch: min 42K, max 42K. Latency: N/A. Rate = [14.50, 100.00, 100.00, 100.00 Gbps].
Process 0, thread 1: 2.488 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2488K resps, 2695K reqs. Resps/batch: min 41K, max 41K. Latency: N/A. Rate = [16.14, 53.73, 100.00, 100.00 Gbps].
Process 0, thread 11: 2.670 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2669K resps, 2686K reqs. Resps/batch: min 44K, max 44K. Latency: N/A. Rate = [11.79, 100.00, 100.00, 100.00 Gbps].
Process 0, thread 2: 2.393 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2393K resps, 2700K reqs. Resps/batch: min 39K, max 39K. Latency: N/A. Rate = [11.52, 12.09, 55.80, 100.00 Gbps].
Process 0, thread 12: 2.572 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2572K resps, 2692K reqs. Resps/batch: min 42K, max 42K. Latency: N/A. Rate = [9.21, 15.35, 100.00, 100.00 Gbps].
Process 0, thread 10: 2.576 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2576K resps, 2694K reqs. Resps/batch: min 42K, max 43K. Latency: N/A. Rate = [14.51, 21.34, 100.00, 100.00 Gbps].
Process 0, thread 7: 2.511 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2510K resps, 2694K reqs. Resps/batch: min 41K, max 41K. Latency: N/A. Rate = [15.91, 28.53, 100.00, 100.00 Gbps].
Process 0, thread 9: 2.581 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2581K resps, 2689K reqs. Resps/batch: min 42K, max 43K. Latency: N/A. Rate = [11.68, 60.76, 100.00, 100.00 Gbps].
Process 0, thread 6: 2.540 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2539K resps, 2693K reqs. Resps/batch: min 42K, max 42K. Latency: N/A. Rate = [16.92, 31.52, 100.00, 100.00 Gbps].
Process 0, thread 5: 2.654 Mrps, re_tx = 0, still_in_wheel = 0. RX: 2653K resps, 2690K reqs. Resps/batch: min 44K, max 44K. Latency: N/A. Rate = [19.82, 20.74, 100.00, 100.00 Gbps]

This is around 2.5 million requests per second (Mrps) per thread, so 35 Mrps total for one machine. That's close to the PCIe 3.0 x16 limit I think.

Can you verify that in your setup, the NIC is connected to socket 0? Please also attach your CMake output, as well as the first 15 lines of the output of small_rpc_tput.
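One way to check which NUMA node/socket the NIC sits on (the interface name and PCI address below are placeholders):

```sh
# NUMA node of the NIC's PCI device (-1 means no NUMA affinity reported)
cat /sys/class/net/ens3f0/device/numa_node    # replace ens3f0 with your interface name
# Or look it up by PCI address:
lspci -vv -s 5e:00.0 | grep -i numa           # replace 5e:00.0 with your NIC's PCI address
```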

remajin commented 5 years ago

[screenshots: cmake, result3]

These are the results. In my case, the machines are directly connected by a cable; there is no switch in between.

anujkaliaiitd commented 5 years ago

Please run cmake with DPERF=ON and DLOG_LEVEL=info. For now, please disable congestion control.
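A sketch of that configure step, assuming an in-tree build and the raw (Ethernet) transport; the exact option names should be checked against eRPC's CMakeLists.txt:

```sh
# Reconfigure with the performance build and info-level logging;
# TRANSPORT=raw is assumed here for ConnectX NICs in Ethernet mode
cmake . -DPERF=ON -DLOG_LEVEL=info -DTRANSPORT=raw
make
```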

remajin commented 5 years ago

With DPERF=ON, I got this error.

[screenshot: error]

Once I commented out that line, this is the result.

[screenshot: result4]

anujkaliaiitd commented 5 years ago

Thanks. Here are some suggestions.

remajin commented 5 years ago

With DLOG_LEVEL=info, this is the result.

[screenshot: info]

The NIC is connected to numa_node 1, if that's what you mean by socket. That's why I pass numa_node 1 in the command-line arguments. Would connecting the NIC to numa_node 0 increase performance?

anujkaliaiitd commented 5 years ago

Running with numa_node 1 in the script should give the best performance in your setup. The performance you are getting is low but not terrible. This might be challenging for me to fix without access to a similar cluster.

It seems your NICs are configured in Ethernet mode. I don't have access to a 100 Gbps Ethernet cluster, but I have a 100 Gbps InfiniBand cluster. Is it possible for you to configure your NICs in InfiniBand mode instead (Mellanox's "VPI" NICs allow this, and I have done this successfully for ConnectX-5 in the past), then compile with DTRANSPORT=infiniband?
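For reference, a sketch of what switching a VPI-capable ConnectX-5 to InfiniBand mode typically looks like, using mlxconfig from Mellanox's MFT tools (the MST device path is a placeholder, and the change only takes effect after a reboot or firmware reset):

```sh
# Switch port 1 of the NIC to the InfiniBand link type (1 = IB, 2 = Ethernet)
sudo mst start
sudo mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=1   # device path is a placeholder
# After a reboot/driver reload, rebuild eRPC with the InfiniBand transport:
cmake . -DPERF=ON -DTRANSPORT=infiniband
make
```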

remajin commented 5 years ago

These servers belong to my lab, but I'll see if I can do that and let you know.

remajin commented 5 years ago

It turns out the NICs we have are Ethernet-only, so I won't be able to run in InfiniBand mode. Is there any other optimization I can do?