Closed: yiwenzhang92 closed this issue 5 years ago.
Hi Yiwen. Thanks for trying eRPC out.
The "Rate = [0.12, 0.12, 0.24, 0.28 Gbps]" shows rate percentiles for the thread's sessions. These are computed by Timely, and they are indeed very small.
Here are a few suggestions:
- Try disabling congestion control by setting kEnableCc = false in tweakme.h (a sketch of this change follows these suggestions). In my experience, Timely usually works, but it is admittedly the least robust component of eRPC. I expect DCQCN (not implemented yet) to be more robust.
- The throughput experiment in the paper used num_processes = 11 and num_threads = 1. In addition, the 11 nodes were under the same switch (see https://github.com/efficient/eRPC/blob/master/scripts/topo/ibnet_proc.sh).
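For reference, the congestion-control switch is a compile-time constant, so the change is roughly the following (the exact declaration and its surroundings in tweakme.h may differ across eRPC versions; rebuild after editing):

```cpp
// tweakme.h (sketch): compile-time configuration knobs.
// Setting kEnableCc to false disables Timely congestion control,
// so requests are transmitted without rate limiting.
static constexpr bool kEnableCc = false;
```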
Thanks for your quick reply, Anuj.
With r320 machines, I can indeed see improved performance. With single-thread 2-node and 4-node settings, I get ~1.6 Mrps and ~2.6 Mrps respectively, and it looks like performance will keep increasing if I add more nodes. With the modded driver the numbers might improve further. I don't see much difference when I disable Timely.
As for the C6220, disabling Timely doesn't help, so I think the Timely implementation is robust; the C6220 nodes themselves might have their own issues.
Thanks again for the great help.
Best, Yiwen
Hi Anuj,
I tried eRPC's small_rpc_tput but got low per-thread throughput (~0.4-0.5 Mrps vs. the ~4 Mrps mentioned in the eRPC paper), and I'm wondering what I did wrong. I'll post my setup and app config below.
First, my experiment setup (CloudLab Apt Utah, C6220 nodes):
- NIC: ConnectX-3 Pro (InfiniBand)
- Driver: MLNX_OFED 4.2-1.0.0.0
- OS: Ubuntu 14.04
- gcc 8.1.0, cmake 3.12.0, gflags 2.2.1, glog 0.3.5
- Hugepages and shared memory are configured

I configure the build with cmake . -DPERF=ON -DTRANSPORT=infiniband, and I'm using the latest commit of the master branch.
Now the small_rpc_tput configuration: --test_ms 20000 --sm_verbose 0 --batch_size 5 --concurrency 60 --msg_size 32 --num_processes 2 --num_threads 4 --numa_0_ports 0 --numa_1_ports 1,3
And a snapshot of the printout:
Process 0, thread 0: 0.488 Mrps, re_tx = 0, still_in_wheel = 0. RX: 487K resps, 487K reqs. Resps/batch: min 8K, max 8K. Latency: N/A. Rate = [0.12, 0.12, 0.24, 0.28 Gbps].
Process 0, thread 1: 0.485 Mrps, re_tx = 0, still_in_wheel = 0. RX: 484K resps, 487K reqs. Resps/batch: min 8K, max 8K. Latency: N/A. Rate = [0.12, 0.12, 0.16, 0.24 Gbps].
Process 0, thread 3: 0.486 Mrps, re_tx = 0, still_in_wheel = 0. RX: 485K resps, 485K reqs. Resps/batch: min 8K, max 8K. Latency: N/A. Rate = [0.12, 0.12, 0.16, 0.99 Gbps].
Process 0, thread 2: 0.485 Mrps, re_tx = 0, still_in_wheel = 0. RX: 485K resps, 485K reqs. Resps/batch: min 8K, max 8K. Latency: N/A. Rate = [0.12, 0.12, 0.21, 0.46 Gbps].
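As a unit sanity check on the Mrps and Gbps figures in this printout, here is a payload-only conversion between request rate and bandwidth (illustrative arithmetic only; it ignores eRPC and transport headers, and the printed Timely rates are per-session limits rather than per-thread throughput):

```cpp
#include <cstddef>
#include <cstdio>

// Rough payload-only conversion from request rate to bandwidth.
// Wire-level rates are higher because every packet also carries
// eRPC and transport headers.
double mrps_to_gbps(double mrps, std::size_t msg_size_bytes) {
  return mrps * 1e6 * msg_size_bytes * 8 / 1e9;
}

int main() {
  // ~0.49 Mrps of 32-byte requests is about 0.125 Gbps of payload.
  std::printf("%.3f Gbps\n", mrps_to_gbps(0.488, 32));
  // The paper's ~4 Mrps per thread would be about 1.0 Gbps of payload.
  std::printf("%.3f Gbps\n", mrps_to_gbps(4.0, 32));
}
```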
I tried to debug this issue, and here are my current findings:
(1) The Timely Rate numbers seem small. What does this rate mean, and is Timely working properly?
(2) Increasing batch_size and adjusting concurrency does not affect the performance.
(3) Increasing the number of processes does not affect the performance (I tried both 2-node and 4-node settings and got the same results).
(4) Applying the modded MLNX driver you provided does not change the performance.
(5) Setting num_threads = 1 instead of 4 makes the Timely rate = [56.0, 56.0, 56.0, 56.0], and per-thread throughput becomes ~0.9 Mrps.
It would be great if you could give some hints on how to debug this issue. Thanks!
Best, Yiwen