erpc-io / eRPC

Efficient RPCs for datacenter networks

Low Performance using IB with CX-3 #14

Closed: yiwenzhang92 closed this issue 5 years ago

yiwenzhang92 commented 5 years ago

Hi Anuj,

I tried eRPC's small_rpc_tput but got low per-thread throughput (~0.4-0.5 Mrps vs. the ~4 Mrps reported in the eRPC paper), and I'm wondering where I went wrong. I'll post my setup and app config below.

First, my experiment setting (CloudLab Apt Utah, C6220 nodes):
NIC: ConnectX-3 Pro (InfiniBand)
Driver: MLNX_OFED 4.2-1.0.0.0
OS: Ubuntu 14.04
gcc: 8.1.0
cmake: 3.12.0
gflags: 2.2.1
glog: 0.3.5
Hugepages and shared memory are configured.
I configure cmake using: cmake . -DPERF=ON -DTRANSPORT=infiniband
I'm using the latest commit of the master branch.

Now the small_rpc_tput configuration: --test_ms 20000 --sm_verbose 0 --batch_size 5 --concurrency 60 --msg_size 32 --num_processes 2 --num_threads 4 --numa_0_ports 0 --numa_1_ports 1,3

And a snapshot of the printout:
Process 0, thread 0: 0.488 Mrps, re_tx = 0, still_in_wheel = 0. RX: 487K resps, 487K reqs. Resps/batch: min 8K, max 8K. Latency: N/A. Rate = [0.12, 0.12, 0.24, 0.28 Gbps].
Process 0, thread 1: 0.485 Mrps, re_tx = 0, still_in_wheel = 0. RX: 484K resps, 487K reqs. Resps/batch: min 8K, max 8K. Latency: N/A. Rate = [0.12, 0.12, 0.16, 0.24 Gbps].
Process 0, thread 3: 0.486 Mrps, re_tx = 0, still_in_wheel = 0. RX: 485K resps, 485K reqs. Resps/batch: min 8K, max 8K. Latency: N/A. Rate = [0.12, 0.12, 0.16, 0.99 Gbps].
Process 0, thread 2: 0.485 Mrps, re_tx = 0, still_in_wheel = 0. RX: 485K resps, 485K reqs. Resps/batch: min 8K, max 8K. Latency: N/A. Rate = [0.12, 0.12, 0.21, 0.46 Gbps].

I tried to debug this issue, and here are my current findings:
(1) The Timely Rate numbers seem small. What does this rate mean, and is Timely working properly?
(2) Increasing batch_size and adjusting concurrency does not affect the performance.
(3) Increasing the number of processes does not affect the performance (I tried both 2-node and 4-node settings and got the same results).
(4) Applying the modded MLNX driver you provided does not change the performance.
(5) Setting num_threads = 1 instead of 4 makes the Timely rate [56.0, 56.0, 56.0, 56.0], and per-thread tput becomes ~0.9 Mrps.

It would be great if you could give some hints on how to debug this issue. Thanks!

Best, Yiwen

anujkaliaiitd commented 5 years ago

Hi Yiwen. Thanks for trying eRPC out.

The "Rate = [0.12, 0.12, 0.24, 0.28 Gbps]" shows rate percentiles for the thread's sessions. These are computed by Timely, and they are indeed very small.

Here are a few suggestions:

yiwenzhang92 commented 5 years ago

Thanks for your quick reply Anuj.

With r320 machines, I do see improved performance. With a single-threaded 2-node setting and a 4-node setting, I get ~1.6 Mrps and ~2.6 Mrps respectively, and it looks like the performance keeps increasing as I add more nodes. The numbers might also improve with the modded driver. I don't see much difference when I disable Timely.

As for the C6220, disabling Timely doesn't help, and I think the Timely implementation is robust; the C6220 itself might have its own issues.

Thanks again for the great help.

Best, Yiwen