erpc-io / eRPC

Efficient RPCs for datacenter networks
https://erpc.io/

Scalability issue on different NUMA nodes! #68

Closed AliAbyaneh closed 2 years ago

AliAbyaneh commented 2 years ago

Updated!

I have conducted a set of experiments to test eRPC scalability using the small_rpc_tput application. In the first experiment, I used two different physical servers and increased the number of threads on each server. The system throughput seems to stop increasing when the number of threads exceeds 12.

In the second experiment, I ran both processes on localhost and increased the number of threads in each process. I have included the figures for these experiments below. I also ran the latency application, both on localhost and on two physical servers; the former latency is 3.7 usec, while the latter is 4.1 usec. The following two graphs show the results: the y-axis is throughput in Mrps, and the x-axis is the number of threads per process.

[Figures: perProcess, perThread]

I'm using InfiniBand as the transport protocol, with a dual-port 100 Gb/s NIC in each server.
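As a side note, here is a minimal libibverbs sketch (a generic check, not taken from my setup scripts; it assumes the standard verbs headers, and device names are whatever ibv_get_device_list returns) that confirms which devices and ports are visible and active:

```cpp
// Minimal libibverbs sketch: enumerate IB devices and query each port.
// Build with: g++ list_ib.cc -o list_ib -libverbs
#include <infiniband/verbs.h>
#include <cstdio>

int main() {
  int num_devices = 0;
  ibv_device** dev_list = ibv_get_device_list(&num_devices);
  if (dev_list == nullptr || num_devices == 0) {
    std::fprintf(stderr, "no InfiniBand devices found\n");
    return 1;
  }

  for (int i = 0; i < num_devices; i++) {
    ibv_context* ctx = ibv_open_device(dev_list[i]);
    if (ctx == nullptr) continue;

    ibv_device_attr dev_attr;
    if (ibv_query_device(ctx, &dev_attr) == 0) {
      std::printf("%s: %d physical port(s)\n",
                  ibv_get_device_name(dev_list[i]), dev_attr.phys_port_cnt);
      for (uint8_t port = 1; port <= dev_attr.phys_port_cnt; port++) {
        ibv_port_attr port_attr;
        if (ibv_query_port(ctx, port, &port_attr) == 0) {
          std::printf("  port %d: state=%d width=%d speed=%d\n", port,
                      static_cast<int>(port_attr.state),
                      static_cast<int>(port_attr.active_width),
                      static_cast<int>(port_attr.active_speed));
        }
      }
    }
    ibv_close_device(ctx);
  }
  ibv_free_device_list(dev_list);
  return 0;
}
```

The stock ibv_devinfo tool reports the same information, so this is just the programmatic equivalent.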

The question is: why does eRPC have much lower throughput on localhost? Is this a limitation of eRPC or of the transport layer?

After more investigation, it turned out that localhost performance is similar to the two-machine setup if I run it on NUMA node 1 (the second socket)! So the problem is not with localhost; it is with NUMA node 0 (the first socket). Surprisingly, the NIC is attached to the PCIe bus of NUMA node 0 (the first socket)!

A bit about the system: each server has two sockets, each with a 64-core AMD EPYC processor. I also ran the ibv_write_bw and ibv_write_lat benchmarks, and the results are as expected: when the test runs on NUMA node 0, performance is a little better than when I run the benchmarks on NUMA node 1.
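For anyone trying to reproduce this, here is a rough sketch of the relevant checks (not the exact commands I used): it reads the NIC's NUMA node from sysfs and pins the process to a chosen node with libnuma. The device name mlx5_0 is a placeholder for whatever your HCA is called.

```cpp
// Sketch: read the NIC's NUMA node from sysfs, then pin this process's CPUs
// and memory allocations to a chosen node with libnuma.
// Build with: g++ numa_pin.cc -o numa_pin -lnuma
// "mlx5_0" is a placeholder; substitute the name of your HCA.
#include <numa.h>
#include <cstdio>
#include <fstream>

int main() {
  // sysfs exposes the PCIe-attached NUMA node of each InfiniBand device.
  std::ifstream f("/sys/class/infiniband/mlx5_0/device/numa_node");
  int nic_node = -1;
  if (f >> nic_node) {
    std::printf("NIC is attached to NUMA node %d\n", nic_node);
  }

  if (numa_available() < 0) {
    std::fprintf(stderr, "libnuma is not available on this system\n");
    return 1;
  }

  // Keep worker threads and DMA buffers on the NIC-local socket.
  if (nic_node >= 0) {
    numa_run_on_node(nic_node);    // restrict this process to the node's CPUs
    numa_set_preferred(nic_node);  // prefer memory allocations on that node
  }

  // ... launch the eRPC benchmark threads here ...
  return 0;
}
```

With something like this, one can compare runs pinned to the NIC-local node against the remote node, which is the comparison described above.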