erpc-io / eRPC

Efficient RPCs for datacenter networks
https://erpc.io/

test failed in server_failure_test&multi_process_test #21

Closed CCrainys closed 5 years ago

CCrainys commented 5 years ago

Hi Anuj,

My cluster has four Mellanox ConnectX-4 NICs: ib0 and ib1 are InfiniBand NICs; p6p1 and p6p2 are Ethernet NICs.

  mlx5_0 port 1 ==> ib0 (Up)
  mlx5_1 port 1 ==> ib1 (Down)
  mlx5_2 port 1 ==> p6p1 (Up)
  mlx5_3 port 1 ==> p6p2 (Up)

The OFED version is:

  MLNX_OFED_LINUX-4.4-2.0.7.0

The operating system is:

 CentOS Linux release 7.5.1804

I have two questions:

  1. When I run ctest and hello-world, the process output shows that eRPC automatically chooses device mlx5_3/p6p2. What should I do to select another NIC? (Changing the IP address doesn't work.)

  2. I compiled with the command "cmake . -DPERF=OFF -DTRANSPORT=raw" and then ran ctest. However, server_failure_test and multi_process_test failed with this error:

Total Test time (real) =  56.93 sec

The following tests FAILED:
  8 - server_failure_test (OTHER_FAULT)
  9 - multi_process_test (OTHER_FAULT)
 Errors while running CTest

I then ran build/server_failure_test and build/multi_process_test directly. The error output is:

server_failure_test:

server_failure_test: /root/eRPC/tests/client_tests/server_failure_test.cc:93: void generic_test_func(erpc::Nexus*, size_t): Assertion `c.num_rpc_resps == config_num_rpcs' failed.
Aborted

multi_process_test:

6:070851 WARNG: Installing flow rule for Rpc 0. NUMA node = 0. Flow RX UDP port = 36454.
6:071238 WARNG: RawTransport created for Rpc ID 0. Device mlx5_3/p6p2, port 1. IPv4 63.63.63.86, MAC ec:d:9a:c5:ba:bd. Datapath UDP port 36454.
......
multi_process_test: /root/eRPC/tests/client_tests/multi_process_test.cc:60: void process_proxy_thread_func(size_t, size_t): Assertion `c.num_rpc_resps == num_processes - 1' failed.
Aborted

Looking forward to your reply, and thanks in advance.

Best regards, Thomas

anujkaliaiitd commented 5 years ago

The NIC port can be changed using the last argument to the Rpc constructor. See docs.

The server failure handling feature isn't ready yet. I've disabled its test for now.

Can you try multi_process_test with a clean initial state (i.e., without server_failure_test having first failed)?
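One way to confirm which NIC a given port index resolved to is to read the device out of the "RawTransport created" line quoted earlier in this thread. The sketch below is a hypothetical helper, not part of eRPC: the names `NicChoice` and `parse_transport_line` and the regular expression are assumptions based only on that one log line, so the pattern may need adjusting for other transports or eRPC versions.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper (not part of eRPC): holds the NIC named in a
// "RawTransport created ..." log line.
struct NicChoice {
  std::string device;  // verbs device name, e.g. "mlx5_3"
  std::string netdev;  // network interface name, e.g. "p6p2"
  int port;            // port number printed after the device
};

// Extracts "Device <device>/<netdev>, port <n>" from a log line.
// Returns std::nullopt if the line does not contain that pattern.
std::optional<NicChoice> parse_transport_line(const std::string &line) {
  static const std::regex re(
      R"(Device\s+([\w.-]+)/([\w.-]+),\s+port\s+(\d+))");
  std::smatch m;
  if (!std::regex_search(line, m, re)) return std::nullopt;
  return NicChoice{m[1].str(), m[2].str(), std::stoi(m[3].str())};
}
```

Running the program once per candidate port index and comparing the parsed device/interface pair against the `mlx5_N ==> interface` list would show which index maps to which NIC.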

CCrainys commented 5 years ago

Hi Anuj,

Thanks for your quick reply. First, I pulled your latest commit (which disables the server failure test) and ran ctest. I found that multi_process_test sometimes passes and sometimes fails:

  100% tests passed, 0 tests failed out of 16
  Total Test time (real) =  44.86 sec

or

   Total Test time (real) =  66.34 sec
   The following tests FAILED:
  8 - multi_process_test (OTHER_FAULT)
    Errors while running CTest

When multi_process_test fails and I run build/multi_process_test alone, the error output is:

    Process 4: All sessions connected
    Process 13: All sessions connected
    Process 22: All sessions connected
    multi_process_test: /root/eRPC/tests/client_tests/multi_process_test.cc:60: void process_proxy_thread_func(size_t, size_t): Assertion `c.num_rpc_resps == num_processes - 1' failed.
    Aborted

CCrainys commented 5 years ago

Hi Anuj,

I read the code of multi_process_test. I think the problem is caused by the constants kMaxNumERpcProcesses and kTestMaxEventLoopMs. After increasing kTestMaxEventLoopMs or decreasing kMaxNumERpcProcesses, multi_process_test passes.

My cluster has 28 cores, 16 of which are used by other tasks. I think it might be a multi-process scheduling problem that occurs when the number of free CPU cores on the system is less than kMaxNumERpcProcesses.

The above is my guess. Do you agree with me? Looking forward to your reply.

Best regards, Thomas
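The oversubscription hypothesis can be made concrete with a small calculation: if more test processes run than there are free cores, each process gets only a fraction of a core, so a fixed time budget shrinks in effective terms. The sketch below is a hypothetical helper, not code from the eRPC tests; `scaled_budget_ms` and its parameters are illustrative names, and multiplying by the rounded-up ratio is only one plausible policy.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (not from eRPC): stretch a base time budget by the
// oversubscription ratio ceil(num_procs / free_cores), never below 1x.
size_t scaled_budget_ms(size_t base_ms, size_t num_procs, size_t free_cores) {
  if (free_cores == 0) free_cores = 1;  // avoid division by zero
  const size_t factor = (num_procs + free_cores - 1) / free_cores;  // ceiling
  return base_ms * std::max<size_t>(factor, 1);
}
```

With 12 free cores (28 total minus 16 busy) and, purely as an illustrative number, 32 test processes, the ratio rounds up to 3, so a 1000 ms budget would become 3000 ms. With fewer processes than free cores the budget is unchanged.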

anujkaliaiitd commented 5 years ago

Sounds right. The hardcoded value of allowed test time (kTestMaxEventLoopMs) isn't good, and I will move to more flexible timing in the future.
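As an illustration of what more flexible timing could look like, a test can loop until its completion condition holds and treat the time limit only as a generous upper bound, instead of running for a fixed duration and then asserting. This is a minimal sketch under stated assumptions, not the approach eRPC adopted: `run_until` is a hypothetical name, `step` stands in for running the event loop for a short slice, and `done` stands in for a check such as "all responses received".

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not from eRPC): repeatedly call step() until done()
// returns true or max_wait elapses. Returns true if done() was satisfied.
bool run_until(const std::function<void()> &step,
               const std::function<bool()> &done,
               std::chrono::milliseconds max_wait) {
  const auto deadline = std::chrono::steady_clock::now() + max_wait;
  while (!done()) {
    if (std::chrono::steady_clock::now() >= deadline) return false;
    step();  // e.g. run the event loop for a short slice (assumption)
  }
  return true;
}
```

A fast machine exits the loop as soon as the condition holds, while a loaded machine simply uses more of the allowance, so the assertion afterwards no longer depends on how many cores happen to be free.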

CCrainys commented 5 years ago

OK, got it.

thanks

Best Regards, Thomas

