erpc-io / eRPC

Efficient RPCs for datacenter networks

Questions regarding eRPC's thread model #102

Closed: jiangxiaosheng closed this issue 1 year ago

jiangxiaosheng commented 1 year ago

Thanks for providing this fast RPC framework! I recently started programming with eRPC and found that, as far as I can tell, some necessary clarifications of the threading model are missing, which can lead to bugs that are hard to trace.

The event loop must run in the same thread in which the Rpc endpoint is created; otherwise the RPC request is simply lost, i.e., the server never receives it. There is no runtime error reporting this situation, since the check is done in an assert statement in the source code, which is compiled out in release builds.
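For reference, a minimal sketch of the pattern that does work for me (assuming the names from the hello_world example; error handling omitted):

std::thread worker([&nexus] {
  // The thread that constructs the Rpc object is also the thread that
  // drives its event loop, so the assert never fires.
  erpc::Rpc<erpc::CTransport> rpc(&nexus, nullptr /* context */,
                                  0 /* Rpc ID */, sm_handler);
  // ... create_session(), alloc_msg_buffer_or_die(), enqueue_request() ...
  rpc.run_event_loop(100);  // same thread as the Rpc constructor
});
worker.join();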

After fixing this bug, I let a server-side background thread run the event loop, while on the client side the event loop runs only after an enqueue_request call, as in the hello_world example. However, if I take the same approach as in the server-side program, i.e. spawning a background thread that busy-loops on the event loop while the client thread just enqueues requests, the RPC request is lost again.

A simple program that illustrates my approach (I made only minor modifications to client.cc from the hello_world example):

#include "common.h"
#include <atomic>
#include <chrono>
#include <thread>
erpc::Rpc<erpc::CTransport> *rpc;
erpc::MsgBuffer req;
erpc::MsgBuffer resp;

void cont_func(void *, void *) { printf("%s\n", resp.buf_); }

void sm_handler(int, erpc::SmEventType, erpc::SmErrType, void *) {}

int main() {
  std::string client_uri = kClientHostname + ":" + std::to_string(kClientUDPPort);
  erpc::Nexus nexus(client_uri);

  int session_num;
  std::atomic<bool> ok{false};  // std::atomic, not volatile, for cross-thread visibility
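  // Background thread: constructs the Rpc endpoint and drives its event loop.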
  std::thread th([&] {
    rpc = new erpc::Rpc<erpc::CTransport>(&nexus, nullptr, 0, sm_handler);
    std::string server_uri = kServerHostname + ":" + std::to_string(kServerUDPPort);
    session_num = rpc->create_session(server_uri, 0);
    while (!rpc->is_connected(session_num)) rpc->run_event_loop_once();
    ok = true;

    while (ok)
      rpc->run_event_loop_once();
  });

  while (!ok)
    std::this_thread::sleep_for(std::chrono::milliseconds(100));

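  // Main thread: allocates buffers and enqueues the request while the
  // background thread is concurrently running the event loop.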
  req = rpc->alloc_msg_buffer_or_die(kMsgSize);
  resp = rpc->alloc_msg_buffer_or_die(kMsgSize);

  rpc->enqueue_request(session_num, kReqType, &req, &resp, cont_func, nullptr);
  std::this_thread::sleep_for(std::chrono::seconds(10));
  ok = false;
  th.join();

  delete rpc;
}

The approach in client.cc does work. I was just thinking that running the event loop in a background thread on the client could help reduce tail latency, because other work such as congestion control and retransmission (if eRPC provides such mechanisms) could be handled in the background.

I guess it might be because the enqueue_request call is not thread-safe when made from the user thread. So is it best practice to run the event loop reactively on the client side, as the hello_world example does? I would really appreciate it if anyone could clarify this.
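One workaround I can imagine (a sketch under my own assumptions, not an official eRPC API) is to marshal requests over to the Rpc-owning thread, so that both enqueue_request and the event loop run on that thread:

#include <atomic>
#include <functional>
#include <mutex>
#include <vector>

std::mutex q_mutex;
std::vector<std::function<void()>> pending;  // work destined for the Rpc thread

// Called from any thread: hand work over instead of touching `rpc` directly.
void post(std::function<void()> fn) {
  std::lock_guard<std::mutex> g(q_mutex);
  pending.push_back(std::move(fn));
}

// Body of the Rpc-owning thread: drain the queue, then poll the event loop.
void rpc_thread_loop(std::atomic<bool> &running) {
  while (running) {
    std::vector<std::function<void()>> batch;
    {
      std::lock_guard<std::mutex> g(q_mutex);
      batch.swap(pending);
    }
    for (auto &fn : batch) fn();  // e.g. rpc->enqueue_request(...)
    rpc->run_event_loop_once();
  }
}

The main thread would then call post([&] { rpc->enqueue_request(session_num, kReqType, &req, &resp, cont_func, nullptr); }) rather than calling enqueue_request itself.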

ankalia commented 1 year ago

Hi Sheng. Thanks for raising this question.

Regarding your use case: the computation logic for congestion control and retransmissions is quite simple and does not contribute significantly to tail latency. If we split this work out of the rest of the RPC processing onto a separate core, the cross-core coordination would hurt far more. If you're interested in this topic, this paper presents some great material: https://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf.
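To make the intended model concrete: each Rpc object is owned by exactly one thread, which makes all calls on it and runs its event loop. A minimal sketch of scaling this out, assuming the hello_world names (kClientHostname etc.) and headers:

#include <string>
#include <thread>

void worker(erpc::Nexus *nexus, uint8_t rpc_id) {
  // One Rpc endpoint per thread; it is never shared across threads.
  erpc::Rpc<erpc::CTransport> rpc(nexus, nullptr /* context */, rpc_id, sm_handler);
  // ... create sessions, allocate buffers, and enqueue requests here only ...
  rpc.run_event_loop(1000);  // congestion control etc. run inline on this core
}

int main() {
  erpc::Nexus nexus(kClientHostname + ":" + std::to_string(kClientUDPPort));
  std::thread t0(worker, &nexus, uint8_t{0});
  std::thread t1(worker, &nexus, uint8_t{1});
  t0.join();
  t1.join();
}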

jiangxiaosheng commented 1 year ago

Thanks Anuj for the quick response and clear clarification! Yes, I'd like to make a PR adding more comments regarding the thread model.