eRPC enqueue, run_event_loop, and response(s)

psistakis commented 2 years ago

Hello,

First of all, thank you for providing the source code for eRPC as well as maintaining it.

I have a few questions regarding eRPC that I could not find in other issues (hopefully I haven't miss anything).

Is there a way to send an eRPC request and poll for the response later? (my understanding is run_event_loop() is responsible -among other things- to do both the send and the receive part.)
Is it allowed to send an eRPC request without the receiver having to enqueue a response?
Is it okay to enqueue on purpose two responses for one eRPC request? In that case, would someone have to capture the first response using run_event_loop() and (additionally) calling e.g., run_event_loop_once() for the second response?

Thank you.

anujkaliaiitd commented 2 years ago

Hi Antonis:

enqueue_request will, in the common case, place some or all packets of the request on the wire. If the request window (by default 8 requests) is full, or if the connection is congested, the request will be queued (see https://github.com/erpc-io/eRPC/blob/d35a86dcf92757b77ff187f15f7bf67a4ebc0221/src/rpc_impl/rpc_req.cc#L70) and subsequently dequeued by the event loop.
In the current implementation, the receiver must send a response. The response acts as an implicit ACK so this behavior is difficult to alter.
The current implementation does not allow multiple responses to a request (although this is a useful feature that people have asked for). Each request currently has a unique request number that's matched in the response. If the client receives two responses, it will assume that the second response's packets are duplicates and drop them.

psistakis commented 2 years ago

Hi Dr Kalia,

Thanks for the response --I appreciate it.

Best wishes and kind regards, Antonis

psistakis commented 2 years ago

Hi Dr Kalia,

I have a few more questions related to this issue:

When I send two requests (back to back) and only then call run_event_loop() I receive only one response --is this expected? Based on your answer above:

In the current implementation, the receiver must send a response. The response acts as an implicit ACK so this behavior is difficult to alter.)

shouldn't we expect two responses?

Is it possible for a thread to send (enqueue) an eRPC request, then call run_event_loop(X) in the background, and continue doing useful work (is this happening by default)? Or is it necessary for the execution on that particular thread (sender) to be stalled for X milliseconds?
Furthermore, is there a way for someone to avoid waiting X milliseconds when calling run_event_loop(X) and instead return when e.g. a response is received?
When an eRPC request arrives will the registered function be invoked from the same (one) thread that is busy waiting with run_event_loop(x) in the server side? or will a separate thread be invoked to serve the request? --if that is the case, will it be the same thread in each invocation or a different one?

Please feel free to correct me if I have misunderstood something.

Thank you.

anujkaliaiitd commented 2 years ago

It should result in two responses, else there's probably a bug in either eRPC or the application code. Are you running the event loop for a long enough duration?
This might work if you never concurrently access an Rpc object from different threads. But it also might not since the code unfortunately uses some thread-local variables that can cause issues. Another approach could be to call run_event_loop_once(), which just runs one iteration of the event loop and returns immediately if there's no Rx/Tx work to be done.
run_event_loop_once() might be an option
You can register a request handler to run in eRPC's "background threads" at the server, provided that some dedicated background threads are launched when constructing the server-side Nexus object. Please see https://github.com/erpc-io/eRPC/blob/d35a86dcf92757b77ff187f15f7bf67a4ebc0221/apps/masstree_analytics/masstree_analytics.cc#L409 for an example.

psistakis commented 2 years ago

Hello Dr Kalia.

I run the event loop for 200 milliseconds.

Case A: Using the code below, I receive two responses: enqueue_request(); cl_cntx->rpc_->run_event_loop(200); enqueue_request(); cl_cntx->rpc_->run_event_loop(200);
Case B: Using the code below, I receive only one response: enqueue_request(); enqueue_request(); cl_cntx->rpc_->run_event_loop(200);

Am I missing something? For case B, I also tried changing the loop event time from 200ms to 500ms (and 1000ms), but with no luck so far when it comes to the number of responses received.

** enqueue_request() input parameters are passed as here (https://github.com/erpc-io/eRPC/blob/d35a86dcf92757b77ff187f15f7bf67a4ebc0221/apps/latency/latency.cc#L113).

Just to make sure I undestand --are you suggesting the following (or something similar)?: a) Enqueue a new eRPC request b) Call run_event_loop_once() to send the request c) Create/allow another thread to do some other work in the local node (in the background) d) Then, call run_event_loop(200) to busy wait for the response Would the above work using eRPC?
I checked the code of run_event_loop_once(). Assuming someone needs a method that returns immediately after a response is received (and handled): Do you think it would be okay eRPC-wise for someone to modify the code and exit the event loop (run_event_loop(X) instead of run_event_loop_once()) when process_comps_st() (https://github.com/erpc-io/eRPC/blob/d35a86dcf92757b77ff187f15f7bf67a4ebc0221/src/rpc_impl/rpc_rx.cc#L6) finishes, instead of busy waiting until the end of the X milliseconds?
Thank you for pointing out this example.

Thanks.

anujkaliaiitd commented 2 years ago

Both ways should result in two responses, else there's a bug somewhere. To debug this, you can try rebuilding eRPC with cmake . -DLOG_LEVEL=trace. Then run the application and paste the contents of /tmp/erpc_trace*.
"Assuming someone needs a method that returns immediately after a response is received (and handled):" The way I do this is to repeatedly call run_event_loop_once(), and set a flag in the client's continuation function that causes the calls to run_event_loop_once() to stop.

psistakis commented 2 years ago

Hello.

Regarding point 1:

Both ways should result in two responses, else there's a bug somewhere. To debug this, you can try rebuilding eRPC with cmake . -DLOG_LEVEL=trace. Then run the application and paste the contents of /tmp/erpc_trace*.

, I would like to first share some information of the platform:

DTRANSPORT: InfiniBand
NIC model: Mellanox ConnectX-4 (if more details are needed, please let me know)
Driver: MLNX_OFED_LINUX-5.4-1.0.3.0
OS: Ubuntu 4.15.0-154

Please find below the contents of /tmp/erpc_trace* (I have replaced server name with 'server'). I ran an application in which node1 sends two (2) requests, as mentioned above, that are received from node2, but node1 receives only one response. I provide the trace generated by node1 (if needed I can do the same for node2 but it might be a bit more complicated because at the same time node2 also sends two requests to node1 --not shown below).

36:517538 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 8, pktn 0, msz 8, magic 11]. Slot [num_tx 0, num_rx 0]. 36:517596 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 0, num_rx 0]. 36:517712 TRACE: Rpc 1, lsn 0 ('server'): RX [type RESP, dsn 0, reqn 8, pktn 0, msz 1112, magic 11]. 36:522609 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:522619 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:527608 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:527614 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:532608 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:532613 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:537608 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:537613 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:542608 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:542613 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:547608 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:547613 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:552609 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:552613 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:557609 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:557614 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:562609 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:562614 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:567609 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:567614 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:572609 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:572614 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:577609 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:577614 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:582609 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:582614 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:587609 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:587614 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:592609 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:592615 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:597610 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:597616 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:602610 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:602615 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:607610 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:607615 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:612610 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:612615 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:617610 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:617615 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:622610 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:622615 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:627610 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:627615 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:632610 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:632615 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:637610 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:637615 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:642610 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:642615 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:647611 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:647616 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:652611 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:652616 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:657611 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:657619 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:662611 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:662616 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:667611 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:667616 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:672611 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:672616 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:677611 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:677618 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:682611 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:682618 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:687611 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:687616 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:692611 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:692616 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:697612 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:697616 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:702612 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:702617 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:707612 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:707617 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0]. 36:712612 REORD: Rpc 1, lsn 0 ('server'): Pkt loss suspected for req 9 ([num_tx 1, num_rx 0]). Action: Retransmitting requests. 36:712617 TRACE: Rpc 1, lsn 0 ('server'): TX [type REQ, dsn 1, reqn 9, pktn 0, msz 8, magic 11]. Slot [num_tx 1, num_rx 0].

Regarding 2.

"Assuming someone needs a method that returns immediately after a response is received (and handled):" The way I do this is to repeatedly call run_event_loop_once(), and set a flag in the client's continuation function that causes the calls to run_event_loop_once() to stop.

That's a good point. In fact I was already using this but I thought I should ask about modifying run_event_loop() in case it can be easily done, in order to avoid potentially adding different flags for different requests.

Thank you for your help.

anujkaliaiitd commented 2 years ago

Thanks! I'll look into the trace.

I should say that several of eRPC's sample applications issue multiple pending requests before polling for a response (e.g., https://github.com/erpc-io/eRPC/blob/d35a86dcf92757b77ff187f15f7bf67a4ebc0221/apps/large_rpc_tput/large_rpc_tput.cc#L155). It might be useful to try these.

psistakis commented 2 years ago

Thanks --I will also take a look at the examples. One more thing that I have noticed is that when enabling traces for debugging it always appears to miss one response (especially when using 200 as parameter for run_event_loop --when removing the 'debugging with traces', or increasing the time, e.g., to 1000, sometimes it happens, but not always). I can try increasing the time (>1000) and let you know.

Also, (and sorry for going back and forth): I just started experimenting again with enqueue_request and I was re-reading your answer above:

enqueue_request will, in the common case, place some or all packets of the request on the wire. If the request window (by default 8 requests) is full, or if the connection is congested, the request will be queued (see eRPC/src/rpc_impl/rpc_req.cc Line 70 in d35a86d if (likely(session->clientinfo.credits_ > 0)) { ) and subsequently dequeued by the event loop.

If I could use an example to understand better: Assuming we have a client that sends only one (1) request and there are no pending requests (nor congestion), would what you say mean that calling only enqueue_request is sufficient for the request to be sent to the server --i.e., without calling run_event_loop?

Thank you.

anujkaliaiitd commented 2 years ago

"Assuming we have a client that sends only one (1) request and there are no pending requests (nor congestion), would what you say mean that calling only enqueue_request is sufficient for the request to be sent to the server --i.e., without calling run_event_loop?"

enqueue_request will place the packets on the wire (see https://github.com/erpc-io/eRPC/blob/d35a86dcf92757b77ff187f15f7bf67a4ebc0221/src/rpc_impl/rpc_req.cc#L71). The packets will likely reach the server (if they're not lost). The client will need to run the event loop to receive the response.

psistakis commented 2 years ago

Thanks. I tried the scenario I described above but without running the event loop the request would not reach the remote server --I will try again, checking if it has to do with the number of packets sent. In the scenario they are lost, would running the event loop help for the recovery/retransmission?
Regarding the issue above with the lost responses: I tried increasing the timeout (even beyond 2000ms), but even then the issue sometimes occurs, but not always. Also, I have noticed that this problem always occurs when I have the trace enabled.
Another question I was having these days: Is there a way for a server to send a response to a client's eRPC request, and then send an eRPC request back to the client from the same registered function? Is there an example with that? I checked the apps folder, but I did not manage to find one. (update: I thought this question might require a different issue topic-wise, so I added it here: #79)

psistakis commented 2 years ago

Hello Dr @anujkaliaiitd .

Did you maybe have a chance to take a look at the trace?

I still have the problem of enqueuing two (2) requests that receive only (1) response after running the event loop.

Additionally, I have noticed that if I do the following:

enqueue request A
run event loop (once)
enqueue request B
run event loop (once)

then sometimes request A is sent twice (and gets only one response). Is this expected, and if so how could it be avoided?

I would appreciate any feedback on how I could debug this and if you think it is an issue in my code, or if there is anything that I could try to fix these issues.

Thanks.

P.S. Both issues appear more easily (i.e., a small number of requests is sufficient, e.g. 2) in a multi-threaded environment (i.e., 2 client - and 2 server threads for the same nexus per node).

psistakis commented 2 years ago

I do not know if it is related to the second problem (duplicate) I mentioned above, but here is another example:

3 nodes (1 client thread each) send 1 eRPC request to each other --so each node sends 2 eRPC requests, the way described above (enq, run_event_once(), enq, run_event_once()).

Node 1: receives Node 2's request twice, but responds both to Node 2 and Node 3 (I don't understand how this can happen) Node 2: receives Node 1's and Node 3's requests, and responds both to Node 1 and Node 3 (normal) Node 3: receives Node 1's and Node 2's requests, and responds both to Node 1 and Node 2 (normal)

anujkaliaiitd commented 2 years ago

Hi. I think it'll be best if you can create minimal examples (similar to the hello_world application) that reproduce these issues. The communication patterns you've described have been used successfully in various eRPC applications (see the apps folder), so your issues indicate either a bug in eRPC, or a bug in the application logic.

psistakis commented 2 years ago

Hello Dr Kalia.

Thanks for the response.

My understanding, and please correct me if I am wrong, is that all applications in the apps folder do not have "server" threads being able to send their own eRPC requests (only "client" threads are able to do that) --that is one key difference compared to my application.

Could it be that when both a client and a server thread of a node concurrently process eRPC requests (e.g., enqueue) they might lead to unpredictable results? E.g., one unpredictable outcome would be a client thread dequeuing (=receiving) the same eRPC request twice (which is what I sometimes see)?

anujkaliaiitd commented 2 years ago

There are a few applications that do this in eRPC:

There's a unit test for the case where a server issues a request from within a request handler: https://github.com/erpc-io/eRPC/blob/master/tests/client_tests/req_in_req_func_test.cc#L3
The smr Raft app also does this: A server thread gets requests from clients, and the server thread also issues requests to backups.

psistakis commented 2 years ago

Hello.

Thanks for the pointers.

I reviewed them and I did not see any significant differences compared to my erpc calls.

Adding increased timeout seemed it was sufficient to start having "simple" eRPC being sent and responded in all 3 nodes (before that I was using run_event_loop_once, or run_event_loop with smaller timeout --working with only 2 nodes there was no such issue).

However, when having more sophisticated registered functions, which for example busy wait for a while or use atomic operations, the eRPC code occasionally seg faults in line https://github.com/erpc-io/eRPC/blob/d35a86dcf92757b77ff187f15f7bf67a4ebc0221/src/nexus_impl/nexus_bg_thread.cc#L32 . I have checked and this issue is not related to the number of background threads being "exhausted" because of busy waiting.

I am using 3 nodes, and 3 background threads in each node. Each nodes sends 50 eRPC requests to the two other nodes. The problem I mention above occurs occasionally, and when it does it is typically in one node (e.g. node 3) when sending eRPC requests (after having successfully sent a few) or responses.

Could you please share your thoughts on why sometimes there could be a seg fault in https://github.com/erpc-io/eRPC/blob/d35a86dcf92757b77ff187f15f7bf67a4ebc0221/src/nexus_impl/nexus_bg_thread.cc#L32 ??

Thanks.

erpc-io / eRPC

eRPC enqueue, run_event_loop, and response(s) #77