containerd / ttrpc

GRPC for low-memory environments
Apache License 2.0

client: Handle sending/receiving in separate goroutines #94

Closed: kevpar closed this 2 years ago

kevpar commented 2 years ago

Late last year I created #73 in an attempt to solve #72. That PR received feedback that I never had a chance to properly address. I've now revisited this area and taken a different approach, which I think is better presented in a fresh PR, so that's what we have here.

Originally I had changed both the client and server code to prevent this deadlock on both ends. However, after thinking on it recently, I believe the server code is fine as-is. The core issue is that the client would block sending to the server while the server blocked sending back to the client, so neither read from the other and we deadlocked. That is clearly undesirable behavior for the client, but on the server end it seems reasonable to block reads from a client that is not draining its side of the connection. So this new approach only changes the client, allowing its sending and receiving to operate independently.
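To make the failure mode concrete, here is a minimal standalone Go sketch (not ttrpc code) of the pattern: each side writes before reading, so each blocks waiting for the other to drain, and neither ever reaches its read. `net.Pipe` makes the blocking immediate, but a TCP connection behaves the same once the buffers on both sides fill up.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// net.Pipe is fully synchronous: a Write blocks until the peer Reads.
	// Real TCP has kernel buffers, but the same deadlock occurs once a
	// large enough message fills them on both sides.
	clientConn, serverConn := net.Pipe()

	go func() {
		// "Client": writes a request before reading any response.
		clientConn.Write([]byte("request")) // blocks: server never reads
		clientConn.Read(make([]byte, 1))    // never reached
	}()

	go func() {
		// "Server": writes a response before draining the request.
		serverConn.Write([]byte("response")) // blocks: client never reads
		serverConn.Read(make([]byte, 1))     // never reached
	}()

	time.Sleep(time.Second)
	fmt.Println("both sides still blocked in Write: deadlocked")
}
```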

I intend to close out #73 assuming this new approach is reasonable.

Commit description

Changes the TTRPC client logic so that sending to and receiving from the server run in completely independent goroutines, with shared state guarded by a mutex. Previously, sending and receiving were tied together by their reliance on a coordinator goroutine. This led to issues where, if the server was not reading from the connection, the client could get stuck sending a request and therefore never read responses from the server. See [1] for more details.

The new design sets up separate sending/receiving goroutines. These share state in the form of the set of active calls that have been made to the server. This state is encapsulated in the callMap type and access is guarded by a mutex.
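For illustration, here is a rough sketch of the shape such a callMap could take; all names and details are hypothetical, not the actual ttrpc code. The key property is that a call is registered before its request hits the wire, so the receiver goroutine can always find a waiter for any response it decodes.

```go
package main

import "sync"

// response stands in for a decoded reply from the server (illustrative).
type response struct {
	payload []byte
	err     error
}

// callMap guards the set of in-flight calls shared between the sender
// and receiver goroutines. Everything here is a sketch; the real ttrpc
// type differs in detail.
type callMap struct {
	mu          sync.Mutex
	activeCalls map[uint32]chan *response // keyed by stream/request ID
	closeErr    error                     // set once the client shuts down
}

// set registers a call before its request is written, so a response
// arriving immediately still finds a waiter.
func (cm *callMap) set(id uint32, ch chan *response) error {
	cm.mu.Lock()
	defer cm.mu.Unlock()
	if cm.closeErr != nil {
		return cm.closeErr
	}
	cm.activeCalls[id] = ch
	return nil
}

// get removes and returns the waiter for a response that just arrived.
func (cm *callMap) get(id uint32) (chan *response, bool) {
	cm.mu.Lock()
	defer cm.mu.Unlock()
	ch, ok := cm.activeCalls[id]
	if ok {
		delete(cm.activeCalls, id)
	}
	return ch, ok
}
```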

The main event loop in run previously handled a lot of state management for the client. Now that most state is tracked by the callMap, the loop mostly exists to notice when the client is closed and take the appropriate cleanup action.
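Under that split, the remaining loop can be as small as the following sketch, which extends the hypothetical callMap above (again, the names are invented, not ttrpc's): wait for closure, fail the pending calls so their waiters unblock, then tear down the connection.

```go
// Extends the callMap sketch above; needs "errors" and "net" imported.

var errClosed = errors.New("ttrpc: client is closed")

// client sketch: a connection, the shared call state, and a channel
// closed when shutdown begins.
type client struct {
	conn   net.Conn
	calls  *callMap
	closed chan struct{} // closed by Close() or on a receive error
}

// run no longer coordinates every send/receive; it just waits for the
// client to close, then fails whatever calls are still pending.
func (c *client) run() {
	<-c.closed

	c.calls.mu.Lock()
	c.calls.closeErr = errClosed
	for id, ch := range c.calls.activeCalls {
		ch <- &response{err: errClosed} // assumes waiter channels are buffered
		delete(c.calls.activeCalls, id)
	}
	c.calls.mu.Unlock()

	c.conn.Close()
}
```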

Also did some minor code cleanup. For instance, the code was previously written to support multiple receiver goroutines, though this was not actually used. I've removed this for now, since the code is simpler this way, and it's easy to add back if we actually need it in the future.

[1] https://github.com/containerd/ttrpc/issues/72

Signed-off-by: Kevin Parsons kevpar@microsoft.com

codecov-commenter commented 2 years ago

Codecov Report

Merging #94 (4f0aeb5) into main (77fc8a4) will increase coverage by 0.47%. The diff coverage is 75.34%.


@@            Coverage Diff             @@
##             main      #94      +/-   ##
==========================================
+ Coverage   72.88%   73.35%   +0.47%     
==========================================
  Files          11       11              
  Lines         708      728      +20     
==========================================
+ Hits          516      534      +18     
+ Misses        155      153       -2     
- Partials       37       41       +4     
Impacted Files   Coverage Δ
client.go        79.60% <75.34%> (-1.07%) ↓
server.go        77.65% <0.00%> (+1.46%) ↑

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 77fc8a4...4f0aeb5.

dcantah commented 2 years ago

What did validation for this consist of? Main reason I'm asking is because the sample program in the issue linked takes me to 404 land :/

kevpar commented 2 years ago

> What did validation for this consist of? Main reason I'm asking is because the sample program in the issue linked takes me to 404 land :/

That repo was made private in the last few months. I've put up a newer version of the test tool here: https://github.com/kevpar/ttrpc-deadlock

For validation I've run my test tool (which repros the deadlock) for ~10 minutes and seen no deadlock occur. Other than that the main validation is what we get through CI.