containerd / ttrpc-rust

Rust implementation of ttrpc (GRPC for low-memory environments)
Apache License 2.0

Improve handling of socket errors in the async client #135

Closed · Kazurin-775 closed this 2 years ago

Kazurin-775 commented 2 years ago

Previously, socket errors (such as connection reset or early EOF) were not properly handled in ttrpc::r#async::Client: https://github.com/containerd/ttrpc-rust/blob/e5a5373ca13270b608457944be231ccc71adf756/src/asynchronous/client.rs#L137-L140

If such an error occurred during an RPC, the resources associated with the RPC (i.e. its sender channel) would be leaked (never used or dropped), causing the RPC to hang forever, which is not what we want in an error condition.

This patch tries to properly deliver socket errors to all pending RPC callers, and also adds some log messages useful for debugging.

Note that there are some side effects caused by this patch:

  1. The ttrpc::Error type now implements Clone, since we want to deliver the error to all pending RPC requests (instead of only one of them).
  2. The ttrpc::r#async::Client now uses tokio::sync::Mutex (instead of std::sync::Mutex) for its mutexes, which ~works better with coroutines, as well as~ allows the MutexGuard to be held across .await points (std::sync::MutexGuard is not Send), but is slightly heavier than std::sync::Mutex according to tokio's docs. A sketch follows this list.
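
A minimal sketch of how these two side effects interact (hypothetical names such as `deliver_error_to_all` and the map layout; not the actual ttrpc-rust code): the error must be Clone because each pending caller receives its own copy, and the guard on the request map is held across an .await, which is why tokio::sync::Mutex is required:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

// Stand-in for ttrpc::Error; the real type now implements Clone.
#[derive(Clone, Debug)]
struct SocketError(String);

type ReqMap = Arc<Mutex<HashMap<u32, mpsc::Sender<Result<Vec<u8>, SocketError>>>>>;

async fn deliver_error_to_all(req_map: &ReqMap, err: SocketError) {
    // Holding this guard across the .await below requires tokio's Mutex;
    // std::sync::MutexGuard is not Send, so the future could not be
    // spawned onto a multi-threaded runtime.
    let mut pending = req_map.lock().await;
    for (_stream_id, resp_tx) in pending.drain() {
        // Each pending RPC gets its own clone of the error.
        let _ = resp_tx.send(Err(err.clone())).await;
    }
}
```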

I'm not 100% sure this patch is the best way to solve the problem, so all feedback is welcome.

Tim-Zhang commented 2 years ago

@Kazurin-775 Good catch! Thank you for your pull request.

I think the current solution still has a problem: the request-sender tokio task keeps running after the error has been handled, so new requests can still be accepted and will hang forever.

How about simply aborting the tokio task, like this:

```rust
let request_sender_handler = tokio::spawn(...);

...

request_sender_handler.abort();
break;
```
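
For reference, a small self-contained example (not ttrpc-rust code) of tokio's JoinHandle::abort(): the aborted task is cancelled at its next .await point, and awaiting the handle then yields a cancelled JoinError:

```rust
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async {
        // This task parks at an .await point, like a sender task
        // waiting for new requests.
        sleep(Duration::from_secs(3600)).await;
    });

    handle.abort();

    // The aborted task resolves to a cancelled JoinError.
    let err = handle.await.unwrap_err();
    assert!(err.is_cancelled());
}
```
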
Tim-Zhang commented 2 years ago

@Kazurin-775 I see your emoji reaction; do you agree with me? Will you update the PR?

Kazurin-775 commented 2 years ago

@Tim-Zhang I think that's a good point, but personally I'm against simply aborting the sender task, since there may be a rare race condition where RPC requests already in the send queue get dropped. It would be better to do a graceful shutdown (with proper synchronization), just like the server side does.

By the way, in our further tests, we discovered that when the server side actively closes the connection while it is idle, a socket error is also triggered, which confuses users (since no RPC request has actually failed). Therefore, either the packet receive logic needs to be redesigned to avoid the error, or we should handle the special case where no RPC is pending when the error occurs.
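
A hedged sketch of that special case (hypothetical function and header size; not the actual ttrpc-rust reader loop): an EOF on an idle connection is reported as a clean close, while an EOF with RPCs still in flight is surfaced as a socket error:

```rust
use std::io::ErrorKind;
use tokio::io::{AsyncRead, AsyncReadExt};

// Outcome of one read attempt in this sketch.
enum SocketEvent {
    Header([u8; 10]),      // a packet header was read
    IdleClose,             // server closed an idle connection: not an error
    Error(std::io::Error), // error while RPCs are still pending
}

async fn read_next<R: AsyncRead + Unpin>(reader: &mut R, pending_rpcs: usize) -> SocketEvent {
    let mut header = [0u8; 10]; // assumed fixed-size message header
    match reader.read_exact(&mut header).await {
        Ok(_) => SocketEvent::Header(header),
        // EOF with nothing in flight: the peer simply closed the
        // connection, so don't report an RPC error to anyone.
        Err(e) if e.kind() == ErrorKind::UnexpectedEof && pending_rpcs == 0 => {
            SocketEvent::IdleClose
        }
        Err(e) => SocketEvent::Error(e),
    }
}
```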

These problems are harder than I previously thought, and it may be quite hard to solve them all in a single PR. Such modifications require careful consideration because of the complexity of the packet send logic.

Kazurin-775 commented 2 years ago

I have pushed another patch that tries to address the two problems. However, this new solution is rather opinionated, and it sacrifices some performance for better error handling. It also further complicates the send logic and thus requires more testing and validation.

Tim-Zhang commented 2 years ago

@Kazurin-775 Sorry for the late response. I have tested your code; it works well, but I think it's a bit complicated. I still think the solution of aborting the sender is good enough, and a graceful shutdown is not necessary when the server goes down.

Kazurin-775 commented 2 years ago

> I still think the solution of aborting the sender is good enough

What about messages already in the RequestReceiver's buffer? If there are any, they will never be processed, and their callers will hang forever. For this solution to work, the channel's buffer size would have to be set to 0, at the cost of some performance.

EDIT: this also applies to requests pending at writer.write_all(&buf).await, so setting the channel buffer size alone may not be sufficient.
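
A minimal, self-contained sketch of this concern (not ttrpc-rust code): a request accepted into a buffered channel can go unprocessed if the consuming task is aborted before it reads the message:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // A buffered channel, standing in for the request queue.
    let (tx, mut rx) = mpsc::channel::<&'static str>(16);

    let consumer = tokio::spawn(async move {
        while let Some(msg) = rx.recv().await {
            println!("processed {msg}");
        }
    });

    // The send succeeds as soon as the message enters the buffer...
    tx.send("req-1").await.unwrap();

    // ...but if the consumer is aborted first, "req-1" may never be
    // processed (whether it is depends on task scheduling).
    consumer.abort();
    let _ = consumer.await;
}
```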

Kazurin-775 commented 2 years ago

It seems that aborting the sender causes the RequestReceiver to be dropped, which in turn drops all resp_tx instances in the queue, causing the following code to fail with an error:

https://github.com/containerd/ttrpc-rust/blob/178648063bfd460e9ae1416645d75f71879689fe/src/asynchronous/client.rs#L164-L167

The same applies to all resp_tx instances in req_map, since the termination of both tokio tasks causes req_map's reference count to drop to 0.

Therefore, it turns out that aborting the sender won't actually cause RPC requests to hang unexpectedly. I will update the PR to use this approach.
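
A minimal example of the mechanism described above (using tokio::sync::oneshot for illustration; the actual channel type in ttrpc-rust may differ): dropping the sender half wakes the waiting caller with an error instead of leaving it hanging:

```rust
use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    let (resp_tx, resp_rx) = oneshot::channel::<Vec<u8>>();

    // Simulate the aborted task dropping its resp_tx.
    drop(resp_tx);

    // The pending RPC caller wakes up with a RecvError rather than
    // hanging forever.
    match resp_rx.await {
        Ok(_) => unreachable!("no response was ever sent"),
        Err(e) => eprintln!("rpc failed: sender dropped ({e})"),
    }
}
```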

Tim-Zhang commented 2 years ago

@lifupan PTAL