Closed: Kazurin-775 closed this pull request 2 years ago
@Kazurin-775 Good catch! Thank you for your pull request.
I think the current solution still has a problem: the tokio request-sender task keeps running after the error has been handled, so new requests can still be accepted and will hang forever.
How about simply aborting the tokio task, like this:
let request_sender_handler = tokio::spawn(...);
...
request_sender_handler.abort();
break;
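For reference, a minimal sketch of that abort-on-error idea might look like the following. The names (`client_io_loop`, `read_response`) and the error handling are placeholders for illustration, not the actual ttrpc-rust code:

```rust
use tokio::task::JoinHandle;

// Placeholder for reading one response frame from the socket; in the real
// client this would be the receive half of the connection.
async fn read_response() -> Result<Vec<u8>, std::io::Error> {
    Err(std::io::Error::new(std::io::ErrorKind::ConnectionReset, "reset"))
}

async fn client_io_loop() {
    // The sender task forwards queued requests to the socket.
    let request_sender_handler: JoinHandle<()> = tokio::spawn(async {
        // ... send loop elided ...
    });

    loop {
        match read_response().await {
            Ok(_frame) => { /* dispatch the response ... */ }
            Err(e) => {
                eprintln!("socket error: {e}, shutting down client I/O");
                // Stop the sender so it no longer accepts new requests,
                // then leave the receive loop.
                request_sender_handler.abort();
                break;
            }
        }
    }
}
```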
@Kazurin-775 I saw your emoji; do you agree with me? Will you update the PR?
@Tim-Zhang I think that's a good point, but personally I'm against simply aborting the sender task, since there may be a rare race condition where RPC requests already in the send queue get dropped. It would be better to do a graceful shutdown (with proper synchronization), just like the server side does.
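For comparison, a rough sketch of a graceful shutdown of the sender could look like this; the shutdown signal and queue types are hypothetical, and this is only the drain-before-exit idea, not the actual server logic:

```rust
use tokio::sync::{mpsc, oneshot};

// Hypothetical sender task: on shutdown, stop accepting new requests but
// drain whatever is already queued, so nothing in flight is silently dropped.
async fn request_sender(
    mut queue: mpsc::Receiver<Vec<u8>>,
    mut shutdown: oneshot::Receiver<()>,
) {
    loop {
        tokio::select! {
            _ = &mut shutdown => {
                // Drain the remaining buffered requests before exiting.
                while let Ok(buf) = queue.try_recv() {
                    write_to_socket(&buf).await;
                }
                break;
            }
            msg = queue.recv() => match msg {
                Some(buf) => write_to_socket(&buf).await,
                None => break, // all request senders dropped
            },
        }
    }
}

// Placeholder for the actual socket write.
async fn write_to_socket(_buf: &[u8]) {}
```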
By the way, in our further tests we discovered that when the server side actively closes the connection while it is idle, a socket error is also triggered, which confuses users (since no RPC request has actually failed). Therefore, either the packet receive logic needs to be redesigned to avoid the error, or we should handle the special case where no RPC is pending when an error occurs.
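One possible shape for that special case (hypothetical names and types, not the actual client code): only surface the error to callers when something is actually in flight.

```rust
use std::collections::HashMap;
use tokio::sync::oneshot;

// Hypothetical map of in-flight requests: stream id -> response channel.
type ReqMap = HashMap<u32, oneshot::Sender<Result<Vec<u8>, String>>>;

fn on_socket_error(req_map: &mut ReqMap, err: &std::io::Error) {
    if req_map.is_empty() {
        // The server closed an idle connection: no caller is affected, just log it.
        eprintln!("connection closed while idle: {err}");
        return;
    }
    // Otherwise deliver the error to every pending request.
    for (_id, resp_tx) in req_map.drain() {
        let _ = resp_tx.send(Err(err.to_string()));
    }
}
```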
These problems are harder than I previously thought, and it may be quite hard to solve them all in a single PR. Such modifications require careful consideration because of the complexity of the packet send logic.
I have pushed another patch that tries to address the 2 problems. But this new solution is rather opinionated, and it sacrifices some performance for better error handling. It also further complicates the send logic and thus requires more tests and validation.
@Kazurin-775 Sorry for the late response. I have tested your code; it works well, but I think it's a bit complicated.
I still think the solution of aborting the sender is good enough, and a graceful shutdown is not necessary when the server goes down.
> I still think the solution of aborting the sender is good enough
What about messages already in the `RequestReceiver`'s buffer? If there are any, they won't get processed and will hang forever. If we want this solution to work, the buffer size of the channel should be set to 0, at the cost of some performance.

EDIT: this also applies to requests pending at `writer.write_all(&buf).await`, so setting the channel buffer size alone may not be sufficient.
It seems that aborting the sender causes the `RequestReceiver` to be dropped, which in turn drops all `resp_tx` instances in the queue, causing the following code to fail with an error:

The same applies to all `resp_tx` instances in `req_map`, since the termination of both tokio tasks causes `req_map`'s reference count to drop to 0.

Therefore, it turns out that aborting the sender actually won't cause unexpected hangs of RPC requests. I will update the PR to use this approach.
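A quick standalone way to see this behavior (not project code, just tokio's `oneshot` semantics): when the task holding the sender (standing in for `resp_tx`) is aborted, the sender is dropped and the waiting side gets an error instead of hanging.

```rust
use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    let (resp_tx, resp_rx) = oneshot::channel::<String>();

    // Simulates the sender task that gets aborted before it can reply.
    let sender_task = tokio::spawn(async move {
        // Hold resp_tx until the task is cancelled; aborting drops it.
        let _resp_tx = resp_tx;
        std::future::pending::<()>().await;
    });

    sender_task.abort();

    // The pending "RPC" observes a RecvError rather than hanging forever.
    match resp_rx.await {
        Ok(resp) => println!("got response: {resp}"),
        Err(e) => println!("sender dropped: {e}"),
    }
}
```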
@lifupan PTAL
Previously, socket errors (such as connection reset or early EOF) were not properly handled in `ttrpc::r#async::Client`: https://github.com/containerd/ttrpc-rust/blob/e5a5373ca13270b608457944be231ccc71adf756/src/asynchronous/client.rs#L137-L140

If such an error occurred during an RPC, the resources associated with the RPC (i.e. its sender channel) would be leaked (never used or dropped), causing the RPC to hang forever, which is not what we want in an erroneous condition.

This patch tries to properly deliver socket errors to all pending RPC callers, and adds some log messages useful for debugging.

Note that there are some side effects caused by this patch:

- The `ttrpc::Error` type now becomes `Clone`able, since we want to deliver the error to all pending RPC requests (instead of only one of them).
- `ttrpc::r#async::Client` now uses `tokio::sync::Mutex` (instead of `std::sync::Mutex`) for its mutexes, which ~works better with coroutines, as well as~ allows the `MutexGuard` type to be held across `await`s (`std::sync::MutexGuard` is not `Send`), but is slightly heavier than `std::sync::Mutex` according to tokio's docs.

I'm not 100% sure this patch is the best way to solve the problem, so all feedback is welcome.
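To illustrate the shape of the error-broadcast idea (the names `RpcError`, `ReqMap`, and `deliver_socket_error` are made up for this sketch and are not the actual ttrpc-rust types):

```rust
use std::{collections::HashMap, sync::Arc};
use tokio::sync::{oneshot, Mutex};

// The error must be Clone so that every pending caller can receive a copy.
#[derive(Clone, Debug)]
struct RpcError(String);

// Map of in-flight requests: stream id -> channel used to deliver the response.
type ReqMap = Arc<Mutex<HashMap<u32, oneshot::Sender<Result<Vec<u8>, RpcError>>>>>;

// On a socket error, hand a clone of the error to every pending request instead
// of leaking their channels and leaving the callers hanging forever.
async fn deliver_socket_error(req_map: &ReqMap, err: RpcError) {
    // tokio::sync::Mutex is used because its guard may need to be held across
    // .await points elsewhere in the client (std::sync::MutexGuard is not Send).
    let mut pending = req_map.lock().await;
    for (_stream_id, resp_tx) in pending.drain() {
        // A caller that already gave up has dropped its receiver; ignore that.
        let _ = resp_tx.send(Err(err.clone()));
    }
}
```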