Closed tegefaulkes closed 9 months ago
I don't like the idea of a grace time to resolve this. Can you think of a more deterministic protocol to solve this problem? I thought I originally considered this already. That is, when the client times out, that's a client error, nothing to do with the server. If the server times out, that's a server error returned back to the client. If they have the same timeout then whichever comes first will be the origin of the response. So I disagree with the additional non-deterministic grace time in principle.
The problem is that the client will always go first because it doesn't deal with the network delay. As it stands the server can't gracefully handle a timeout at all if the client always closes the stream at the same time.
It's not really a grace time I'm suggesting; we just need the client side to have a longer timeout to allow the server to respond. It would be as simple as setting a longer default timeout for the client, but the timeout middleware overrides that.
Longer or shorter, it's all still non-deterministic. The server should just do a noop if the client times out.
Or we allow the user to pass in a handler for client timeout, and this just logs out a warning message.
The client is allowed to timeout earlier. The server handler is supposed to gracefully terminate without error.
That's all fine. The main problem is we don't have ANY way for an RPC call to respond gracefully with a message after it times out. It defeats the point of the server handler timeout being advisory if the client just closes the stream before the server can respond.
I'd like for the handler to be able to end gracefully after it times out. The only other option is that it always errors. But we can't even provide a useful error since it's the client cancelling the stream.
The server side should have the main agency over the stream, the client timeout should only ever be the last option fail-safe for ending the stream.
I don't understand this. Can you draw a diagram pointing to where the complication is?
Here is a diagram explaining the problem as it is.
And here is what we expected to happen
Note the following details.
This is a good explanation thanks. Let's do more of this in the future.
My initial thoughts are that this is due to two separate, mixed-up semantics of the client timeout parameter.
And because of this mixed-up meaning, we have a UI/UX issue where the client side reports a generic client RPC timeout error, but not a ping failure error from the server side.
This can be difficult to solve, because the solution depends on what we want at the top level.
If I think semantic 1 is the more important semantic, then the current behaviour as-is is correct, barring the server erroring out, which it should not.
If I think semantic 2 is the more important semantic, then we would want to follow your diagram.
Is there a third way that might combine these together?
I prefer semantic 2, where the timeout handling is done by the server side. It has more of the information needed to decide how to handle it. That said, we do need a way for the client to handle bad actors, where after a certain amount of time we give up on the RPC call and just cancel it.
Combining the two would just be having one timeout for how long we want the server to take and a second timeout for when we give up on the whole call. The difference between these two would be considered a grace time.
In my mind implementing both semantics just means having a slightly longer client fail-safe timeout.
It seems that your proposed solution is to give the client a grace timeout of, say, 1000ms that allows the server to respond with an error. If the server doesn't respond with that error, then the client should time out with a generic timeout error?
Pretty much, but it's less a grace timeout and more two separate call timeout and kill timeout options. Depends how we want to go about it.
I think just adding 1000 to the internal client timeout is sufficient. That is if the client timeout is 10000, the real client timeout is 11000, but 10000 is communicated to the server via metadata.
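A minimal sketch of what that split could look like; the helper name and the 1000ms constant here are assumptions for illustration, not the actual js-rpc internals:

// Hypothetical sketch only: the timeout the caller asks for is advertised to
// the server, while the client's own fail-safe timer gets the grace time added.
const GRACE_TIME = 1000; // ms, assumed default grace time

function splitTimeouts(requestedTimeout: number) {
  return {
    advertisedTimeout: requestedTimeout, // communicated to the server via metadata
    clientTimeout: requestedTimeout + GRACE_TIME, // local hard-cancel fail-safe
  };
}

// e.g. a requested timeout of 10000 advertises 10000 and hard-cancels at 11000
const { advertisedTimeout, clientTimeout } = splitTimeouts(10000);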
Sounds good.
Looking at making the change, it's not quite as simple as just adding 1000 to the client timeout separately from what's sent to the server. The timeout is provided through the ctx when making the call. It could be a number or an existing timer when provided to the ctx. So there's no real good place to add the grace time to the timer; the only real option is to subtract the value in the timeout middleware. But subtraction leaves us open to some invalid values. How do we want to handle this?
The main problem is that we end up with a negative timeout sent to the server. Minimally we should clamp this to 0, but this means any timeout specified that is less than the grace time could end up with some weird implicit behaviour. For example, if a call is made with a timeout less than the grace time then the handler will time out before it can do anything.
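To make the problem concrete, a rough sketch of the subtraction-and-clamp approach described above (the function name and the 1000ms grace value are assumptions):

const GRACE_TIME = 1000; // ms, assumed

// Timeout middleware sketch: subtract the grace time before telling the server,
// clamping to 0 so a negative timeout is never sent.
function handlerTimeout(clientTimeout: number): number {
  return Math.max(0, clientTimeout - GRACE_TIME);
}

handlerTimeout(10000); // 9000, fine
handlerTimeout(500);   // 0, so the handler times out before it can do anything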
I'm not sure how we want to go about this.
Create a new timer from the old timer?
Wrap the method calls?
Just don't mutate the original timer.
Subtraction is not correct here. This is a UI/UX problem. Don't create new UI/UX problems when solving one UI/UX problem.
On reflection it should be simple enough to have the timeout trigger a second-stage grace timer before cancelling the stream.
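A minimal sketch of that two-stage idea, assuming plain setTimeout timers and an illustrative cancelStream callback rather than the real Timer class:

// Illustrative sketch only: when the call timeout fires, start a second-stage
// grace timer; the stream is only hard-cancelled if it is still open when the
// grace timer also fires.
function startClientTimeout(
  timeout: number,
  graceTime: number,
  cancelStream: (reason: Error) => void,
  timeoutError: Error,
): () => void {
  let graceTimer: ReturnType<typeof setTimeout> | undefined;
  const callTimer = setTimeout(() => {
    graceTimer = setTimeout(() => cancelStream(timeoutError), graceTime);
  }, timeout);
  // The returned cleanup runs when the stream ends normally, clearing both stages.
  return () => {
    clearTimeout(callTimer);
    if (graceTimer !== undefined) clearTimeout(graceTimer);
  };
}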
I've stumbled onto this really weird addition to the RPCClient and RPCServer.
public registerOnTimeoutCallback(callback: () => void) {
  this.onTimeoutCallback = callback;
}
// and in the duplex caller.
void timer.then(
  async () => {
    abortController.abort(timeoutError);
    // added here
    if (this.onTimeoutCallback) {
      this.onTimeoutCallback();
    }
  },
  () => {}, // Ignore cancellation error
);
Looks like it was added by @addievo for testing use. We tend to steer away from adding features like this just for testing, so I don't think this should've been included.
This should be removed; there are better, less invasive ways to tell if the RPC has timed out during testing.
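For instance, a test can assert on the rejection of the call itself instead of registering a callback on RPCClient. This is purely an illustrative sketch; the call helper and error class below are stand-ins, not the real API:

// Stand-ins for illustration only.
class ErrorTimedOut extends Error {}

async function fakeDuplexCall(timeout: number): Promise<string> {
  // Pretend RPC call that is cancelled by a client-side timer.
  return new Promise((_resolve, reject) => {
    setTimeout(() => reject(new ErrorTimedOut('RPC call timed out')), timeout);
  });
}

test('call rejects with a timeout error', async () => {
  await expect(fakeDuplexCall(100)).rejects.toBeInstanceOf(ErrorTimedOut);
});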
When the client gets a remote RPC error, we currently show the remote RPC error AND also the cause being from the agent. This provides context that the error is in fact from the remote agent; however, it's still a bit verbose, and we should figure out how best to present errors... I'm wondering whether it makes sense to invert it, so we show the deepest error and then go up the cause chain, rather than going down the cause chain.
This was from Sunday, and I noticed that when the agent hit an error, and of course we get an RPC remote error, then... it was kind of repetitive to keep saying remote error because any error from the agent is remote. So I imagine rather than:
RemoteError
ActualError
It should be
ActualError
RemoteError
That way the idea is we flip the cause chain, and say, this is the "origin" of the error, and it was later wrapped as whatever it is.
So what is the opposite of cause? Like effect? This might make more sense:
ActualError
then: RemoteError
ActualError
by: RemoteError
ActualError
followed: RemoteError
I'm leaning towards "then".
That seems like a reasonable idea. My first thought was parent for the chain, but that might be a bit too programmer-oriented for readability.
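A rough sketch of what flipping the chain for display could look like; the "then:" label and the formatting here are just one possibility, not a settled format:

// Illustrative sketch: walk the cause chain to the deepest error, print the
// origin first, then each wrapping error labelled with "then:".
function formatErrorChain(error: Error): string {
  const chain: Array<Error> = [];
  let current: unknown = error;
  while (current instanceof Error) {
    chain.push(current);
    current = (current as { cause?: unknown }).cause;
  }
  const [origin, ...wrappers] = chain.reverse();
  const lines = [`${origin.name}: ${origin.message}`];
  for (const wrapper of wrappers) {
    lines.push(`  then: ${wrapper.name}: ${wrapper.message}`);
  }
  return lines.join('\n');
}

// e.g. a RemoteError wrapping an ActualError would render the ActualError
// first, with the RemoteError shown as "then: ...".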
Specification
Currently this is technically supported, in that the server handler can just ignore the timeout already. The problem is that the client-side timeout just hard closes the stream. This leads to a problem where a handler that is expected to time out before responding, but still respond with a message, is not being allowed to complete by the client side.
For example, we have a nodesPing handler that has two expected cases. In case 2 the client is ending the RPC call before the response message is received, leading to an error message instead of the expected RPC message indicating ping failure.
This needs to be addressed in the RPC client. We need to add a grace time between the server and client timeouts. The middleware that updates the handler timeout needs to account for this as well.
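As an illustrative sketch of case 2, after the advisory timeout fires the handler should still be able to send back a failure message, which only works if the client keeps the stream open for the grace time. The handler shape, ctx, and ping helper below are assumptions, not the real nodesPing implementation:

type PingResult = { success: boolean };

// Sketch of a nodesPing-style handler with the two expected cases.
async function nodesPingHandler(ctx: { signal: AbortSignal }): Promise<PingResult> {
  const pinged = await pingNode(ctx.signal);
  // Case 1: the ping succeeded before the advisory timeout.
  if (pinged) return { success: true };
  // Case 2: the advisory timeout fired; respond gracefully with a failure
  // message instead of erroring, relying on the client-side grace time.
  return { success: false };
}

// Stand-in for the real ping: succeeds after a delay unless the advisory
// timeout (ctx.signal) aborts first.
async function pingNode(signal: AbortSignal, delay = 500): Promise<boolean> {
  return new Promise((resolve) => {
    const t = setTimeout(() => resolve(true), delay);
    signal.addEventListener(
      'abort',
      () => {
        clearTimeout(t);
        resolve(false);
      },
      { once: true },
    );
  });
}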
Additional context
Tasks