Allow server RPC handler to respond after timeout from the client (so that client can receive a richer error message)

tegefaulkes commented 10 months ago

Specification

Currently this is technically supported where the server handler can just ignore the timeout already. The problem is when the client side timeout just hard closes the stream. This is leading to a problem where a handler that expected to timeout before responding but still respond with a messages is not being allowed to complete by the client side.

For example, we have a nodesPing handler that has two expected cases.

Ping succeeds and returns with success message.
Ping fails and returns with failure message.

In case 2 the client is ending the RPC call before the response message is received. Leading to an error message in stead of the expected RPC message indicating ping failure.

This needs to be addressed in the RPC client. We need to add a grace time between the server and client timeout. The middleware that updates the handler timeout needs to account for this as well.

Additional context

Related: https://github.com/MatrixAI/Polykey-CLI/issues/37

Tasks

Determine the best way to allow time for the server handler to respond after a timeout.
Implement this option.

CMCDragonkai commented 10 months ago

I don't like the idea of a grace time to resolve this. Can you think of a more deterministic protocol to solve this problem? I thought originally considered this already. That is when the client times out that's a client error, nothing to do with the server. If the server times out that's a server error returned back to the client. If they are the same time timeout then whichever comes first will be the origin of the response. So I disagree with the additional non-deterministic grace time in principle.

tegefaulkes commented 10 months ago

The problem is that the client will always go first because it doesn't deal with the network delay. As it stands the server can't gracefully handle a timeout at all if the client always closes the stream at the same time.

It's not really a grace time I'm suggesting, we just need the client side to have a longer timeout to allow the server to respond. It would be as simple as setting a longer default timeout for the client but the timeout middleware overrides that.

CMCDragonkai commented 10 months ago

Longer shorter is all still non-deterministic. The server should just do a noop if the client times out.

Or we allow the user to pass in a handler for client timeout - and this just logs out a warning message.

The client is allowed to timeout earlier. The server handler is supposed to gracefully terminate without error.

tegefaulkes commented 10 months ago

That's all fine. The main problem is we don't have ANY way for an RPC call to respond gracefully with a message after it times out. It defeats the point of the server handler timeout being advisory if the client just closes the stream before the server can respond.

I'd like for the handler to be able to end gracefully after it times out. The only other option is that it always errors. But we can't even provide a useful error since it's the client cancelling the stream.

The server side should have the main agency over the stream, the client timeout should only ever be the last option fail-safe for ending the stream.

CMCDragonkai commented 10 months ago

I don't understand this. Can you draw a diagram pointing to where the complication is.

tegefaulkes commented 10 months ago

Here is a diagram explaining the problem as it is.

And here is what we expected to happen

Note the following details.

The client call and server handler have separate timeouts.
The server handler timeout is reset using the timeout time of the client caller via the timeout middleware.
The server handler timeout is advisory as a signal that can be ignored. Up to the handler to decide what to do. In this case we treat the timeout as a failure to ping and respond with a message.
Client timeout cancels the RPC stream directly witch ends up cancelling the transport stream.

CMCDragonkai commented 10 months ago

This is a good explanation thanks. Let's do more of this in the future.

My initial thoughts are that this is due to 2 separately mixed up semantics of the client timeout parameter.

As a client I no longer care about the result from the server after 10,000ms.
As a client I am telling the server that it only has 10,000ms to do its work and therefore the server should tell me what failed to complete in that 10,000ms.

And because of this mixed up meaning, we have a UI/UX issue where the client side reports a generic client rpc timeout error, but not a ping failure error from the server side.

This can be difficult to solve. Because the solution depends on what we want at the top level.

If I think semantic 1. is the more important semantic, then the current behaviour as is, is correct barring the server erroring out, which it should not.

If I think semantic 2. is the more important semantic, then we would want to follow your diagram.

Is there a third way that might combine these together?

tegefaulkes commented 10 months ago

I prefer semantic 2, where the timeout handling is done by the server side. It has more information needed to decide how to handle it. That said, we do need a way for the client to handle bad actors. Where after a certain amount of time we give up on the RPC call and just cancel it.

Combining the two would just be having one time for how long we want the server to take and a 2nd time for how long we give up on the whole call. The difference between these two would be considered a grace time.

In my mind implementing both semantics just means having a slightly longer client fail-safe timeout.

CMCDragonkai commented 10 months ago

It seems that your proposed solution is give the client a grace timeout of say 1000ms that allows the server to respond with an error. If the server doesn't respond with that error, then the client should timeout with a generic timeout error?

tegefaulkes commented 10 months ago

Pretty much, but less so a grace timeout and just have two separate call timeout and kill timeout options. Depends how we want to go about it.

CMCDragonkai commented 10 months ago

I think just adding 1000 to the internal client timeout is sufficient. That is if the client timeout is 10000, the real client timeout is 11000, but 10000 is communicated to the server via metadata.

tegefaulkes commented 10 months ago

Sounds good.

tegefaulkes commented 10 months ago

Looking at making the change, it's not quite that simple of just adding 1000 to the client timeout separate from what's sent to the server.

The timer can be the default or specified using the ctx when making the call. It could be a number or an existing timer when provided to the ctx.
The timeout middleware gets the timeout value from the timer's delay. We can't know if was set by the default or passed in.

So there's no real good place to add the grace time to the timer, the only real option is to subtract the value in the timeout middleware. But subtraction leaves us open to some invalid values. How do we want to handle this?

The main problem is that we end up with negative timeout sent to the server. Minimally we should clamp this to 0 but this means any timeout specified that is less than the grace time could end up with some weird implicit behaviour. For example, if a call is made with a timeout less than the grace timer then the handler will timeout before It can do anything.

I'm not sure how we want to go about this.

CMCDragonkai commented 10 months ago

Create a new timer from the old timer?

CMCDragonkai commented 10 months ago

Wrap the method calls?

CMCDragonkai commented 10 months ago

Just don't mutate the original timer.

CMCDragonkai commented 10 months ago

Subtraction is not correct here. This is a UI/UX problem. Don't create new UI/UX problems when solving 1 UI/UX problem.

tegefaulkes commented 10 months ago

On reflection it should be simple enough to have the timeout trigger a 2nd stage grace timer before cancelling the stream.

tegefaulkes commented 10 months ago

I've stumbled onto this really weird addition to the RPCClient and RPCServer.

  public registerOnTimeoutCallback(callback: () => void) {
    this.onTimeoutCallback = callback;
  }

    // and in the duplex caller.
    void timer.then(
      async () => {
        abortController.abort(timeoutError);
        // added here
        if (this.onTimeoutCallback) {
          this.onTimeoutCallback();
        }
      },
      () => {}, // Ignore cancellation error
    );

Looks like it was added by @addievo for testing use. We tend to steer way from adding features like this just for testing so I don't think this should've been included.

This should be removed, there are better less invasive ways to tell if the RPC has timed out during testing.

CMCDragonkai commented 9 months ago

When the client gets a remote RPC error, we currently show the remote RPC error AND also the casue being from the agent. This provides context that the error is in fact from the remote agent, however it's still a bit verbose, we should figure out how best to present errors... and I'm wondering whether it makes sense to invert it, so we show the deepest error, and then going up the cause chain, rather than going down the cause chain.

This was from Sunday, and I noticed that when the agent hit an error, and of course we get an error RPC remote, then... it was kind of repetitive to keep saying remote error cause any error from the agent is remote. So I imagine rather than:

RemoteError
  ActualError

It should be

ActualError
  RemoteError

That way the idea we flip the cause chain, and say, this is the "origin" of the error, and it was later wrapped as whatever it is.

So what is the opposite of cause? Like effect? This might make more sense:

ActualError
  then: RemoteError

ActualError
  by: RemoteError

ActualError
   followed: RemoteError

CMCDragonkai commented 9 months ago

When the client gets a remote RPC error, we currently show the remote RPC error AND also the casue being from the agent. This provides context that the error is in fact from the remote agent, however it's still a bit verbose, we should figure out how best to present errors... and I'm wondering whether it makes sense to invert it, so we show the deepest error, and then going up the cause chain, rather than going down the cause chain.

This was from Sunday, and I noticed that when the agent hit an error, and of course we get an error RPC remote, then... it was kind of repetitive to keep saying remote error cause any error from the agent is remote. So I imagine rather than:
RemoteError
  ActualError
It should be
ActualError
  RemoteError
That way the idea we flip the cause chain, and say, this is the "origin" of the error, and it was later wrapped as whatever it is.

So what is the opposite of cause? Like effect? This might make more sense:
ActualError
  then: RemoteError

ActualError
  by: RemoteError

ActualError
   followed: RemoteError

I leaning towards "then".

tegefaulkes commented 9 months ago

That seems like a reasonable idea. My first thought was parent for the chain but that might be a bit to programmer language for a readability.

MatrixAI / js-rpc