MatrixAI / Polykey-CLI

Polykey CLI - Open Source Decentralized Secret Sharing System for Zero Trust Workflows
https://polykey.com
GNU General Public License v3.0
6 stars 3 forks source link

Server stream progress updates for RPC calls that have long running async operations #264

Open tegefaulkes opened 3 weeks ago

tegefaulkes commented 3 weeks ago

Specification

In the case of vaults clone we have a unary call. The call will be active for the duration of the cloning process which in the background is a compound operation of complex streams. For a very large vault we run into a problem with the client level unary call timing out before we can complete the call it is timing out.

To fix this we need to change the unary call to a server streaming call. We then need to stream over progress updates to reset the timeout timer to prevent the time out. In the case of vaults clone this means sending over the cloning process periodically. Either every 5% or every few seconds or so. Then when the clone is complete we send over details such as the vault name and the new vault ID for that cloned vault. For most cases we can't really know what the absolute progress is. But we can output the amount of arbitrary progress made. This progress must be output on stderr like any other feedback output.

This change to streaming a progress updates need to apply to and client RPC call that waits for a complex or long running task. As far as I can tell this just applies to the vaults cloning, pulling and cross signing claims. But we'll need to investigate this deeper.

Additional context

Tasks

  1. Identify all client RPC calls that wrap complex long running operations
  2. Refactor these RPC calls to be a server stream call that sends over progress updates. The details of the progress isn't actually important so much that they reset the timeout for the call.
linear[bot] commented 3 weeks ago

ENG-397 Server stream progress updates for RPC calls that have long running async operations

CMCDragonkai commented 3 weeks ago

image

Basically we need to stream progress to STDERR to avoid a ErrorRPCTimeout, but those bugs #185 and #198 need to be fixed too.

CMCDragonkai commented 3 weeks ago

@tegefaulkes mentioned that the git cloning process doesn't have an abort system. So if we timeout, it's likely the cloning continues, and the connection is leaked.

CMCDragonkai commented 3 weeks ago

This problem was actually discussed earlier here https://github.com/MatrixAI/Polykey-CLI/issues/74 and also connected to https://github.com/MatrixAI/js-rpc/issues/52 and https://github.com/MatrixAI/js-rpc/issues/57.

@tegefaulkes you even said:

Any commands that attempt a connection such as nodes add or vaults clone could take a long time to complete, especially if we fail to connect. The RPC timeout is 15 seconds with the connection timeout being the same 15 seconds but could take multiple 15 second attempts before giving up. Worst case is about 20/3 attempts to fail a nodesFind. This will cause a RPC timeout before we can respond with the actual error.

However using the grace timer solution is not sufficient. As in https://github.com/MatrixAI/js-rpc/pull/59.

That assumed that the agent side would be able to tell the client side a richer error if the client side timed out.

However we understand that some operations like vaults clone might just take a long time. In which case, timeouts like the regular 15 seconds shouldn't apply like this. Grace timer can still work in the case of true unary operations. But vaults cloning and pulling are not. They are like download operations that may take arbitrary time. So that's why we can identify these operations as something requiring a streaming call instead.

tegefaulkes commented 3 weeks ago

I also need to note that the RPC handlers need to properly handle CTX and abortion properly.