MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0
31 stars 4 forks source link

Reduce the timeout for establishing a Node Connection within the Discovery domain (by adding timer override to `NodeConnectionManager`) #353

Closed emmacasolin closed 2 years ago

emmacasolin commented 2 years ago

Specification

The default timeout for node connections is currently set inside the NodeConnectionManager for 20s. This can only be set when creating a new Polykey agent by setting the nodeConnectionManagerConfig when calling PolykeyAgent.createPolykeyAgent():

// From src/config.defaults
nodeConnectionManagerConfig: {
  connConnectTime: 20000, // this is the relevant setting
  connTimeoutTime: 60000,
  initialClosestNodes: 3,
},

This value is subsequently used for all node connections, regardless of what the node connection is used for or where it originates from. This is a problem for the Discovery domain, where a default timeout of 20s is too long. The Discovery domain calls NodeManager.requestChainData(), which calls NodeConnectionManager.withConnF(), which subsequently creates the node connection that we use. We need to be able to change the timeout for creating this node connection from inside the Discovery domain to something smaller, for example 1-2s. This does not need to be a modifiable option that is available to the user, since it should not affect the behaviour of the Discovery domain, so the timeout can just be a constant.

In order for the Discovery domain to be able to set its own timeout for Node Connections, we need to propagate the timeout through a few different places in the code. Specifically:

  1. The chosen timeout must be set as a constant property of the Discovery class
  2. NodeManager.requestChainData() needs to have an extra parameter for this timeout (alternatively, since this requestChainData() method is not used anywhere else outside of Discovery we could move it into the Discovery class and methods called by it could just directly use the Discovery's timeout property)
  3. NodeConnectionManager.withConnF() (called by requestChainData()) needs an additional timeout parameter
  4. NodeConnectionManager.acquireConnection() (called by withConnF()) needs an additional timeout parameter
  5. NodeConnectionManager.createConnection() (called by acquireConnection()) needs an additional timeout parameter
  6. NodeConnectionManager.establishNodeConnection() (called by createConnection()) needs an additional timeout parameter which defaults to NodeConnectionManager.connConnectTime if not set explicitly - all instances of using NodeConnectionManager.connConnectTime inside this method need to switch over to using this parameter instead

Additional context

Tasks

  1. Create a timeout property for the Discovery class (around 1000-2000ms)
  2. Propagate this timeout through the relevant methods to the establishment of a node connection
CMCDragonkai commented 2 years ago

Was this done? The NodeConnectionManager should have a timeout override, but is the discovery making use of it?

emmacasolin commented 2 years ago

The discovery domain doesn't seem to have been touched in #378 so I don't think so

tegefaulkes commented 2 years ago

This is a more general problem since the discovery system doesn't set timeouts for connection to nodes and identity providers. Identity providers uses cross-fetch that doesn't have a programmed timeout but uses an AbortSignal. Which we are not utilising at the moment.

To solve this in a general sense we need to make use of the timedCancellable decorator for all of the discovery tasks. We need to incorporate a task deadline no matter weather it's connecting to a node or an identity provider. The deadline doesn't just matter for connecting to a node but also the RPC calls. This can be implemented as a GRPC deadline.

Places that require integration with Timer, timedCancellable decorator and HOF:

In the GRPC client stubs this is an example of the call signature. Take notes of the options: Partial<grpc.CallOptions>, this is where you pass the deadline property. This is defined as a type Deadline = Date | number as a Date in the future or a Unix time in milliseconds. If the server doesn't answer before the deadline then we get an exception.

    public nodesClosestLocalNodesGet(request: polykey_v1_nodes_nodes_pb.Node, metadata: grpc.Metadata, options: Partial<grpc.CallOptions>, callback: (error: grpc.ServiceError | null, response: polykey_v1_nodes_nodes_pb.NodeTable) => void): grpc.ClientUnaryCall;

The server should also have a default deadline set for if the client doesn't pass a deadline to the server. This is to prevent a DDOS attack or the case of slow clients holding up the server. According to the GRPC docs the client can cancel at any time. but the server can just close the stream or end with a response. This will likely be a server deadline exceeded exception that the server throws. The longest deadline here will be the server's deadline. If the client deadline is reached then the client throws it's deadline error and cancels the connection. It is possible that both sides throws the exception when the deadline is exceeded.

tegefaulkes commented 2 years ago

1-2 seconds is too strict, especially for contacting identities. It's also not about what the deadline is for establishing connection but deadline for each individual call as well. It's not that the discovery is serialised, We need to make this concurrent so we can have each timeout be long running but do multiple at a time.

So Discovery needs to make use of the TaskManager to do concurrent tasks to do discovery. There should be at least 2 task handlers.

  1. visitNode
  2. visitIdentity

With reference to #328 the decision for which nodes and identities to visit should be proximity to our own gestalt, priority and trust. Where trust is the ACL property.

CMCDragonkai commented 2 years ago

Let's make #445 stage the changes of integration task manager into discovery.

But to actually to solve this issue, we will need to incorporate deadlines to the RPC communication, and also timers and timedCancellable to all the methods in Discovery, Nodes... etc.

CMCDragonkai commented 2 years ago

The main thing to understand here is to realise that Discovery is doing something similar to NodeGraph, only that the GestaltGraph is the database.

tegefaulkes commented 2 years ago

This issue is a pretty old one and the way we solve this has changed by the usage of timedCancellable. This issue is pretty specific to using shorter timeouts than the NodeConnectionManager defaults when making connections to other nodes during discovery.

The core changes for this is handled by #477. but that is a generalised update to the discovery domain. This will be complete once #477 is complete and we set the shorter timeouts for connections during discovery.

tegefaulkes commented 2 years ago

Discovery.processVertex has been updated to pass the ctx and connectionTimeout paramets to all calls within it that establish a connection, Theses are nodeManager.requestChainData and discovery.getIdentityInfo. These have their individual timeouts applied but can still be cancelled via processvertx's cancellation or timeout. I've set connectionTimeout to 2000ms for the task handler. This may need to be adjusted.