Closed emmacasolin closed 2 years ago
Was this done? The NodeConnectionManager
should have a timeout override, but is the discovery making use of it?
The discovery domain doesn't seem to have been touched in #378 so I don't think so
This is a more general problem since the discovery system doesn't set timeouts for connection to nodes and identity providers. Identity providers uses cross-fetch
that doesn't have a programmed timeout but uses an AbortSignal
. Which we are not utilising at the moment.
To solve this in a general sense we need to make use of the timedCancellable
decorator for all of the discovery tasks. We need to incorporate a task deadline no matter weather it's connecting to a node or an identity provider. The deadline doesn't just matter for connecting to a node but also the RPC calls. This can be implemented as a GRPC deadline.
Places that require integration with Timer
, timedCancellable
decorator and HOF:
GitHubProvider
NodeConnectionManager
, NodeConnection
, Proxy
Discovery
GRPCClientAgent
In the GRPC client stubs this is an example of the call signature. Take notes of the options: Partial<grpc.CallOptions>
, this is where you pass the deadline property. This is defined as a type Deadline = Date | number
as a Date
in the future or a Unix time in milliseconds. If the server doesn't answer before the deadline then we get an exception.
public nodesClosestLocalNodesGet(request: polykey_v1_nodes_nodes_pb.Node, metadata: grpc.Metadata, options: Partial<grpc.CallOptions>, callback: (error: grpc.ServiceError | null, response: polykey_v1_nodes_nodes_pb.NodeTable) => void): grpc.ClientUnaryCall;
The server should also have a default deadline set for if the client doesn't pass a deadline to the server. This is to prevent a DDOS attack or the case of slow clients holding up the server. According to the GRPC docs the client can cancel at any time. but the server can just close the stream or end with a response. This will likely be a server deadline exceeded exception that the server throws. The longest deadline here will be the server's deadline. If the client deadline is reached then the client throws it's deadline error and cancels the connection. It is possible that both sides throws the exception when the deadline is exceeded.
1-2 seconds is too strict, especially for contacting identities. It's also not about what the deadline is for establishing connection but deadline for each individual call as well. It's not that the discovery is serialised, We need to make this concurrent so we can have each timeout be long running but do multiple at a time.
So Discovery
needs to make use of the TaskManager
to do concurrent tasks to do discovery. There should be at least 2 task handlers.
With reference to #328 the decision for which nodes and identities to visit should be proximity to our own gestalt, priority and trust. Where trust is the ACL property.
Let's make #445 stage the changes of integration task manager into discovery.
But to actually to solve this issue, we will need to incorporate deadlines to the RPC communication, and also timers and timedCancellable
to all the methods in Discovery, Nodes... etc.
The main thing to understand here is to realise that Discovery is doing something similar to NodeGraph, only that the GestaltGraph is the database.
This issue is a pretty old one and the way we solve this has changed by the usage of timedCancellable
. This issue is pretty specific to using shorter timeouts than the NodeConnectionManager
defaults when making connections to other nodes during discovery.
The core changes for this is handled by #477. but that is a generalised update to the discovery domain. This will be complete once #477 is complete and we set the shorter timeouts for connections during discovery.
Discovery.processVertex
has been updated to pass the ctx and connectionTimeout
paramets to all calls within it that establish a connection, Theses are nodeManager.requestChainData
and discovery.getIdentityInfo
. These have their individual timeouts applied but can still be cancelled via processvertx
's cancellation or timeout. I've set connectionTimeout
to 2000ms for the task handler. This may need to be adjusted.
Specification
The default timeout for node connections is currently set inside the
NodeConnectionManager
for 20s. This can only be set when creating a new Polykey agent by setting thenodeConnectionManagerConfig
when callingPolykeyAgent.createPolykeyAgent()
:This value is subsequently used for all node connections, regardless of what the node connection is used for or where it originates from. This is a problem for the Discovery domain, where a default timeout of 20s is too long. The Discovery domain calls
NodeManager.requestChainData()
, which callsNodeConnectionManager.withConnF()
, which subsequently creates the node connection that we use. We need to be able to change the timeout for creating this node connection from inside the Discovery domain to something smaller, for example 1-2s. This does not need to be a modifiable option that is available to the user, since it should not affect the behaviour of the Discovery domain, so the timeout can just be a constant.In order for the Discovery domain to be able to set its own timeout for Node Connections, we need to propagate the timeout through a few different places in the code. Specifically:
Discovery
classNodeManager.requestChainData()
needs to have an extra parameter for this timeout (alternatively, since thisrequestChainData()
method is not used anywhere else outside ofDiscovery
we could move it into theDiscovery
class and methods called by it could just directly use theDiscovery
's timeout property)NodeConnectionManager.withConnF()
(called byrequestChainData()
) needs an additionaltimeout
parameterNodeConnectionManager.acquireConnection()
(called bywithConnF()
) needs an additionaltimeout
parameterNodeConnectionManager.createConnection()
(called byacquireConnection()
) needs an additionaltimeout
parameterNodeConnectionManager.establishNodeConnection()
(called bycreateConnection()
) needs an additionaltimeout
parameter which defaults toNodeConnectionManager.connConnectTime
if not set explicitly - all instances of usingNodeConnectionManager.connConnectTime
inside this method need to switch over to using this parameter insteadAdditional context
Tasks
timeout
property for theDiscovery
class (around 1000-2000ms)