Closed joshuakarp closed 2 years ago
Consider the possibility that each "NodeConnection" object can be considered a "Data Access Object" that is actually handed to the relevant domains that need to use nodes. Then NodeManager is about maintaining the lifecycle of NodeConnection objects. This also means no other domains should maintain or keep the reference long-running. NodeConnection objects should only be short-running, and a pool of them is kept in the node manager.
This is similar to how database connection pools are done. See how async PostgreSQL access is done in Node.js, or asyncpg in Python. Whether we are connecting to a database or to another keynode, the architecture and style should be the same.
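The pool style can be sketched in TypeScript; all names below are illustrative, not the actual Polykey API. The key property is that callers only ever see a connection for the duration of a callback, never as a long-lived reference.

```typescript
// A toy connection pool (hypothetical names): connections are acquired and
// released inside the manager, mirroring how pg/asyncpg pools work.
type PooledConn = { id: number; closed: boolean };

class ConnectionPool {
  protected conns: Map<number, PooledConn> = new Map();
  protected nextId = 0;

  protected acquire(): PooledConn {
    // a real pool would reuse idle connections; we always create one here
    const conn: PooledConn = { id: this.nextId++, closed: false };
    this.conns.set(conn.id, conn);
    return conn;
  }

  protected release(conn: PooledConn): void {
    conn.closed = true;
    this.conns.delete(conn.id);
  }

  // The only public entry point: the connection lives only for the callback
  public async withConnection<T>(f: (conn: PooledConn) => Promise<T>): Promise<T> {
    const conn = this.acquire();
    try {
      return await f(conn);
    } finally {
      this.release(conn);
    }
  }
}
```

Whether the resource is a database connection or a keynode connection, the calling pattern stays the same.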
On 8/11/21 3:21 PM, Josh wrote:
Requirements of this design
Eventually, we'd like to facilitate some discussion about the structure of the |nodes| domain. This would aim to improve the abstractions and separate shared classes of functionalities.
Additional context
Currently the |nodes| domain has the following structure:
- |NodeManager|: the top-level class that exposes API functions to the outside world. Contains and manages its own instance of |NodeGraph|, as well as handling all instances of |NodeConnection| objects. Currently, these functions are either:
  o public API 'relays' that internally call |NodeGraph| functions
  o connection-related functions (internally set up a |NodeConnection| object and then call |NodeConnection| functions)
- |NodeGraph|: a data structure containing the implementation of Kademlia, used to distribute and store the node ID -> node address mappings in a keynode
- |NodeConnection|: a class that handles a single client -> server connection between 2 keynodes (entry-point for agent to agent GRPC functions and message creation)
|NodeManager| is currently the only public-facing entrypoint to the nodes domain. Both |NodeGraph| and |NodeConnection| could be considered as a layer below |NodeManager|.
Roger and I were briefly discussing this structure when the idea of dependency injecting |NodeGraph| into |NodeManager| was brought up (in line with the other domains of Polykey). However, |NodeGraph| is currently taking |NodeManager| as a dependency in its constructor (passed as |this| when |NodeGraph| is constructed in |NodeManager|'s constructor):
// NodeManager.ts
constructor({ ... }) {
  this.nodeGraph = new NodeGraph({
    nodeManager: this,
    ...
  });
}

This is required such that |NodeGraph| can call the connection-related functionality specified in |NodeManager|.
A solution to resolve this mutual dependency would be to "abstract the commonality". We could potentially remove the connections functionality from |NodeManager| entirely, replacing it into a |NodeConnections| class (basically a manager specifically for the lifetime of |NodeConnection| objects).
But then we could ask ourselves what the point of this is? That is, what is |NodeManager| actually responsible for anymore?
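The "abstract the commonality" idea can be sketched as follows; the class shapes are hypothetical and only illustrate how extracting connection handling removes the need for |NodeGraph| to take |NodeManager| in its constructor.

```typescript
// Hypothetical sketch: connection handling extracted into its own class
// breaks the NodeManager <-> NodeGraph cycle.
class NodeConnections {
  public async getConnection(nodeId: string): Promise<string> {
    // would establish or reuse a NodeConnection; stubbed as a string here
    return `conn-to-${nodeId}`;
  }
}

class NodeGraph {
  // depends only on the extracted commonality, not on NodeManager
  public constructor(protected connections: NodeConnections) {}
  public async findNode(nodeId: string): Promise<string> {
    return this.connections.getConnection(nodeId);
  }
}

class NodeManager {
  public constructor(
    protected connections: NodeConnections,
    protected nodeGraph: NodeGraph,
  ) {}
}

// composition happens at the top level rather than inside NodeManager
const connections = new NodeConnections();
const nodeGraph = new NodeGraph(connections);
const nodeManager = new NodeManager(connections, nodeGraph);
```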
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MatrixAI/js-polykey/issues/225, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE4OHM5TT6QG75PBJTS4VLT4ICG7ANCNFSM5B5QZ7AA.
I think we should follow an old-school "Connection Pool" design pattern. Basically how client-server databases like PostgreSQL and MySQL are provided to web applications.
To add to this as well, currently we don't check for "liveness" of an active NodeConnection object. That is, while the networking domain cleans up any connections that have become inactive or offline, the nodes domain doesn't do such a thing. We do some kind of "activity" check in NodeManager::pingNode (checking an already active connection to ascertain whether the node is still online), but this doesn't clean any objects up, nor does it perform this operation on a regular basis.
Obviously, if we move to a connection pool pattern, then this will be handled as part of that.
Furthermore, we need to be correctly handling networking errors. From the nodes CLI fixes MR:

- Networking errors are strangely managed here. The networking domain works a bit differently from the rest of the other domains. At any point in time, an exception may be thrown if the connection has timed out. In the networking domain this is managed by exception handlers that are attached to the relevant connections. The NodeManager cannot expect to "catch" these networking exceptions on a single control flow branch. They must attach handlers to each NodeConnection. This means the "handling" of exceptions itself is a stateful live operation. You must then design and consider the NodeManager as a state machine managing multiple connection state machines. Ensure that all errors caught are network domain level errors; you shouldn't be catching naked UTP_ETIMEDOUT errors.

Some notes regarding this from our sprint meeting a couple of weeks ago (9th August): NodeConnection objects are interacting with the forward and reverse proxies. Relating to my previous comment, if the Connection object in the networking domain has thrown an error (and triggers some kind of event in the proxies), then we need to also handle/delete the corresponding NodeConnection object. If we use the connection pool pattern, the above points need to be considered.
So we already have the concept of a "connection pool", whereby no other domains maintain or hold a long-running reference to a NodeConnection. All instances of NodeConnection are contained within a NodeConnectionMap type in NodeManager (see https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/205#note_699911271 for details of this map structure - it's used to enforce locking when a connection is established). They're created as needed, and destroyed on NodeManager shutdown (as per previous comments, it also needs to be implemented such that they are also destroyed upon connection closure in the proxies).
I've been thinking about the relationship of the nodes domain relative to other domains.
Currently the nodes domain gets injected into other domains to be used, and those domains will often ask the nodes domain for node connections to work with.
I realise that there's a hierarchy here. The nodes domain is a lower-level domain compared to vaults, gestalt graph and related domains. High and low level relate to how user-facing a domain is. User operations include CLI and API calls to the PK agent, which are then handled by a gRPC server which will call these domains.
For example if vaults domain were at the same level as nodes domain, then it would mean that the call handler would interact with both domains directly and then pass any relevant objects from the nodes domain to the vaults domain. This is how I've often done my database structure like:
doSomething(dbConn, p1, p2)
The reason to do this was to be able to share db connection instance with multiple operations and to potentially allow the controller to control nested contexts relating to the connection such as a transaction context across multiple operations. Furthermore connection resource lifetime was tied to the controller handling rather than to specific model operations.
To a significant extent multiple domains may be involved in working against the same node in the same way multiple models may be working with the same DB connection.
If the nodes domain is considered something lower level that is injected into the different domains, it would be like constructing models by injecting a db dependency in them. This works as long as we think that the "controller" which in our case is the gRPC API handlers doesn't have to control nested contexts for a given node connection used by multiple domains.
This would mean that domains like vaults are much richer and perform controller-like functionality.
If we find out that there need to be operations done across domains, how do we deal with this at the moment? Either we abstract another domain on top and then use that, like we are doing with the vaults domain on top of the nodes and db domains, or we eventually flatten the hierarchy and do a lot more operations at the gRPC handlers.
The file structure right now lends towards a flattened hierarchy, but the internal connections through DI indicate a more complex hierarchy. It's usually better to have a more flattened hierarchy and leave the composition at the very end for maximum flexibility, but the trade-off is more overhead and boilerplate.
I'll need to see how the vaults use the nodes domain along with other domains to really get a good feel of how to proceed.
@joshuakarp I think you have been focusing on nodes domain internal architecture and I'm also concerned about the relationship between nodes and other domains and how it will all work.
Some notes regarding this from our sprint meeting a couple of weeks ago (9th August):

- we can view the networking domain's components as singleton instances
- that is, they act as an internal service that we share across our other domains
- when we handle errors with respect to connections, we can't expect a single control flow like we would with a regular sequential program/function (e.g. do something and catch an error)

With connections, we have 2 kinds of errors:

- During construction (we can do a try-catch here - e.g. we currently do this when we ping a node, in the connection's start function)
- During liveness of the connection object (these need to be handled with error handlers)

Error handlers need to be attached to connection objects to handle this second type of error. Furthermore, NodeConnection objects are interacting with the forward and reverse proxies. Relating to my previous comment, if the Connection object in the networking domain has thrown an error (and triggers some kind of event in the proxies), then we need to also handle/delete the corresponding NodeConnection object. If we use the connection pool pattern, the above points need to be considered.
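The two error surfaces can be sketched with a mock connection class (not the real networking domain): construction errors are catchable on one control-flow branch, while liveness errors only arrive through a handler registered up front.

```typescript
// Mock connection illustrating the two kinds of errors (illustrative only).
class MockConnection {
  protected handlers: Array<(e: Error) => void> = [];
  public constructor(failOnStart: boolean) {
    // 1. construction error: thrown synchronously, try-catch works
    if (failOnStart) throw new Error('ErrorConnectionStart');
  }
  public onError(h: (e: Error) => void): void {
    this.handlers.push(h);
  }
  // 2. liveness error: simulates the underlying socket dying later
  public simulateTimeout(): void {
    for (const h of this.handlers) h(new Error('ErrorConnectionTimeout'));
  }
}

const events: string[] = [];
// construction: a plain try-catch suffices
try {
  new MockConnection(true);
} catch (e) {
  events.push('caught-on-start');
}
// liveness: a handler must be attached before the error happens
const conn = new MockConnection(false);
conn.onError(() => events.push('caught-live'));
conn.simulateTimeout();
```

This is why the manager has to be treated as a state machine managing connection state machines, rather than a sequential function with a single catch.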
Handling NodeConnection errors propagated from the network domain will be handled in #224.
I've turned this issue into an epic and added all the related issues that can all be solved in one go with some restructuring of the nodes domain. This was based on the discussion of the nodes domain test failures that have been happening since we reviewed the CARL, session management and async-init fixes (#281) applied there.
I'll be fleshing out the specification here in the issue description regarding what we want to do.
Spec has been updated, please review. This is going to be the biggest loose end before release.
Implementation of this will require @tegefaulkes and @joshuakarp.
I've been playing around with some timer mocking for the existing tests (https://jestjs.io/docs/timer-mocks). Ideally I could get these reduced and working before we start refactoring, such that it's easier to diagnose a test failure from implementation as opposed to attempting to diagnose issues with this after refactoring nodes.
We can see the following test using runAllTimers to advance the setTimeout call (within sleep):
For the ping test in NodeManager.test.ts, it's currently testing 3 situations in one test (such that we only need to set up a single PolykeyAgent instance):
- false (requires a connection attempt to timeout)
- true (connection can be established)
- false (existing connection needs to timeout)
What's interesting with the ping test though, is that we actually require this physical amount of time to pass, such that the underlying network connection can timeout. Trying to figure out how to mock that lower-level timeout.
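Conceptually, fake timers replace real scheduling with an explicit queue that the test flushes on demand; this hand-rolled stand-in (not jest itself) shows why runAllTimers lets a sleep-based test finish without real time passing.

```typescript
// A minimal stand-in for jest's fake timers (illustrative only): scheduled
// callbacks are queued instead of waiting, and flushed by runAllTimers.
class FakeTimers {
  protected queue: Array<() => void> = [];
  public setTimeout(fn: () => void, _ms: number): void {
    this.queue.push(fn);
  }
  public runAllTimers(): void {
    while (this.queue.length > 0) this.queue.shift()!();
  }
}

const timers = new FakeTimers();
let pinged = false;
// a sleep-like call that would otherwise block the test for a minute
timers.setTimeout(() => {
  pinged = true;
}, 60000);
// no real time passes; the test advances the clock explicitly
timers.runAllTimers();
```

The open question above is that the network-level timeout happens below JavaScript timers, which is why jest's timer mocking alone doesn't cover it.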
We'll be merging #289 before beginning work on this refactoring effort. The seed nodes work is simple enough to merge, such that we don't over-complicate it with this refactoring.
As a very rough starting point, I'm estimating this would take about a week and a half (including testing and review). Currently scheduling it after the testnet deployment, from Monday Dec 6th to Wednesday Dec 15th.
As another supporting task, @emmacasolin could also be assigned to figuring out the timer mocking for the nodes tests with this issue (this already has some relation to #264 too). It would also be a reasonable introduction to the nodes domain.
@tegefaulkes will most likely be on leave when we're tackling this issue, so for now have unassigned him.
For our NodeConnection structure using the ObjectMap locking system (acquiring a lock on creation), this will also need to be added for destruction of NodeConnections (i.e. acquire the lock before destruction). Also discussed here regarding vaults: https://github.com/MatrixAI/js-polykey/pull/266#issuecomment-984262113
There was also a point brought up whether it would be worthwhile to refactor the nodes domain to use the start-stop pattern, instead of the current create-destroy-start-stop. From https://github.com/MatrixAI/js-polykey/pull/289#issuecomment-992016820, we've had to move some of the "starting" functionality of the nodes domain directly into the call to PolykeyAgent.start (as a result of some necessary dependencies on the ForwardProxy being started prior to these calls). If we were to move to start-stop, our call to NodeManager.start (and other such classes) would be purely found in PolykeyAgent.start (instead of internally called from within createPolykeyAgent with the call to NodeManager.createNodeManager).
Yes, it just means you would be doing new NodeManager and related in the creation phase, but then actually do nodeManager.start() in the start phase. Similar to the networking domain objects.
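The difference between the two lifecycles can be sketched with two minimal, hypothetical versions of the class:

```typescript
// create-destroy-start-stop: an async factory hides starting inside creation
class NodeManagerCDSS {
  public started = false;
  protected constructor() {}
  public static async createNodeManager(): Promise<NodeManagerCDSS> {
    const nodeManager = new NodeManagerCDSS();
    await nodeManager.start(); // side effects happen during creation
    return nodeManager;
  }
  public async start(): Promise<void> {
    this.started = true;
  }
  public async stop(): Promise<void> {
    this.started = false;
  }
}

// start-stop: plain construction; PolykeyAgent.start calls start() explicitly
class NodeManagerSS {
  public started = false;
  public async start(): Promise<void> {
    this.started = true;
  }
  public async stop(): Promise<void> {
    this.started = false;
  }
}
```

With start-stop, ordering constraints (like the ForwardProxy being started first) move to one place: the top-level start call.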
A quick note for myself/@emmacasolin: when looking into connection timeouts for testing, have a read of this comment too https://github.com/MatrixAI/js-polykey/pull/289#issuecomment-992080275.
I've been having a think about this. Right now NodeManager, NodeGraph, and NodeConnection all use the create-destroy-start-stop pattern. Given that NodeGraph is the only domain that actually has any database state, I believe everything else could be safely changed to using start-stop. (Although NodeGraph is an encapsulated dependency of NodeManager, I imagine this means that we'd need some kind of destroy call in the top-level NodeManager, such that it can propagate down to NodeGraph. This is how it's currently structured.)
Specifically with respect to the vaults domain, this refactoring effort shouldn't actually cause too many changes there.
Previously, the vaults domain was (incorrectly) performing operations on the GRPCClientAgent directly, instead of calling internal NodeConnection functions (calling a getClient function, which returned the client, and then performing operations on the client directly):
const nodeConnection = await this.nodeManager.getConnectionToNode(
pullNodeId!,
);
const client = nodeConnection.getClient();
// ... do operations on client
This functionality is actually what we intend to move towards, and as such, shouldn't require too many changes. We'll instead just need to wrap the operations on the client inside NodeConnection's withConnection higher-order function.
Notably, in the vaults domain, the following will need to be checked and changed:
- pullVault
- cloneVault
The actual API of the vaults domain remains unchanged though. We simply inject NodeConnectionManager instead of NodeManager.
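Under that shape, a function like pullVault might look roughly like this; every signature below is a stub for illustration, not the real agent API.

```typescript
// Stubbed client: stands in for the real GRPCClientAgent surface
type Client = { vaultsGitPackGet: (vaultId: string) => Promise<string> };

class NodeConnectionManager {
  public async withConnection<T>(
    nodeId: string,
    f: (client: Client) => Promise<T>,
  ): Promise<T> {
    // a real implementation would acquire a pooled NodeConnection here
    const client: Client = {
      vaultsGitPackGet: async (vaultId) => `pack-${vaultId}-from-${nodeId}`,
    };
    try {
      return await f(client);
    } finally {
      // the connection TTL would be reset / reference released here
    }
  }
}

// e.g. what a refactored pullVault could look like: the client is never
// held outside the callback
async function pullVault(
  ncm: NodeConnectionManager,
  nodeId: string,
  vaultId: string,
): Promise<string> {
  return ncm.withConnection(nodeId, async (client) => {
    return client.vaultsGitPackGet(vaultId);
  });
}
```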
After some discussion, a proposed architecture diagram:
┌──────────┐
│ Sigchain │
└────▲─────┘
│
│
│
┌──────┴──────┐
│ NodeManager ├──────────────────┐
└──────┬──────┘ │
│ │
│ │
│ │
┌───────────▼───────────┐ │
┌──────────────┐ │ │ │
│ ForwardProxy ◄───────┤ NodeConnectionManager │ │
└──────────────┘ │ │ ┌─────▼─────┐
│ ┌──────────────┐ ├───────► NodeGraph │
┌──────────────┐ │ │NodeConnection│ │ └───────────┘
│ ReverseProxy ◄───────┤ └──────────────┘ │
└──────────────┘ │ │
└─────▲─────▲────▲──────┘
│ │ │
│ │ │
┌──────────┘ │ └─────────────┐
│ │ │
┌────┴────┐ ┌──────┴─────┐ ┌─────────┴──────────┐
│Discovery│ │VaultManager│ │NotificationsManager│
└─────────┘ └────────────┘ └────────────────────┘
In this architecture some desired APIs:
- NodeConnectionManager - findNodes, getClosestGlobalNodes, getClosestLocalNodes: these are extracted from NodeManager into NodeConnectionManager, and it also facilitates hole punching
- NodeManager - focuses on nodes-related high level functionality like sigchain operations and claiming operations, and mediates access to the node graph; it has to use NodeConnectionManager when establishing connections to do work
- NodeGraph - stays focused on database operations as much as possible, and is used by both NodeManager and NodeConnectionManager
- the agent service would use NodeManager; for receiving instructions about hole punching it would use NodeConnectionManager, so some handlers may end up using both
- the client service would use NodeManager or NodeConnectionManager or both, not sure as it needs to deal with both the sigchain and traversing the P2P network
- all other domains should only be using NodeConnectionManager to acquire the connections and do work on them

Order of operations:
- NodeGraph - CDSS - because this is now shared between NodeConnectionManager and NodeManager, this has to be created outside of NodeManager as a singleton and shared - created first
- NodeConnectionManager - SS or CDSS or CD - created second
- NodeManager - SS or CDSS or CD - created third

Note that NodeConnectionManager requires ForwardProxy and ReverseProxy, and these are started with knowledge only available after starting the GRPC servers. But these can be passed during creation.
@joshuakarp we will need specced out APIs for each of the nodes classes.
Some notes from our discussion about comparing its API to the vaults domain:
1. Review ID Usage
2. NodeConnectionManager Refactoring
3. Networking and NodeConnection Errors
4. Transaction Usage in NodeGraph
5. NodeGraph Testing
6. P2P Integration Testing Including Hole Punching
NodeConnection: GRPCClient -> ForwardProxy -> ReverseProxy -> GRPCServer
class Domain {
  // retries are domain specific, we wouldn't auto retry for the node connection manager
  public async doSomething() {
    await nodeConnectionManager.withConnection(async (conn) => {
      await sleep(1000);
      // the conn is actually "destroyed" because there was a state change in the underlying socket
      await conn.X(); // this throws a js-async-init exception ErrorNodeConnectionNotRunning()
    });
  }
}
class NodeConnectionManager {
  connections: Map<NodeId, NodeConnection> = new Map();
public withConnection() {
const conn = new NodeConnection(onDestroy);
}
}
class NodeConnection {
// This may need to make use of locking
async createNodeConnection(connections) {
this.connections = connections;
const client = new GRPCClientAgent(onError);
}
// You have to lock this as well, similar to the Vault lifecycle
async destroy() {
// delete itself from the node connections state
this.connections.delete(nodeId);
// this is an alternative handler passed in from NodeConnectionManager
this.onDestroy();
}
  onError = async () => {
    await this.destroy();
  };
}
class GRPCClientAgent extends GRPCClient {
}
class GRPCClient {
async createGRPCClient(onError) {
this.client.watchConnectivityState(currentState, deadline, async () => {
await onError();
});
}
}
// The vaults API is an interesting comparison as they have a low level API
public vaultManager.createVault(): Promise<void>
protected vaultManager.openVault()
protected vaultManager.closeVault()
public vaultManager.destroyVault(): Promise<void>
// If we follow `NodeConnectionManager`, we may also expose a higher level callback API
vaultManager.withVault(async (v1) => {
vaultManager.withVault(async (v2) => {
vaultOps.mv(v1, v2);
});
});
// Example of using this with vault ops which takes the vaults
vaultOps.mv(v1, v2) {
// dual contexts, nested commits, similar to python's `with`
// watch out, what happens if v1 and v2 are the same vault?
withCommit([v1, v2], async ([fs1, fs2]) => {
});
}
Specification
The Nodes domain is an integration of all the other domains that enables Polykey to be a distributed and decentralised application. Right now its design is causing its tests and its code to be quite brittle as other domains are being iterated on. This is slowing down our development process. At the same time, it has a number of functionalities that take time to run, such as pinging; these are now causing very slow tests, up to 60 or 80 seconds for a single test to just do pinging.
The nodes domain requires a significant refactoring to ensure stability, robustness and performance going into the future.
After a small review over the nodes codebase, here is the new proposed structure.
- NodeConnectionManager
  - maintains each NodeConnection in a pool of node connections
  - provides withConnection to run an asynchronous callback that uses a NodeConnection
  - NodeManager's "wrapper" functions that perform NodeConnection operations would instead create their own internal functions (within their own domains) that use this higher-order function withConnection to perform their connection functionality; NodeManager should not have any of this anymore
  - maintains a TTL for each NodeConnection and closes the connection after timeout
- NodeConnection
  - wraps the GRPCClientAgent, and provides some lifecycle features specific to the node connection
  - may extend GRPCClientAgent, or use an es6 proxy or wrapper class with a client property
- NodeManager
  - uses NodeGraph (optional dependency) and handles node operations such as claimNode
- NodeGraph
  - stores the NodeId -> NodeAddress mappings in a keynode

Both NodeManager and NodeConnectionManager should be exposed at the high level in PolykeyAgent and provided to createClientService.

Furthermore, the following lists the current usage of NodeManager for NodeConnection operations (i.e. the places where NodeConnectionManager would instead need to be injected):

- Discovery
- NodeManager
- NodeGraph - this is a special case that will need some potential refactoring, see https://github.com/MatrixAI/js-polykey/issues/225#:~:text=in%3A%20%23224%20(comment)-,NodeGraph,-NodeGraph%20is%20a
- NotificationsManager
- VaultManager

The following diagram depicts these dependencies:
NodeManager

The NodeManager manages nodes, but does not manage the node connections. It is expected that NodeManager requires NodeConnectionManager, which is shared as a singleton with other domains. Operations relating to manipulating nodes should be on NodeManager, such as claimNode and related operations. Most of the time this will be called by GRPC handlers, but some domains like discovery might be using the nodes domain. It is also what encapsulates access and mutation to the NodeGraph.

NodeConnectionManager
The NodeConnectionManager manages a pool of node connections. This is what is being shared as a singleton to all the domains that need to contact a node and call their handlers. For example, VaultManager would not use NodeManager but only NodeConnectionManager. This also includes the notifications domain. The connections map is shared with each node connection.
Notice the lock being used to ensure that connection establishment and connection closure are serialised. Refer to https://gist.github.com/CMCDragonkai/f58f08e7eaab0430ed4467ca35527a42 for an example. This pattern is also used in the vault manager.
A draft mock of the class:
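A minimal sketch of such a class, assuming a simple per-node mutex for illustration (the real implementation would use the RWLock utilities and the async-init lifecycle):

```typescript
// Toy mutex: serialises async critical sections by chaining promises
class Mutex {
  protected last: Promise<void> = Promise.resolve();
  public async withLock<T>(f: () => Promise<T>): Promise<T> {
    const run = this.last.then(f);
    this.last = run.then(() => undefined, () => undefined);
    return run;
  }
}

type NodeConnection = { nodeId: string; destroyed: boolean };

class NodeConnectionManager {
  protected connections: Map<string, { lock: Mutex; conn?: NodeConnection }> =
    new Map();

  public async withConnection<T>(
    nodeId: string,
    f: (conn: NodeConnection) => Promise<T>,
  ): Promise<T> {
    let entry = this.connections.get(nodeId);
    if (entry == null) {
      entry = { lock: new Mutex() };
      this.connections.set(nodeId, entry);
    }
    const e = entry;
    // connection establishment is serialised under the per-node lock
    const conn = await e.lock.withLock(async () => {
      if (e.conn == null) e.conn = { nodeId, destroyed: false };
      return e.conn;
    });
    return f(conn);
  }

  public async destroyConnection(nodeId: string): Promise<void> {
    const e = this.connections.get(nodeId);
    if (e == null) return;
    // closure is serialised under the same lock as establishment
    await e.lock.withLock(async () => {
      if (e.conn != null) e.conn.destroyed = true;
      this.connections.delete(nodeId);
    });
  }
}
```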
The withConnection call is the only way to acquire a connection to be used by other domains. If this is to be similar to the VaultManager, that means that VaultManager.withVault would need to exist.
The locking side effects could be done in NodeConnectionManager or inside the NodeConnection. However, since this problem also exists in the VaultManager, I believe we want to do the side effects as much as possible in the NodeConnectionManager. If locking during creation and destruction needs to occur, then it has to occur via callbacks that are passed into the NodeConnection.

NodeConnection

There may be a NodeConnectionInternal to have the actual methods, while NodeConnection may be a public interface similar to Vault and VaultInternal.

Usage
Any call to another keynode has to go through the NodeConnection. It is expected that callers will perform something like this:

Such usage would occur in both NodeManager and other domains such as VaultManager.
Note that the conn.client.doSomething() may be conn.doSomething() instead. This depends on how NodeConnection is written. Either way it is expected that doSomething is the same gRPC call available in GRPCClientAgent.
This means it is up to the domains to manage and handle gRPC call behaviour, which includes the usage of async generators, metadata and exceptions. Refer to #249 for details on this. The reason we are doing this is that abstractions on the gRPC call side are domain-specific behaviour. We wouldn't want to centralise all of the domain gRPC call abstractions into the nodes domain; that would be quite brittle.
This means the vaults domain could, if they wanted to, create their own wrapper over the connection object such as:
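For illustration, such a vaults-side wrapper might look like this; the method names are hypothetical, and the point is only that the gRPC abstraction lives in the vaults domain rather than in the nodes domain.

```typescript
// Stubbed agent client surface (illustrative, not the real GRPCClientAgent)
type AgentClient = { vaultsScan: (nodeId: string) => Promise<string[]> };

// A vaults-domain wrapper over the raw connection's client
class VaultConnection {
  public constructor(protected client: AgentClient) {}
  // vaults-specific abstraction over the raw gRPC call
  public async scanVaults(nodeId: string): Promise<string[]> {
    return this.client.vaultsScan(nodeId);
  }
}
```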
Right now we shouldn't do any gRPC abstractions until #249 is addressed.
Lifecycle
The withConn method provides a "resource context". It's a higher-order function that takes the asynchronous action that is to be performed with the connection. It has a similar pattern to our poll, retryAuth, transact, transaction, _transaction wrapper functions. The advantage of this pattern is that the node connection will not be closed (by the TTL) while the action is being executed.
The withConn will need to use the RWLock pattern as provided in src/utils.ts. It has to acquire a read lock when executing the action. This enables multiple withConn calls to be executing concurrently without blocking each other. At the end of each withConn, the TTL should be reset.
When the TTL has expired, it may attempt to close the connection. To do this, a write lock is attempted. If the lock is already locked, then we skip. If we acquire the write lock, we can close the connection.
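The TTL/RWLock interplay can be sketched with a hand-rolled counting lock (illustrative only; the real RWLock lives in src/utils.ts): actions hold read locks and can overlap, while the TTL reaper only closes the connection if it can take the write lock.

```typescript
// Toy counting read-write lock: write acquisition is only attempted, not awaited
class RWLock {
  protected readers = 0;
  public acquireRead(): void {
    this.readers++;
  }
  public releaseRead(): void {
    this.readers--;
  }
  public tryAcquireWrite(): boolean {
    return this.readers === 0;
  }
}

class TtlConnection {
  public closed = false;
  protected lock = new RWLock();

  public async withConn<T>(f: () => Promise<T>): Promise<T> {
    this.lock.acquireRead();
    try {
      return await f();
    } finally {
      this.lock.releaseRead();
      // the TTL would be reset here
    }
  }

  // called when the TTL fires: skip if any action is still running
  public reap(): void {
    if (this.lock.tryAcquireWrite()) this.closed = true;
  }
}
```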
If the client were to be closed or terminated due to other reasons, then an exception should be thrown when accessing the client or its methods. This would be ErrorNodeConnectionNotRunning or ErrorGRPCClientAgentNotRunning or ErrorGRPCClientNotRunning.
Other domains should not be keeping around references to NodeConnection; it should only be used inside the withConn context.

Hole Punching Relay
Hole punching is done between the ForwardProxy and ReverseProxy. Hole punching is not done explicitly in the NodeConnectionManager or NodeConnection. Instead, hole punching relay is what is done by NodeConnectionManager. This means you relay a hole punching message across the network so that the target node will perform reverse hole punching in case we need it. This process should be done at the same time as a connection is established.
This means NodeConnectionManager requires the NodeGraph, the same NodeGraph that is used by NodeManager. The graph is needed to do optimal routing of the hole punching relay message.
This means NodeConnectionManager would need to be used by createAgentService, as the current keynode may be in the middle of a chain of relaying operations.

Public Interface
The public interface of NodeConnection can be a limited version of the internal class. This can hide methods like destroy from the users of NodeConnection. This ensures that only NodeConnectionManager can destroy the connection.

Dealing with Termination/Errors
The underlying GRPCClientAgent may close/terminate for various reasons. In order to capture this behaviour, you must hook into the state changes in the Client object. This will require changes to src/grpc/GRPCClient.ts, as it currently does not expose the Client object. Discussions on how to do this are in: https://github.com/matrixai/js-polykey/issues/224#issuecomment-966770905

NodeGraph
NodeGraph is a complex data structure. It appears to be needed by both NodeManager and NodeConnectionManager, but itself also needs to perform operations on NodeManager. Whenever there is mutual recursion, it can be factored out as another common dependency: "abstract the commonality". However, if this is not suitable, then it can be done by passing NodeManager as this into NodeGraph. However, if NodeConnectionManager needs NodeGraph as well, then this is not possible unless NodeConnectionManager was also produced by NodeManager.
Further investigation is needed into the NodeGraph to understand how we can turn it into a data structure. Perhaps we can use a pattern of hooks, where NodeGraph exposes a number of hook points which can be configured by NodeManager and thus allow NodeGraph to "callback" NodeManager and NodeConnectionManager.
If NodeGraph can be made purely a data structure/database, then:
- getClosestLocalNodes and getClosestGlobalNodes could be part of the NodeManager, and not the NodeGraph
- NodeGraph would not need to call back into NodeManager, and instead you just plug it with actual values
There's only 3 places where NodeGraph is calling out to the world. All 3 can be done in NodeManager instead. That means NodeGraph no longer needs NodeConnectionManager
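The hook idea could look roughly like this; the hook names are hypothetical, and the point is that NodeGraph stays a pure data structure while NodeManager plugs in any behaviour that reaches out to the world.

```typescript
// Hook points that NodeManager would configure (illustrative names)
type NodeGraphHooks = {
  // called whenever the graph needs to reach out to the network
  pingNode: (nodeId: string) => Promise<boolean>;
};

class NodeGraph {
  public constructor(protected hooks: NodeGraphHooks) {}
  // a graph operation that "calls back" via the injected hook rather than
  // depending on NodeManager directly
  public async refreshBucket(nodeIds: string[]): Promise<string[]> {
    const alive: string[] = [];
    for (const id of nodeIds) {
      if (await this.hooks.pingNode(id)) alive.push(id);
    }
    return alive;
  }
}
```

Swapping the hook for a stub also makes the data structure testable without any networking.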
.

Testing

For NodeConnectionManager and NodeConnection, it's essential that we don't duplicate tests for GRPCClientAgent, as that would be handled by tests/agent. They must focus on just testing the connection related functionality. This is where mocks should be useful, since we want to delay integration testing until the very end.
There are 4 kinds of mocks we have to use here:
- Function mocks - as in tests/bin/utils.retryAuth.test.ts, function mocks can be a very useful way of providing a dummy function that we can later instrument. This can be useful if we pursue a hook pattern with NodeGraph.
- Module mocks - as in tests/bin/utils.retryAuth.test.ts (jest.mock('prompts')); this is high level mocking that should only be used when no other methods can be done. This is more complicated, but you can mock anything this way.
- Module/class mocks.
The tests for the nodes domain should be robust, reproducible, not flaky and complete within 10 - 30 seconds.
Larger integration tests should be done at a higher level, which will help us achieve #264.
This should ensure that nodes domain is more stable and does not break every time we change stuff in other domains.
Additional context
- #281 - Async Init Fixes that are applied in the CARL & Session Management MR: https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213
- #194 - Seed Nodes Integration, application of DNS resolution is relevant here
- #249 - GRPC API Review
- #264 - Concurrent Testing Scalability
Tasks
- Create NodeConnectionManager
- Split NodeConnection handling into NodeConnectionManager and NodeManager
- Add locking to the destroy call of NodeConnection (the same process as construction): https://github.com/MatrixAI/js-polykey/issues/225#issuecomment-984262851
- Integrate NodeGraph with NodeManager and NodeConnectionManager
- NodeConnection and NodeConnectionManager testing. Remove any tests that test agent service handling or forward proxy and reverse proxy usage. Rely on the other domain tests to handle this.
- Update the other domains that use NodeConnection to acquire their connections from NodeConnectionManager.