Refactoring Nodes Domain: NodeConnection & NodeConnectionManager and Fixing Bugs

Specification

The Nodes domain is an integration of all the other domains that enables Polykey to be a distributed and decentralised application. Right now it's design is causing its tests and its code to be quite brittle as other domains are being iterated on. This is slowing down our development process. At the same time, it has a number of functionalities that time to do such as pinging, these are now causing very slow tests up to 60 or 80 seconds for a single test to just do pinging.

The nodes domain require a significant refactoring to ensure stability and robustness and performance going into the future.

After a small review over the nodes codebase, here is the new proposed structure.

NodeManager
- requires NodeConnectionManager
- encapsulates NodeGraph (optional dependency)
- provides high-level nodes management such as claimNode
NodeConnectionManager
- manages the lifecycle of NodeConnection in a pool of node connections
- provides a context higher-order function withConnection to run an asynchronous callback that uses NodeConnection
  - this means that all domains that currently rely on the NodeManager's "wrapper" functions that perform NodeConnection operations would instead create their own internal functions (within their own domains) that use this higher-order function withConnection to perform their connection functionality. NodeManager should not have any of this anymore
- applies a TTL to each NodeConnection and closes the connection after timeout
- resets TTL on every connection use
- uses a lock to ensure connections cannot be timed out when they are in-use
NodeConnection
- represents a connection to the agent service of another keynode
- wrapper around GRPCClientAgent, and provides some lifecycle features specific to the node connection
- class extension of GRPCClientAgent or es6 proxy or wrapper class
- if wrapper class, then must provide access to the client property
- used by other domains to call the agent service
NodeGraph
- data structure implementation of Kademlia
- stores the NodeId -> NodeAddress mappings in a keynode
- persistent uses DB and may require transactions

Both NodeManager and NodeConnectionManager should be exposed at the high level in PolykeyAgent and provided to createClientService.

Furthermore, the following lists the current usage of NodeManager for NodeConnection operations (i.e. the places where NodeConnectionManager would instead need to be injected):

Discovery
NodeManager
NodeGraph - this is a special case that will need some potential refactoring, see https://github.com/MatrixAI/js-polykey/issues/225#:~:text=in%3A%20%23224%20(comment)-,NodeGraph,-NodeGraph%20is%20a
NotificationsManager
VaultManager

The following diagram depicts these dependencies:

                       ┌───────────────────────┐            ┌─────────┐
┌─────────────┐        │                       ◄────────────┤Discovery│
│             │        │ NodeConnectionManager │            └─────────┘
│ NodeManager ├────────►                       │
│             │        │     ┌─────────────┐   │       ┌────────────────────┐
│ ┌─────────┐ │        │    ┌┴────────────┐│   ◄───────┤NotificationsManager│
│ │NodeGraph├─┼────────►  ┌─┴────────────┐├┘   │       └────────────────────┘
│ └─────────┘ │        │  │NodeConnection├┘    │
│             │        │  └──────────────┘     │           ┌────────────┐
└─────────────┘        │                       ◄───────────┤VaultManager│
                       └───────────────────────┘           └────────────┘

NodeManager

The NodeManager manages nodes, but does not manage the node connections.

It is expected that NodeManager requires NodeConnectionManager which is shared as singleton with other domains.

Operations relating to manipulating nodes should be on NodeManager such as claimNode and related operations.

Most of the time this will be called by GRPC handlers, but some domains like discovery might be using the nodes domain. It is also what encapsulates access and mutation to the NodeGraph.

NodeConnectionManager

The NodeConnectionManager manages a pool of node connections. This is what is being shared as a singleton to all the domains that need to contact a node and call their handlers. For example VaultManager would not use NodeManager but only NodeConnectionManager. This also includes notifications domain.

The connections are shared between each node connection.

Notice the lock being used to ensure that connection establishment and connection closure is serialised. Refer to: https://gist.github.com/CMCDragonkai/f58f08e7eaab0430ed4467ca35527a42 for example. This pattern is also used in the vault manager.

A draft mock of the class:

class NodeConnectionManager {

  connections: Map<NodeId, {
    conn?: NodeConnection;
    lock: MutexInterface;
  }>;

  public constructor(nodeGraph) {}

  public getNodeId(); // lots of connection-related requests need this (get from KeyManager)

  // refreshes the timeout at the end
  public withConnection(targetNodeId, async function());

  // these are used by `withConnection`
  // reusing terms from network domain:
  // connConnectTime - timeout for opening the connection
  // connTimeoutTime  - TTL for the connection itself
  protected createConnection();
  protected destroyConnection();

}

In the withConnection call, this is the only way to acquire a connection to be used by other domains. If this is to be similar to the VaultManager, that mean that VaultManager.withVault would need to exist.

The locking side effects could be done in NodeConnectionManager or inside the NodeConnection. However since this problem also exists in the VaultManager, I believe we want to do the side-effects as much as possible in the NodeConnectionManager. If locking during creation and destruction needs to occur, then it has to occur via callbacks that are passed into the NodeConnection.

NodeConnection

class NodeConnection extends GRPCClientAgent {
  // timeout acts like a TTL
  // the timeout must be refreshed everytime it is used
  // and while `withConnection` is in use, we can use a lazy lock for this
  public constructor(timeout) {
  }

  // every call is available here
  // but this is a public API
}

There may be a NodeConnectionInternal to have the actual methods while NodeConnection maybe a public interface similar to Vault and VaultInternal.

Usage

Any call to another keynode has to go through the NodeConnection. It is expected that callers will perform something like this:

async function doSomethingWithAnotherNode(nodeId) {
  await this.nodeConnManager.withConn(
    nodeId, 
    async (conn: NodeConnection) => {
      await conn.client.doSomething();
    }
  );
}

Such usage would occur in both NodeManager and other domains such as VaultManager.

Note that the conn.client.doSomething() may be conn.doSomething() instead. This depends on how NodeConnection is written.

Either way it is expected that doSomething is the same gRPC call available in GRPCClientAgent.

This means it is up to the domains to manage and handle gRPC call behaviour which includes the usage of async generators, metadata and exceptions. Refer to #249 for details on this.

The reason we are doing this is that abstractions on the gRPC call side are domain-specific behaviour. We wouldn't want to centralise all of the domain call gRPC abstractions into the nodes domain, that would be quite brittle.

This means the vaults domain could if they wanted to create their own wrapper over the connection object such as:

this.nodeConnManager.withConn(
    nodeId, 
    async (conn: NodeConnection) => {
        // vaultClient adds vault-specific abstractions on the gRPC calls
        const client = vaultClient(conn.client);
        await client.doSomethingWithVaults();
    }
);

Right now we shouldn't do any gRPC abstractions until #249 is addressed.

Lifecycle

The withConn method provides a "resource context". It's a higher order function that takes the asynchronous action that is to be performed with the connection.

It has a similar pattern to our poll, retryAuth, transact, transaction, _transaction wrapper functions.

The advantage of this pattern is that the node connection will not be closed (by the TTL) while the action is being executed.

The withConn will need to use the RWLock pattern as provided in src/utils.ts. It has to acquire a read lock when executing the action.

This enables multiple withConn to be executing concurrently without blocking each other.

At the end of each withConn, the TLL should be reset.

When the TLL is expired, it may attempt to close the connection. To do this, a write lock is attempted. If the lock is already locked, then we skip. If we acquire the write lock, we can close the connection.

If the client were to be closed or terminated due to other reasons, then an exception should be thrown when accessing the client or its methods.

This would be ErrorNodeConnectionNotRunning or ErrorGRPCClientAgentNotRunning or ErrorGRPCClientNotRunning.

Other domains should not be keeping around references to NodeConnection it should only be used inside the withConn context.

Hole Punching Relay

Hole punching is done between the ForwardProxy and ReverseProxy. Hole punching is not done explicitly in the NodeConnectionManager or NodeConnection.

Instead, hole punching relay is what is done by NodeConnectionManager. This means you relay a hole punching message across the network so that the target Node will perform reverse hole punching in case we need it. This process should be done at the same time as a connection is established.

This means NodeConnectionManager requires the NodeGraph, the same NodeGraph that is used by NodeManager.

The graph is needed to do optimal routing of the the hole punching relay message.

This means NodeConnectionManager would need to be used by createAgentService as the current keynode may be in the middle of a chain of relaying operations.

Public Interface

The public interface of NodeConnection can be a limited version of the internal class. This can hide methods like destroy from the users of NodeConnection. This ensures that only NodeConnectionManager can destroy the connection.

Dealing with Termination/Errors

The underlying GRPCClientAgent may close/terminate for various reasons. In order to capture this behaviour, you must hook into the state changes in the Client object. This will require changes to src/grpc/GRPCClient.ts, as it currently does not expose the Client object. Discussions on how to do this are in: https://github.com/matrixai/js-polykey/issues/224#issuecomment-966770905

NodeGraph

NodeGraph is a complex data structure. It appears to be needed by both NodeManager and NodeConnectionManager, but itself also needs to perform operations on NodeManager. Whenever there is mutual recursion, it can be factored out as another common dependency "absract the commonality". However if this is not suitable, then it can be done by passing NodeManager as this into NodeGraph.

However if NodeConnectionManager needs NodeGraph as well, then this is not possible unless NodeConnectionManager was also produced by NodeManager.

Further investigation is needed into the NodeGraph to understand how we can turn it into a data structure. Perhaps we can use a pattern of hooks where NodeGraph exposes a number of hook points which can be configured by NodeManager and thus allow NodeGraph to "callback" NodeManager and NodeConnectionManager.

If NodeGraph can be made purely a data structure/database, then:

The getClosestLocalNodes and getClosestGlobalNodes could be part of the NodeManager, and not the NodeGraph.
The seed nodes marshalling could occur in the NodeManager, and instead you just plug it with actual values.

There's only 3 places where NodeGraph is calling out to the world. All 3 can be done in NodeManager instead. That means NodeGraph no longer needs NodeConnectionManager.

Testing

For NodeConnectionManager and NodeConnection, it's essential that we don't duplicate tests for GRPCClientAgent as that would be handled by tests/agent. It must focus on just testing the connection related functionality. This is where mocks should be useful here, since we want to delay integration testing until the very end.

There are 4 kinds of mocks we have to use here.

Timer mocks - many code relies on timeouts, we must use timer mocking https://jestjs.io/blog/2020/05/05/jest-26#new-fake-timers to ensure that don't wait for the timeout.
Function mocks - as used in tests/bin/utils.retryAuth.test.ts, function mocks can be very useful way of providing a dummy function that we can later instrument. This can be useful if we pursue a hook pattern with NodeGraph.
Module/class mocks - as used in tests/bin/utils.retryAuth.test.ts (jest.mock('prompts')) this is high level mocking that should only be used when no other methods can be done. This is more complicated, but you can mock anything this way.
DI mocking, all of the dependencies talked about here should be DI-able, this should be used first before making use of Module/class mocks.

The tests for the nodes domain should be robust, reproducible, not flaky and complete within 10 - 30 seconds.

Larger integration tests should be done at a higher level, which will help us achieve #264.

This should ensure that nodes domain is more stable and does not break every time we change stuff in other domains.

Additional context

281 - Async Init Fixes that is applied in CARL & Session Management MR: https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213
194 - Seed Nodes Integration, application of DNS resolution is relevant here
249 - GRPC API Review
264 - Concurrent Testing Scalability

Tasks

Create NodeConnectionManager
Refactor NodeConnection into NodeConnectionManager and NodeManager
1. Add lock acquisition to destroy call of NodeConnection (the same process as construction): https://github.com/MatrixAI/js-polykey/issues/225#issuecomment-984262851
Factor out the commonality to prevent mutual recursion in NodeGraph and NodeManager and NodeConnectionManager
Introduce timer mocking to nodes testing.
Implement mocking for NodeConnection and NodeConnectionManager testing. Remove any tests that test agent service handling or forward proxy and reverse proxy usage. Rely on the other domain tests to handle this.
Fix up all domains that use NodeConnection to acquire their connections from NodeConnectionManager.

Consider the possibility that each "NodeConnection" object can be considered a "Data Access Object" that is actually handed to the relevant domains that require nodes to use. Then NodeManager is about maintaining a lifecycle of NodeConnection objects. This also means no other domains should maintain or keep the reference long-running. Node connection objects should only be short-running, and a pool of them is kept in the node manager.

This is similar to how database connection pools are done. See how async postgresql is done in Nodejs, or asyncpg in Python. Whether we are connecting to a database, or to another keynode, the architecture and style should be the same.

On 8/11/21 3:21 PM, Josh wrote:

  Requirements of this design
Eventually, we'd like to facilitate some discussion about the structure of the |nodes| domain. This would aim to improve the abstractions and separate shared classes of functionalities.
  Additional context
Currently the |nodes| domain has the following structure:

|NodeManager|: the top-level class that exposes API functions to the outside world. Contains and manages its own instance of |NodeGraph|, as well as handling all instances of |NodeConnection| objects. Currently, these functions are either: o public API 'relays' that internally call |NodeGraph| functions o connection-related functions (internally set up a |NodeConnection| object and then call |NodeConnection| functions)

|NodeGraph|: a data structure containing the implementation of Kademlia, used to distribute and store the node ID -> node address mappings in a keynode

|NodeConnection|: a class that handles a single client -> server connection between 2 keynodes (entry-point for agent to agent GRPC functions and message creation)

|NodeManager| is currently the only public-facing entrypoint to the nodes domain. Both |NodeGraph| and |NodeConnection| could be considered as a layer below |NodeManager|.

Roger and I were briefly discussing this structure when the idea of dependency injecting |NodeGraph| into |NodeManager| was brought up (in line with the other domains of Polykey). However, |NodeGraph| is currently taking |NodeManager| as a dependency in its constructor (passed as |this| when |NodeGraph| is constructed in |NodeManager|'s constructor):

// NodeManager.ts constructor({ ...}) { this.nodeGraph = new NodeGraph({
nodeManager:this, ... }) }

This is required such that |NodeGraph| can call the connection-related functionality specified in |NodeManager|.

A solution to resolve this mutual dependency would be to "abstract the commonality". We could potentially remove the connections functionality from |NodeManager| entirely, replacing it into a |NodeConnections| class (basically a manager specifically for the lifetime of |NodeConnection| objects).

But then we could ask ourselves what the point of this is? That is, what is |NodeManager| actually responsible for anymore?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MatrixAI/js-polykey/issues/225, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE4OHM5TT6QG75PBJTS4VLT4ICG7ANCNFSM5B5QZ7AA.

I think we should follow an old-school "Connection Pool" design pattern. Basically how client-server databases like postgresql and mysql is provided to web applications.

On 8/11/21 3:21 PM, Josh wrote:

  Requirements of this design
Eventually, we'd like to facilitate some discussion about the structure of the |nodes| domain. This would aim to improve the abstractions and separate shared classes of functionalities.
  Additional context
Currently the |nodes| domain has the following structure:

|NodeManager|: the top-level class that exposes API functions to the outside world. Contains and manages its own instance of |NodeGraph|, as well as handling all instances of |NodeConnection| objects. Currently, these functions are either: o public API 'relays' that internally call |NodeGraph| functions o connection-related functions (internally set up a |NodeConnection| object and then call |NodeConnection| functions)

|NodeGraph|: a data structure containing the implementation of Kademlia, used to distribute and store the node ID -> node address mappings in a keynode

|NodeConnection|: a class that handles a single client -> server connection between 2 keynodes (entry-point for agent to agent GRPC functions and message creation)

|NodeManager| is currently the only public-facing entrypoint to the nodes domain. Both |NodeGraph| and |NodeConnection| could be considered as a layer below |NodeManager|.

Roger and I were briefly discussing this structure when the idea of dependency injecting |NodeGraph| into |NodeManager| was brought up (in line with the other domains of Polykey). However, |NodeGraph| is currently taking |NodeManager| as a dependency in its constructor (passed as |this| when |NodeGraph| is constructed in |NodeManager|'s constructor):

// NodeManager.ts constructor({ ...}) { this.nodeGraph = new NodeGraph({
nodeManager:this, ... }) }

This is required such that |NodeGraph| can call the connection-related functionality specified in |NodeManager|.

A solution to resolve this mutual dependency would be to "abstract the commonality". We could potentially remove the connections functionality from |NodeManager| entirely, replacing it into a |NodeConnections| class (basically a manager specifically for the lifetime of |NodeConnection| objects).

But then we could ask ourselves what the point of this is? That is, what is |NodeManager| actually responsible for anymore?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MatrixAI/js-polykey/issues/225, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE4OHM5TT6QG75PBJTS4VLT4ICG7ANCNFSM5B5QZ7AA.

To add to this as well, currently we don't check for "liveness" of an active NodeConnection object. That is, while the networking domain cleans up any connections that have become inactive or offline, the nodes domain doesn't do such a thing. We do some kind of "activity" check in NodeManager::pingNode (checking an already active connection if it's still active, to ascertain whether the node is still online), but this doesn't clean any objects up, nor does it perform this operation on a regular basis.

Obviously, if we move to a connection pool pattern, then this will be handled with that.

Furthermore, we need to be correctly handling networking errors. From the nodes CLI fixes MR:

Networking errors are strangely managed here. The networking domain works a bit differently from the rest of the other domains. At any point in time, an exception may be thrown if the connection has timed out. In the networking domain this is managed by exception handlers that are attached to the relevant connections. The NodeManager cannot expect to "catch" these networking exceptions on a single control flow branch. They must attach handlers to each NodeConnection. This means the "handling" of exceptions itself is a stateful live operation. You must then design and consider the NodeManager as a state machine managing multiple connection state machines. Ensure that all errors caught are network domain level errors, you shouldn't be catching naked UTP_ETIMEDOUT errors.

Some notes regarding this from our sprint meeting a couple of weeks ago (9th August):

we can view the networking domain's components as singleton instances
that is, they act as an internal service that we share across our other domains
when we handle errors with respect to connections, we can't expect a single control flow like we would with a regular sequential program/function (e.g. do something and catch an error)
With connections, we have 2 kinds of errors:
1. During construction (can do a try-catch here - e.g. we currently do this when we ping a node, in the connection's start function)
2. During liveness of the connection object (these need to be handled with error handlers)
error handlers need to be attached to connection objects to handle this second type of error
furthermore, NodeConnection objects are interacting with the forward and reverse proxies. Relating to my previous comment, if the Connection object in the networking domain has thrown an error (and triggers some kind of event in the proxies), then we need to also handle/delete the corresponding NodeConnection object.

If we use the connection pool pattern, the above points need to be considered.

So we already have the concept of a "connection pool", whereby no other domains maintain or hold a long-running reference to a NodeConnection. All instances of NodeConnection are contained within a NodeConnectionMap type in NodeManager (see https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/205#note_699911271 for details of this map structure - it's used to enforce a locking when a connection is established). They're created as needed, and destroyed on NodeManager shutdown (as per previous comments, it also needs to be implemented such that are also destroyed upon connection closure in the proxies).

I've been thinking about the relationship of the nodes domain relative to other domains.

Currently the nodes domain gets injected to other domains to be used and those domains will often asks the nodes domain for node connections to work with.

I realise that there's a hierarchy here. Nodes domain is a lower level domain compared to vaults and, gestalt graph and related. High and low level relate to how user-facing it is. User operations include CLI and API calls to the PK agent which then are handled by a grpc server which will call these domains.

For example if vaults domain were at the same level as nodes domain, then it would mean that the call handler would interact with both domains directly and then pass any relevant objects from the nodes domain to the vaults domain. This is how I've often done my database structure like:

doSomething(dbConn, p1, p2)

The reason to do this was to be able to share db connection instance with multiple operations and to potentially allow the controller to control nested contexts relating to the connection such as a transaction context across multiple operations. Furthermore connection resource lifetime was tied to the controller handling rather than to specific model operations.

To a significant extent multiple domains may be involved in working against the same node in the same way multiple models may be working with the same DB connection.

If the nodes domain is considered something lower level that is injected into the different domains, it would be like constructing models by injecting a db dependency in them. This works as long as we think that the "controller" which in our case is the gRPC API handlers doesn't have to control nested contexts for a given node connection used by multiple domains.

This would mean that the domains like vaults and such is much more richer and performs controller-like functionality.

If we find out that there needs to be operations done across domains, how do we deal with this atm? Either we abstract another domain on top and then use that like we are doing with vaults domain on top of nodes and db domain, or we eventually flatten the hierarchy and do alot more operations at the grpc handlers.

The file structure right now lends towards a flattened hierarchy. But the internal connections through DI indicates a more complex hierarchy. It's usually better to have more flattened hierarchy and leave the composition at the very end for maximum flexibility but the trade off is more overhead and boilerplate.

I'll need to see how the vaults use the nodes domain along with other domains to really get a good feel of how to proceed.

@joshuakarp I think you have been focusing on nodes domain internal architecture and I'm also concerned about the relationship between nodes and other domains and how it will all work.

Furthermore, we need to be correctly handling networking errors. From the nodes CLI fixes MR:

Networking errors are strangely managed here. The networking domain works a bit differently from the rest of the other domains. At any point in time, an exception may be thrown if the connection has timed out. In the networking domain this is managed by exception handlers that are attached to the relevant connections. The NodeManager cannot expect to "catch" these networking exceptions on a single control flow branch. They must attach handlers to each NodeConnection. This means the "handling" of exceptions itself is a stateful live operation. You must then design and consider the NodeManager as a state machine managing multiple connection state machines. Ensure that all errors caught are network domain level errors, you shouldn't be catching naked UTP_ETIMEDOUT errors.

Some notes regarding this from our sprint meeting a couple of weeks ago (9th August):

we can view the networking domain's components as singleton instances

that is, they act as an internal service that we share across our other domains

when we handle errors with respect to connections, we can't expect a single control flow like we would with a regular sequential program/function (e.g. do something and catch an error)

With connections, we have 2 kinds of errors:

During construction (can do a try-catch here - e.g. we currently do this when we ping a node, in the connection's start function)

During liveness of the connection object (these need to be handled with error handlers)

error handlers need to be attached to connection objects to handle this second type of error

furthermore, NodeConnection objects are interacting with the forward and reverse proxies. Relating to my previous comment, if the Connection object in the networking domain has thrown an error (and triggers some kind of event in the proxies), then we need to also handle/delete the corresponding NodeConnection object.

If we use the connection pool pattern, the above points need to be considered.

Handling NodeConnection errors propagated from the network domain will be handled in #224.

I've turned this issue into an epic and added all the related issues that can all be solved in one go with some restructuring of the nodes domain. This was based on the discussion on the nodes domain test failures happening since we were reviewing the CARL, session management and async-init fixes #281 applied there.

I'll be fleshing out the specification here in the issue description regarding what we want to do.

Spec has been updated, please review. This is going to be the biggest loose end before release.

Implementation of this will require @tegefaulkes and @joshuakarp.

I've been playing around with some timer mocking for the existing tests (https://jestjs.io/docs/timer-mocks). Ideally I could get these reduced and working before we start refactoring, such that it's easier to diagnose a test failure from implementation as opposed to attempting to diagnose issues with this after refactoring nodes.

We can see the following test using runAllTimers to advance the setTimeout call (within sleep):

```ts import { sleep } from '@/utils'; describe('Timers', () => { test('testing', async () => { jest.useFakeTimers(); const s = sleep(10000); jest.runAllTimers(); await s; }); }); ``` ``` [nix-shell:~/Documents/MatrixAI/js-polykey]$ npm test -- tests/nodes/timers.test.ts > @matrixai/polykey@1.0.0 test /home/josh/Documents/MatrixAI/js-polykey > jest "tests/nodes/timers.test.ts" Determining test suites to run... GLOBAL SETUP PASS tests/nodes/timers.test.ts Timers ✓ testing (2 ms) Test Suites: 1 passed, 1 total Tests: 1 passed, 1 total Snapshots: 0 total Time: 2.569 s, estimated 3 s Ran all test suites matching /tests\/nodes\/timers.test.ts/i. GLOBAL TEARDOWN ```

For the ping test in NodeManager.test.ts, it's currently testing 3 situations in one test (such that we only need to setup a single PolykeyAgent instance):

An offline node returns a ping of false (requires a connection attempt to timeout)
An online node returns a ping of true (connection can be established)
A previously online node that goes offline returns a ping of false (existing connection needs to timeout)

What's interesting with the ping test though, is that we actually require this physical amount of time to pass, such that the underlying network connection can timeout. Trying to figure out how to mock that lower-level timeout.

We'll be merging #289 before beginning work on this refactoring effort. The seed nodes work is simple enough to merge, such that we don't over-complicate it with this refactoring.

As a very rough starting point, I'm estimating this would take about a week and a half (including testing and review). Currently scheduling it after the testnet deployment, from Monday Dec 6th to Wednesday Dec 15th.

As another supporting task, @emmacasolin could also be assigned to figuring out the timer mocking for the nodes tests with this issue (this already has some relation to #264 too). Would also be a reasonable introduction to the nodes domain.

@tegefaulkes will most likely be on leave when we're tackling this issue, so for now have unassigned him.

For our NodeConnection structure using the ObjectMap locking system (acquiring a lock on creation), this will also need to be added for destruction of NodeConnections (i.e. acquire the lock before destruction). Also discussed here regarding vaults: https://github.com/MatrixAI/js-polykey/pull/266#issuecomment-984262113

There was also a point brought up whether it would be worthwhile to refactor the nodes domain to use the start-stop pattern, instead of the current create-destroy-start-stop. From https://github.com/MatrixAI/js-polykey/pull/289#issuecomment-992016820, we've had to move some of the "starting" functionality of the nodes domain directly into the call to PolykeyAgent.start (as a result of some necessary dependencies on the ForwardProxy being started prior to these calls). If we were to move to start-stop, our call to NodeManager.start (and other such classes) would be purely found in PolykeyAgent.start (instead of internally called from within createPolykeyAgent with the call to NodeManager.createNodeManager).

Yes it just means you would be doing new NodeManager and related in the creation phase. But then actually do nodeManager.start() in the start phase. Similar to the networking domain objects.

A quick note for myself/@emmacasolin: when looking into connection timeouts for testing, have a read of this comment too https://github.com/MatrixAI/js-polykey/pull/289#issuecomment-992080275.

There was also a point brought up whether it would be worthwhile to refactor the nodes domain to use the start-stop pattern, instead of the current create-destroy-start-stop. From #289 (comment), we've had to move some of the "starting" functionality of the nodes domain directly into the call to PolykeyAgent.start (as a result of some necessary dependencies on the ForwardProxy being started prior to these calls). If we were to move to start-stop, our call to NodeManager.start (and other such classes) would be purely found in PolykeyAgent.start (instead of internally called from within createPolykeyAgent with the call to NodeManager.createNodeManager).

I've been having a think about this. Right now:

NodeManager
NodeGraph
NodeConnection all use the create-destroy-start-stop pattern.

Given that NodeGraph is the only domain that actually has any database state, I believe everything else could be safely changed to using start-stop. (Although, NodeGraph is an encapsulated dependency of NodeManager. I imagine this means that we'd need some kind of destroy call in the top-level NodeManager, such that it can propagate down to NodeGraph. This is how it's currently structured.)

Specifically with respect to the vaults domain, this refactoring effort shouldn't actually cause too many changes there.

Previously, the vaults domain was already (previously incorrectly) performing operations on the GRPCClientAgent directly, instead of calling internal NodeConnection functions (calling a getClient function, which returned the client, and then performing operations on the client directly):

      const nodeConnection = await this.nodeManager.getConnectionToNode(
        pullNodeId!,
      );
      const client = nodeConnection.getClient();
      // ... do operations on client

This functionality is actually what we intend to move towards, and as such, shouldn't require too many changes. We'll instead just need to wrap the operations on the client inside NodeConnection's withConnection higher-order function.

Notably, in the vaults domain, the following will need to be checked and changed:

pullVault
cloneVault

The actual API of the vaults domain remains unchanged though. We simply inject NodeConnectionManager instead of NodeManager.

After some discussion, a proposed architecture diagram:

                              ┌──────────┐
                              │ Sigchain │
                              └────▲─────┘
                                   │
                                   │
                                   │
                            ┌──────┴──────┐
                            │ NodeManager ├──────────────────┐
                            └──────┬──────┘                  │
                                   │                         │
                                   │                         │
                                   │                         │
                       ┌───────────▼───────────┐             │
┌──────────────┐       │                       │             │
│ ForwardProxy ◄───────┤ NodeConnectionManager │             │
└──────────────┘       │                       │       ┌─────▼─────┐
                       │   ┌──────────────┐    ├───────► NodeGraph │
┌──────────────┐       │   │NodeConnection│    │       └───────────┘
│ ReverseProxy ◄───────┤   └──────────────┘    │
└──────────────┘       │                       │
                       └─────▲─────▲────▲──────┘
                             │     │    │
                             │     │    │
                  ┌──────────┘     │    └─────────────┐
                  │                │                  │
             ┌────┴────┐    ┌──────┴─────┐  ┌─────────┴──────────┐
             │Discovery│    │VaultManager│  │NotificationsManager│
             └─────────┘    └────────────┘  └────────────────────┘

In this architecture some desired APIs:

NodeConnectionManager - findNodes, getClosestGlobalNodes, getClosestLocalNodes, these are extracted from NodeManager and NodeConnectionManager, it also facilitates hole punching
NodeManager - focuses on nodes-related high level functionality like sigchain operations, claiming operations and mediates access to the node graph, it has to use NodeConnectionManager when establishing connections to do work
NodeGraph - stays focused on database operations as much as possible, and is used by both NodeManager and NodeConnectionManager
GRPC handlers - for managing the current node or other nodes, it would use NodeManager, for receiving instructions about hole punching it would use NodeConnectionManager, so some handlers may end up using both
Discovery may use NodeManager or NodeConnectionManager or both, not sure as it needs to deal with both sigchain and traversing the P2P network

All other domains should be only using NodeConnectionManager to acquire the connections and do work on them. Order of operations.

NodeGraph - CDSS - because this is now shared between NodeConnectionManager and NodeManager, this has to be created outside of NodeManager as a singleton and shared - created first
NodeConnectionManager - SS or CDSS or CD - created second
NodeManager - SS or CDSS or CD - created third

Note that NodeConnectionManager requires ForwardProxy and ReverseProxy, and these are started with knowledge only after starting the GRPC servers. But these can be passed during creation.

@joshuakarp we will need specced out APIs for of the nodes classes.

Some notes from our discussion about comparing its API to the vaults domain:

1. Review ID Usage

2. NodeConnectionManager Refactoring
3. Networking and NodeConnection Errors
4. Transaction Usage in NodeGraph

5. NodeGraph Testing

6. P2P Integration Testing Including Hole Punching

NodeConnection
  GRPCClient -> ForwardProxy -> ReverseProxy -> GRPCServer

class Domain {
  // retries are domain specific, we wouldn't auto retry for the node connection manager
  public async doSomething() {
    nodeConnectionManager.withConnection(async (conn) => {
    await sleep(1000);
    // the conn is actually "destroyed" because there was a state change in the underlying socket
    await conn.X(); // this throws a js-async-init exception ErrorNodeConnectionNotRunning()
    });
  }
}

class NodeConnectionManager {
  connections: Map();
  public withConnection() {
    const conn = new NodeConnection(onDestroy);
  }
}

class NodeConnection {
  // This may need to make use of locking
  async createNodeConnection(connections) {
    this.connections = connections;
    const client = new GRPCClientAgent(onError);
  }

  // You have to lock this as well, similar to the Vault lifecycle
  async destroy() {
    // delete itself from the node connections state
    this.connections.delete(nodeId);
    // this is an alternative handler passed in from NodeConnectionManager
    this.onDestroy();
  }

  onError = () => {
    await this.destroy();
  };
}

class GRPCClientAgent extends GRPCClient {

}

class GRPCClient {
  async createGRPCClient(onError) {
    this.client.watchConnectivityState(currentState, deadline, async () => {
      await onError();
    });
  }
}

// The vaults API is an interesting comparison as they have a low level API
public vaultManager.createVault(): Promise<void>
protected vaultManager.openVault()
protected vaultManager.closeVault()
public vaultManager.destroyVault(): Promise<void>

// If we follow `NodeConnectionManager`, we may also expose a higher level callback API
vaultManager.withVault(async (v1) => {
  vaultManager.withVault(async (v2) => {
    vaultOps.mv(v1, v2);
  });
});

// Example of using this with vault ops which takes the vaults
vaultOps.mv(v1, v2) {
  // dual contexts, nested commits, similar to python's `with`
  // watch out, what happens if v1 and v2 are the same vault?
  withCommit([v1, v2], async ([fs1, fs2]) => {

  });
}

MatrixAI / Polykey