tegefaulkes closed this issue 1 week ago
This is going to be a pretty large change. Converting everything to `TimedCancellable` is pretty simple; there are just a lot of things to change and cover.
Some of the code uses `TimedCancellable` already but needs to be reviewed again. I don't think the default timeout for the `timedCancellable` decorator can be set at any time. It has to be defined at import time. This means we can't configure a default timeout when starting the `NodeConnectionManager`. We also don't want to have hard-coded timeouts everywhere, so we may have to define the defaults in the config file.
There are several places where `timedCancellable` and variants of it should be used: basically anywhere there are non-deterministic side effects. Pretty much all of them concern network usage.
`Promise.race` workarounds until https://github.com/MatrixAI/js-async-locks/issues/21. For long-running computations, we're talking about generators and related iterators. In these cases, the cancellation of generators hasn't been entirely worked out yet; instead it is expected that you would stop consuming the generator (if you don't ask for the next item, it would continue computing). The only case where cancellation applies is when you do ask for the next item but must await it, in which case you'd like the ability to do `p.cancel()`. However, at the moment we don't have this ability yet, so instead you may need to pass an overall signal to the generator function and have it signal cancellation the traditional way.
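A minimal sketch of the traditional approach, passing an `AbortSignal` into the generator function (the generator here is illustrative, not real code from the repo):

```ts
// Illustrative sketch: a generator that observes an external AbortSignal.
// Cancellation only takes effect between yields; a consumer that simply
// stops asking for items leaves the generator suspended instead.
async function* counter(signal: AbortSignal): AsyncGenerator<number> {
  for (let i = 0; ; i++) {
    if (signal.aborted) return; // signal cancellation the traditional way
    yield i;
  }
}
```

The consumer drives cancellation by aborting the controller before requesting the next item; the generator then finishes on the next `next()` call.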
Remember that, similar to `async` functions, I believe `async function*` forces them to be regular promises and not our extended promises. But I don't remember if I confirmed this fact.
As for the network, pretty much all network operations should be cancellable, or timed cancellable if there's a timeout. This requires a big refactor of the `Proxy` code to support this. There are multiple "default" timeouts currently configured during class construction. If we use the decorators, the defaults have to be set statically and cannot take in instance parameters.
Alternatively you can also not use the decorators, and instead directly use `PromiseCancellable` while taking in the signal. That would allow you to take the instance/constructor parameters as the defaults for the `Proxy`.
So we have to decide whether to use static parameters and the decorators, or whether to swap to just using `PromiseCancellable` and taking in the signal directly.
It's important to understand that the decorators are for convenience, but they are not infinitely flexible; they trade off some flexibility for convenience.
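To make the non-decorator option concrete, here is a minimal sketch (the class, method, and error names are made up for illustration, not the real `Proxy` API) where the constructor parameter supplies the default deadline and the method takes a signal directly:

```ts
// Sketch: instance-level default timeout, which a statically parameterised
// decorator cannot express, plus a directly passed-in AbortSignal.
class ProxyLike {
  protected connTimeoutTime: number;
  public constructor(connTimeoutTime: number = 20000) {
    this.connTimeoutTime = connTimeoutTime;
  }
  // The default comes from the instance at call time
  public async openConnection(
    timeout: number = this.connTimeoutTime,
    signal?: AbortSignal,
  ): Promise<string> {
    return await new Promise<string>((resolve, reject) => {
      const timer = setTimeout(
        () => reject(new Error('ErrorConnectionTimeout')),
        timeout,
      );
      signal?.addEventListener('abort', () => {
        clearTimeout(timer);
        reject(new Error('ErrorConnectionAborted'));
      });
      // Simulated dial: resolves after 10 ms
      setTimeout(() => {
        clearTimeout(timer);
        resolve('composed');
      }, 10);
    });
  }
}
```

This keeps full flexibility over defaults at the cost of writing the timeout and signal plumbing by hand in every method.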
> I don't think the default timeout for the `timedCancellable` decorator can be set at any time. It has to be defined at import time. This means we can't configure a default time when starting the `NodeConnectionManager`. We also don't want to have hard coded timeouts everywhere so we may have to define the defaults in the config file.
Yes, because if decorators are static, then the parameters they take are static too. This is of course only necessary for "default" deadlines. Whatever is passed into the method/function will override it. If we do use static deadlines, then it does make sense to propagate these static parameters directly from the config object.
I discussed that you can also use signals and promise cancellable directly in the method to avoid static parameterisation...
I suggest we do a priority here. Starting with the `network` domain, we should refactor it to support timers and cancellability (this may mean we can remove the existing timer utility). Additionally there's the `identities` domain, which also has network side effects from contacting identity providers; these should start using timed cancellable too.
I'm thinking there are sort of like 3 levels of defaults here, starting from `src/config.ts`.
I'm not sure if 2 is possible if we use the decorators, unless we change them to look up parameters on the instance when the method parameter isn't passed in (that is, it is set to `undefined`). But we would need to pass in a string to indicate the name of the parameter to look up.
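Assuming the three levels are the config default, an instance-level override, and the per-call parameter (my reading of the discussion, not confirmed), the resolution order could be sketched like this (the config shape here is assumed, not the real `src/config.ts`):

```ts
// Level 1: static defaults from a config module (assumed shape)
const config = { defaultsSystem: { connectTimeoutTime: 20000 } };

class NodeConnectionManagerLike {
  // Level 2: instance-level override, falling back to the config default
  protected connectTimeoutTime: number;
  public constructor(connectTimeoutTime?: number) {
    this.connectTimeoutTime =
      connectTimeoutTime ?? config.defaultsSystem.connectTimeoutTime;
  }
  // Level 3: the per-call parameter wins if provided
  public resolveTimeout(timeout?: number): number {
    return timeout ?? this.connectTimeoutTime;
  }
}
```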
I think it is possible to do all 3.
This is the function head:

```ts
function timedCancellable(
  lazy: boolean = false,
  delay: number = Infinity,
  errorTimeoutConstructor: new () => Error = contextsErrors.ErrorContextsTimedTimeOut,
)
```
It's possible to do something like:

```ts
function timedCancellable(
  lazy: boolean | string | symbol = false,
  delay: number | string | symbol = Infinity,
  errorTimeoutConstructor: new () => Error = contextsErrors.ErrorContextsTimedTimeOut,
)
```
And then do `timedCancellable('lazyProp', delaySym)`, and later use `this[lazy]` and `this[delay]` to acquire the parameters.
```ts
// If it was `undefined`, it would be set to `false`
if (typeof lazy !== 'boolean') {
  lazy = this[lazy];
}
// If it was `undefined`, it would be set to `Infinity`
if (typeof delay !== 'number') {
  delay = this[delay];
}
```
However this pattern may not be scalable to other decorators which have actual string/symbol parameters...
So it may be better to actually have a separate set of parameters to refer to the instance.
```ts
function timedCancellable(
  lazy?: boolean,
  delay?: number,
  errorTimeoutConstructor: new () => Error = contextsErrors.ErrorContextsTimedTimeOut,
  defaults?: {
    lazy: 'lazyProp';
    delay: 'delayProp';
  },
)
```
But I find this syntax way too verbose...
We could also use a different type, like arrow functions.
```ts
function timedCancellable(
  lazy: boolean | ((this: any) => boolean) = false,
  delay: number | ((this: any) => number) = Infinity,
  errorTimeoutConstructor: new () => Error = contextsErrors.ErrorContextsTimedTimeOut,
)
```
It'd be unlikely for a decorator parameter to be an arrow function, but if they needed to differentiate, they could also check the number of parameters... etc.
Notice the `this` is literally `any` here.
Using it would be like:

```ts
class X {
  lazyProp = true;
  @timedCancellable(function (this: X) {
    return this.lazyProp;
  })
  method() {}
}
```
The last style is likely the more flexible.
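A small sketch of how the decorator body could resolve either form at call time (the helper name is made up; the edge case where the parameter type is itself a function would remain ambiguous, as noted above for strings/symbols):

```ts
// Hypothetical helper: a decorator parameter is either a literal value or a
// function evaluated against the instance when the method is called.
type LazyParam<T> = T | ((this: any) => T);

function resolveParam<T>(self: any, param: LazyParam<T>): T {
  return typeof param === 'function'
    ? (param as (this: any) => T).call(self)
    : param;
}
```

The decorator wrapper would then call something like `resolveParam(this, delay)` before constructing the timer.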
I've added related issues #243 and #242. They should be addressed together here.
If there are going to be multiple PRs here, this should be made an epic and with subissues and subPRs.
Full work on this might take 1 sprint and a bit.
The todos relating to cancellability should also be done in this issue. This issue is an epic so there are multiple subissues. @tegefaulkes you can add more issues into this epic as you see fit.
I think we can split up the work in order of:
It might be better to address the locking alongside the other issues when I encounter any locks.
@CMCDragonkai pointed out that we want to reduce the punch packet interval. We can set it to something like 100 or 50 ms.
We can also send a whole bunch of packets at once when trying to connect. In my experience, however, if you communicate too much with a closed port, some firewalls will start to ignore you. With a good connection you only need to send 2 packets each way within a second to establish a connection through a NAT. For connections with packet drop you'll need to send more to make up for the drop, but you don't need a crazy amount.
I thought I attached more issues to this epic earlier, I need to fix that.
Need a new sub issue for Identity manager connections. This is also needed for discovery.
As discussed before, the creation of a NodeConnection
with createNodeConnection
doesn't convey the timeout to the proxy connection. The proxy connection is started automatically when it receives a connection from the NodeConnection
. This means we can end up with an un-composed proxy connection if the NodeConnection
creation times out.
We can fix this by having the NodeConnection
start the ForwardConnection
directly with the timeout. Then it can connect to the proxy and compose like normal.
This is not a critical change; currently, if an uncomposed forward connection exists it will be cleaned up after a TTL period of no activity.
Let's keep it the way it is... the proxy's internal connection logic is somewhat separate from the NodeConnection and GRPC connection. This is due to how we are using the HTTP CONNECT protocol to connect these 2 disparate parts.
When we get around to working on QUIC or WebRTC #234 and replacing GRPC with jsonrpc in #249, then we won't need any workarounds, everything should just be tightly integrated.
So if it can still work with the TTLs, let's just continue doing this.
It does make sense that it's possible to create proxy connections without there being a corresponding node connection. That's the main issue here. And I think we can continue with this design.
Created issues #477 and #476 for discovery and identities.
I'd want to introduce the ability to set { timer: 1000 }
or some number parameter for convenience purposes.
So this means when we move to JSON RPC, we can add another parameter to our params
object, which is the timeout
.
If you don't set `timeout`, then the handler may default to something else, or may default to infinite time.
We can have these 3 meanings:
```ts
timeout: undefined; // default
timeout: null; // infinite
timeout: 1000; // 1000ms
```
It might be a good idea to have `timeout: null` available. Maybe only a select few calls can accept this, because it is ripe for abuse.
Given that, I think we should only allow the default timeout or a specified timeout that is less than the default. Any values larger than the default can be clipped to the default value. Or, if we want to avoid the implicit behaviour, we can throw an error for a bad value.
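The clipping rule could look like the following sketch (leaving out the error-throwing variant; the function name is illustrative):

```ts
// Sketch: resolve a requested timeout against the server-side maximum.
// `null` means infinite and is only honoured where explicitly allowed.
function resolveTimeout(
  requested: number | null | undefined,
  max: number,
  allowInfinite: boolean = false,
): number {
  if (requested === undefined) return max; // default to the maximum
  if (requested === null) {
    return allowInfinite ? Infinity : max; // clip disallowed infinite requests
  }
  return Math.min(requested, max); // clip values larger than the maximum
}
```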
A slight problem with this is that we currently don't enforce any timeouts in the RPC system. The only way to convey timeouts is through the metadata, which the RPC system doesn't care about or even perceive, as it's part of the middleware or user implementation.
Unless we add a timeout parameter to the message structure, it's up to the middleware to handle the timeout logic. But currently the middleware can't affect any timeouts or handler cancellation; it's meant to short-circuit and reject the current stream immediately.
If we want to allow the middleware to affect cancellation, then we need to pass it an abort controller and have it handle its own abort logic. That way we can create timeout or cancellation logic inside the middleware. While we're at it, we may as well provide the middleware with other contextual information such as the `ConnectionInfo`.
I think it depends on a few things.
The timeout is then a proxy for the amount of work that should be accepted by the server, and the expected deadline by the agent to complete.
The timeout should be negotiated: clients can ask for a shorter timeout, or for a longer timeout but only up to the maximum timeout.
How do timeouts work for streams? Is it the timeout for the entire stream or only for a single message of the stream? How does grpc implement deadlines for their streams? I would prefer timeouts per message because it measures liveness and can be used for infinite streaming if that's what we are trying to implement.
Therefore each rpc method should have a max timeout on the server side.
The caller if not specifying a timeout can default to max. But can specify a lower timeout.
I don't think the middleware should be dealing with timeouts.
The timeouts should be part of the message params just like other metadata.
The timeout info must be passed into the subsequent methods. There are some methods that will cost time but won't take a timeout. These are synchronous methods...
For all other async methods a timeout can be used to cancel any operation.
So technically this means the timeout isn't 100 percent accurate since it won't cancel sync operations on the server side.
On the client side though, the timeout would just cancel the call and ignore any data. The only way to deal with this is an exception on the server side which could cancel sync operations.
The deadline mechanism works in the JSON RPC system now. It's just a matter of doing agent migration. But also ensuring that the subissues here are completed.
Applying it to sigchain depends on agent migration.
Applying it to identity management only requires bridging it to the underlying IO, which can be just the `fetch` function provided by cross-fetch.
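That bridging could be a thin wrapper that converts the deadline into an `AbortSignal`, since `fetch` accepts one (a sketch; the wrapper name is made up, and any signal-aware async IO can be bridged the same way):

```ts
// Sketch: abort the signal after `timeout` ms, so the deadline propagates
// into whatever signal-aware IO operation is passed in.
async function withDeadline<T>(
  timeout: number,
  f: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeout);
  try {
    return await f(controller.signal);
  } finally {
    clearTimeout(timer);
  }
}

// Usage with fetch (cross-fetch's fetch accepts `signal` too):
// await withDeadline(1000, (signal) => fetch(url, { signal }));
```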
This issue is very old and predates the network stack overhaul. I'm going to close it as not planned.
This is still required. It's independent of networking. It's basically any kind of operation that is meant to be timed and cancellable. It's a general pattern. Of course new issues should be more specific.
Specification
With the application of `timedCancellable` to functions within the nodes domain, we have encountered new connection errors randomly during tests. These errors started after applying a deadline of 1500 ms to `pingNode` usage. We suspect the problem is coming from `openConnectionForward` after setting such a short deadline. These may be bugs that existed previously but were never revealed by the current testing. Currently there is 1 `NodeConnectionManager` test where we override the deadline for `openConnectionForward`. The errors occur randomly and are as follows; they can be re-created by running all of the nodes domain tests concurrently.
Additional context
Tasks