MatrixAI / Polykey-CLI

Polykey CLI - Open Source Decentralized Secret Sharing System for Zero Trust Workflows
https://polykey.com
GNU General Public License v3.0

Integration Tests for `testnet.polykey.com` #71

Closed emmacasolin closed 8 months ago

emmacasolin commented 2 years ago

Specification

We need a suite of tests to cover interacting with a deployed agent, which we can do using testnet.polykey.io. These tests need to cover various different connection scenarios:

These tests should go into their own subdirectory tests/testnet and should not be run with the other tests. They should be disabled in our jest config and should only run when explicitly called (which will happen during the integration stage of our pipelines).
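For example (sketch only; the exact mechanism and file name are still to be decided), the default jest config could simply ignore the directory, with the integration stage invoking jest against a separate config or explicit path:

```
// jest.config.ts (sketch): keep tests/testnet out of the default test run.
export default {
  testPathIgnorePatterns: ['/node_modules/', '<rootDir>/tests/testnet/'],
};
```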

Required tests:

Additional context

Tasks

Emergent bugs

CMCDragonkai commented 2 years ago

Seems like our tests aren't simulating long-running behaviour. This is unfortunate, because certain bugs only became apparent after running the nodes on the testnet for longer than 1 hr.

We might need to expand our modelling tooling to help with correctness, as well as apply monitoring/remote debugging tools to our testnet so we can observe memory and CPU usage. Node.js provides this via its debugging port. This could be enabled on our testnet nodes, possibly bootstrapping off the client port, and related to MatrixAI/Polykey#412.

emmacasolin commented 2 years ago

I'm also noticing that the re-connection attempts discussed in MatrixAI/Polykey#413 and MatrixAI/Polykey#415 don't appear when I use the same setup locally (two local agents where the one that gets started first is set as a seed node for the second one), so even if we had tests to simulate long-running behaviour they may not have picked up these issues since they may be limited to the testnet environment.

CMCDragonkai commented 2 years ago

Also unlike our NAT tests which use conditional testing with describeIf and testIf, these tests are conditional on a certain stage in our pipeline. We may continue to use this technique rather than explicitly excluding the tests in the jest config (or using explicit group tagging https://stackoverflow.com/questions/50171932/run-jest-test-suites-in-groups and https://morioh.com/p/33c2bd031589).

This approach means you use describeIf as well, but then depend on a condition that is only available via environment variables. This would be similar to @tegefaulkes' work with docker integration tests which rely on a special test command environment variable.

We could re-use NODE_ENV or create our own env variable that is then funnelled into the jest.config.js as global parameters.
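A minimal sketch of the idea (PK_TEST_PLATFORM is a hypothetical variable name here, not something we've decided on):

```
// tests/utils.ts (sketch): conditional test groups gated on an env variable.
const describeIf = (condition: boolean) => (condition ? describe : describe.skip);
const testIf = (condition: boolean) => (condition ? test : test.skip);

// In a testnet test file:
describeIf(process.env.PK_TEST_PLATFORM === 'testnet')('testnet connection', () => {
  testIf(true)('connects to the seed nodes', async () => {
    // ...
  });
});
```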

CMCDragonkai commented 2 years ago

The only difference in the testnet environment is running in a docker container. If the docker container is producing different behaviour, that has to be illuminated with the docker integration testing MatrixAI/Polykey#407.


Caveat: when using testnet.polykey.io you are using NLBs as well... so that adds extra complexity. But this is why we are sticking to the public IPs first.

CMCDragonkai commented 2 years ago

@emmacasolin you can always run your own docker container locally and set that up as your local "seed node" and have it continuously run while you write tests against it.

Just make sure to feed it all the required env variables or parameters, and mount the necessary namespaces. Meet with @tegefaulkes about this; he's already doing this currently.

CMCDragonkai commented 2 years ago

If we continue down the path of reusing describeIf, it would be optimal to move our imports to be asynchronous under the describe. That would be similar to our bin commands where we use dynamic imports. Problem is, describe doesn't support asynchronous callbacks. A long term solution is to use top-level await when it becomes available in jest: https://github.com/facebook/jest/issues/2235#issuecomment-585195125
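A rough sketch of the interim workaround (the module path is illustrative only), deferring the dynamic imports to beforeAll, which does accept an async callback and is not executed for skipped suites:

```
describe('testnet connection', () => {
  let PolykeyAgent: any;

  beforeAll(async () => {
    // Deferred so the module is only loaded when the suite actually runs
    ({ default: PolykeyAgent } = await import('../../src/PolykeyAgent'));
  });

  test('starts an agent against the testnet', async () => {
    // use PolykeyAgent here
  });
});
```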

emmacasolin commented 2 years ago

I'm also noticing that the re-connection attempts discussed in MatrixAI/Polykey#413 and MatrixAI/Polykey#415 don't appear when I use the same setup locally (two local agents where the one that gets started first is set as a seed node for the second one), so even if we had tests to simulate long-running behaviour they may not have picked up these issues since they may be limited to the testnet environment.

This is actually incorrect. I left my local setup going in the background for several hours and both agents had attempted multiple node connections to the other agent when I checked back (so many that they were cut off). At one point when I checked one of the agents was displaying these logs (rapidly) and the other was silent but now both of them are silent. This might be from the refresh buckets queue if this is something that happens every hour? I'm going to try reducing the time between refreshes and adding timestamps to the logs.

As a side note, I think these logs appear to be infinite and constant on the testnet because it's a lot slower than my local machine, so it's only able to attempt a connection every 20 seconds, and since it takes so long to get through them it's refreshing them again by the time it's finished.

emmacasolin commented 2 years ago

I think I know what the main issue causing our "infinite loop" is. The timeout for opening a forward proxy connection is 20 seconds, so if we try to connect to a node that is offline it blocks the refresh buckets queue for 20 seconds. The refresh timer for the refresh buckets queue is an hour (i.e. buckets are added to the queue every hour). 1 hour / 20 seconds is 180 nodes per hour if we try to connect to an offline node for every bucket (which is the case if we only have one node in our node graph and it's offline). Since there are 256 buckets, this means we won't get through all of the buckets within the hour, and buckets will begin to be added to the queue again at the same rate that they're removed. So the queue will have 256-180=76 buckets in it forever (until the node in our node graph comes back online).

I'm not sure if this blocks the entire event loop as well, in which case this is definitely a problem.

tegefaulkes commented 2 years ago

The refresh bucket and ping node queues work asynchronously in the background. I don't think it will block the event loop.

The excessive contacting of nodes as part of the refresh bucket queue is not ideal. I think we do need to optimise this but the problem is, how? We need a way to determine if we are not gaining any new information and just stop refreshing buckets for a while. But right now the real problem is that we're attempting to contact the same offline node over and over again. Right now two things come to mind to address this.

  1. Just use a smaller timeout. 20 seconds seems a bit much.
  2. Track the last time we tried to contact a node, and maybe the number of attempts. If we keep failing to contact it, we can skip it for a period of time (see the sketch below).

Just a note that more aggressive removal of nodes from the node graph, such as removing a node as soon as we see it's offline, would fix this. However, this will lead to us removing nodes that may only be temporarily offline. Or worse, if a node's network goes down it will clean out its NodeGraph.
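A minimal sketch of option 2 (names, intervals, and the backoff policy are all assumptions, not the actual implementation):

```
type NodeIdString = string;

class ContactTracker {
  protected lastFailure = new Map<NodeIdString, { time: number; attempts: number }>();

  public shouldSkip(nodeId: NodeIdString, baseDelay = 60_000, maxDelay = 3_600_000): boolean {
    const entry = this.lastFailure.get(nodeId);
    if (entry == null) return false;
    // Exponential backoff capped at maxDelay
    const delay = Math.min(baseDelay * 2 ** (entry.attempts - 1), maxDelay);
    return Date.now() - entry.time < delay;
  }

  public recordFailure(nodeId: NodeIdString): void {
    const entry = this.lastFailure.get(nodeId);
    this.lastFailure.set(nodeId, {
      time: Date.now(),
      attempts: (entry?.attempts ?? 0) + 1,
    });
  }

  public recordSuccess(nodeId: NodeIdString): void {
    this.lastFailure.delete(nodeId);
  }
}
```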

CMCDragonkai commented 2 years ago

Please create a new PR to tackle this issue; you may wish to incorporate the subissues within this epic too.

CMCDragonkai commented 2 years ago

This takes over from MatrixAI/Polykey#159. The last few comments there are useful (https://github.com/MatrixAI/Polykey/issues/159#issuecomment-1180056681) regarding any re-deployment of the testnet.

CMCDragonkai commented 2 years ago

@emmacasolin please check MatrixAI/Polykey#148 in relation to these tests, what needs to be done for that issue.

CMCDragonkai commented 2 years ago

These tests should go into their own subdirectory tests/testnet and should not be run with the other tests. They should be disabled in our jest config and should only run when explicitly called (which will happen during the integration stage of our pipelines).

This is because these tests call out to the external network. Our check stage unit tests should not require external network access; those tests should pass even when offline.

In MatrixAI/Polykey#435 I'm proposing the use of directories to represent "allow lists".

Because we now have groups of tests that we want to run during the "check stage" (even the cross-platform check stage) and groups of tests that we want to run as part of the "integration stage", these tests belong to the integration stage, as they test the integration with testnet.polykey.io.

Our initial group should just be tests/integration to indicate integration tests. In there, for this epic, we should have tests/integration/testnet.

During integration testing (where each platform is tested), it will also test against the testnet on top of this.

So for example, @tegefaulkes, during the docker integration tests it will not only be testing tests/integration/docker, but also tests/integration/testnet. While for windows it would be tests/integration/windows and tests/integration/testnet.

We also have nix integration testing, which should test tests/integration/nix and tests/integration/testnet.

Right now integration testing would mostly reuse tests from tests/bin (and until MatrixAI/Polykey#435 is resolved, it cannot really change to testing tests/integration/testnet).

Once this is set up, we can evaluate whether all the regular unit tests, including tests/bin, should be moved down one directory to tests/check.

CMCDragonkai commented 2 years ago

Now these tests may still fail, so you need to write stubs for all the preconditions and postconditions, taking into account MatrixAI/Polykey#403, and also the changes we will be making in MatrixAI/Polykey#329.

The PR for the issues within this epic should be targeting the staging branch but should be cherry-picking changes that are occurring in MatrixAI/Polykey#419, as that's where the initial new DB will be applied and many concurrency issues resolved. @tegefaulkes and I will be focusing on MatrixAI/Polykey#419 while @emmacasolin will be working on this issue.

Finally MatrixAI/Polykey#434 and MatrixAI/Polykey#432 should be done first and merged into staging before attempting this PR.

Focus on making a start on all the test cases even if they will be failing for now; fixes should be pushed into staging or assigned to MatrixAI/Polykey#419.

CMCDragonkai commented 2 years ago

I added MatrixAI/Polykey#388 as a subtask of this.

tegefaulkes commented 1 year ago

As discussed just now.

With the use of domains containing multiple A records (e.g. our testnet, testnet.polykey.io), it seems evident that the mapping of NodeIds to IPs is a many-to-many relationship.

Given a node graph with the following mappings, we can end up with 4 cases that express these relationships.

NodeGraph<NodeID, Host> {
  // C1 The same node ID on different IPs (NOT POSSIBLE)
  NID2 -> IP1,
  NID2 -> IP2,
  // C2 Multiple NIDs on the same IP
  NID5 -> IP3,
  NID6 -> IP3,
  // C3 The same node ID on different hostnames (NOT POSSIBLE)
  NID1 -> HOSTNAME1,
  NID1 -> HOSTNAME2,
  // C4 Multiple NIDs on the same host name
  NID3 -> HOSTNAME3,
  NID4 -> HOSTNAME3,
  // It's possible to have unions of all 4 cases
  NID7 -> IP4,
  NID7 -> IP5,
  NID7 -> HOSTNAME1,
  NID8 -> IP4,
  NID8 -> HOSTNAME2,
  NID9 -> IP4,
  NID9 -> HOSTNAME1,
  NID9 -> HOSTNAME2
};

The NG isn't aware of the gestalt graph, and neither is it aware of the certificate chain relationship. So it's not aware of either axis. Both axes will be dealt with by other systems.

To fully support all of this we need to apply 2 things.

  1. When establishing a connection to a node, if we are given a host name/domain we need to resolve this to one or more IPs. With this set of IPs we need to attempt connections to each one with some degree of concurrency. When we find the node we are after we use that connection. All other connections should be ignored, either cancelled, failing to connect or rejected based on verification.
  2. For general network entry we need to support connecting to a hostname/domain with the expectation that we find one or more of a set of provided nodes. So we can connect to a set of nodes on a single domain such as the testnet.
This allows the user to attempt a connection to multiple nodes [NID7, NID8, NID9] with a set of IPs/connections:
--seed-nodes="NID1@testnet.polykey.io:1314;NID2@testnet.polykey.io:1314"

  // NETWORK ENTRY
  Promise.allSettled([
    nodeManager.getConnection(nid1),
    nodeManager.getConnection(nid2),
  ]);
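Purely for illustration, that seed-nodes string could be parsed into NodeId/host/port triples like this (the separator rules are assumed from the example only):

```
interface SeedNodeSpec {
  nodeId: string;
  host: string;
  port: number;
}

function parseSeedNodes(value: string): SeedNodeSpec[] {
  return value
    .split(';')
    .filter((entry) => entry.length > 0)
    .map((entry) => {
      const [nodeId, address] = entry.split('@');
      const [host, port] = address.split(':');
      return { nodeId, host, port: parseInt(port, 10) };
    });
}

// parseSeedNodes('NID1@testnet.polykey.io:1314;NID2@testnet.polykey.io:1314')
// -> [{ nodeId: 'NID1', host: 'testnet.polykey.io', port: 1314 }, ...]
```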

Ultimately the idea is that if we go looking for a node we can find that specific node. Each node is unique and what we're after when connecting to nodes is reasonably specific to that node. Kademlia and NAT punch-through depends on this.

The problem with using NLBs here is that we end up with a situation where we have multiple nodes on the same IP and port, with no way to intentionally connect to one or the other. This is not so much a problem when entering the network, where we only care about contacting any node that we can trust. For any other interaction we depend on information and state unique to the node we're looking for.
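A sketch of point 1 above — connectTo() here is a hypothetical helper that rejects if the remote certificate's NodeId doesn't match the expected one; none of these names are the real API:

```
import { promises as dns } from 'dns';

interface Connection {
  close(): void;
}

// Hypothetical: dials one IP and verifies the remote NodeId, rejecting otherwise.
declare function connectTo(
  ip: string,
  port: number,
  expectedNodeId: string,
): Promise<Connection>;

async function connectToHostname(
  hostname: string,
  port: number,
  expectedNodeId: string,
): Promise<Connection> {
  // A domain like testnet.polykey.io may resolve to multiple A/AAAA records
  const results = await Promise.allSettled([
    dns.resolve4(hostname),
    dns.resolve6(hostname),
  ]);
  const ips = results.flatMap((r) => (r.status === 'fulfilled' ? r.value : []));
  // Attempt all candidates concurrently; the first verified connection wins.
  // A fuller version would also cancel/close the losing attempts as they settle.
  return Promise.any(ips.map((ip) => connectTo(ip, port, expectedNodeId)));
}
```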

tegefaulkes commented 1 year ago

For the infrastructure, the go-to structure is that we set up an EC2 instance that runs a single node. To make this work with ECS we need to apply a few things.

  1. The EC2 instance needs a role with the permissions to allow it to work with ECS.
  2. The EC2 instances need to have some configuration applied with the user data field. This is a small script that sets a variable for the ECS network in a config.
  3. The EC2 instance needs to use an AWS AMI image.
  4. This all needs to share a security group to allow connections within the network. Not important for a single instance.
  5. This needs to share a VPC to allow connections between each other.

tegefaulkes commented 1 year ago

New epic created at MatrixAI/Polykey#485 added to this one.

tegefaulkes commented 1 year ago

I'm going to create a new issue to track the Polykey infrastructure changes. Just so I don't bloat this issue with the discussion.

CMCDragonkai commented 1 year ago

There are 3 additional things to be accomplished here:

  1. Multi node resolution - given a set of NodeIDs, a set of IPs, resolve all of them together, and give back a record of NodeID to connection. MatrixAI/Polykey#483
  2. DNS resolution to all A records and AAAA records. Basically get all the IPs in one bunch as all candidate IPs for a given NodeId (see the sketch after this list). MatrixAI/Polykey#484
  3. NG key path change, so that we can have this structure BucketIndex/NodeId/Host/Port -> null. MatrixAI/Polykey#482
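Purely illustrative sketch of the key path shape in point 3 (the padding and encoding choices are assumptions, not the actual NodeGraph schema):

```
// Each entry keys on bucket index, NodeId, host, and port, with a null value.
type NodeGraphKeyPath = [bucketIndex: string, nodeId: string, host: string, port: string];

function nodeGraphKeyPath(
  bucketIndex: number,
  nodeId: string,
  host: string,
  port: number,
): NodeGraphKeyPath {
  // Zero-pad the bucket index so keys sort lexicographically (0..255 buckets)
  return [bucketIndex.toString().padStart(3, '0'), nodeId, host, port.toString()];
}
```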

CMCDragonkai commented 1 year ago

Manual testing starts this week. Using tailscale to help test home NAT routers.

CMCDragonkai commented 1 year ago

Upon the merge of https://gitlab.com/MatrixAI/Engineering/Polykey/Polykey-Infrastructure/-/merge_requests/2, we will begin testing.

Let's write down a manual test specification. Upon each setup, please provide a screenshot accordingly. For command executions, use asciinema.


Test spec moved to MatrixAI/Polykey#487.

tegefaulkes commented 1 year ago

Fixing the Random Process Termination

Cover all our bases by having unique exit codes for:

  1. Unhandled Rejection
  2. Uncaught Exception
  3. Promise Deadlock
  4. Explicit shutdown

Log out the exit codes, and replicate the random failure by fuzz testing a local seed node while simulating conditions on the EC2, by running a local docker container image.

Identifying the correct exit code should be able to tell us what is happening.
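A minimal sketch of wiring up distinct exit codes (the code values and the deadlock heuristic are hypothetical, not what Polykey actually uses):

```
// Sketch only: exit code values are illustrative.
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection:', reason);
  process.exit(70);
});

process.on('uncaughtException', (err) => {
  console.error('Uncaught exception:', err);
  process.exit(71);
});

// A "promise deadlock" surfaces as the event loop draining while work is still
// logically pending; 'beforeExit' fires in that case (but not after process.exit()).
// A real implementation would first check whether shutdown was actually requested.
process.on('beforeExit', () => {
  console.error('Event loop drained without an explicit shutdown');
  process.exit(72);
});

// Explicit shutdown would call process.exit(0) at the end of the stop path.
```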

ALSO check that we are not being killed by something else, like the OOM killer on the operating system.

Otherwise, if we cannot get anything useful, we will use strace on the entire Node.js process, run it there, and try to trigger a crash.

Alternatively we use https://rr-project.org/ and see https://fitzgeraldnick.com/2015/11/02/back-to-the-futurre.html

tegefaulkes commented 1 year ago

Not all, but most of the NAT tests are passing again.

CMCDragonkai commented 1 year ago

So it appears upon deploying the new image, we do get a successful connection between office (double NAT) to the seed node. No connection timeouts. So at this point we can assume that connections from home (single NAT) to seed node should also work.

@tegefaulkes is merging the fixes for aborting the connections into the staging branch. The default timeouts have also been changed. Summary list (values are in milliseconds):

  1. The default timeout for connection start is 2000
  2. The default interval for punch is 50
  3. The default interval for keepalive is 1000
  4. The default interval for keepalive timeout is 20000
  5. The node connection TTL is 60000

Aborting with 0 should now immediately abort the connection without starting to do anything for UTP.

Doing errorP.then() can cause unhandled rejections. We are not supposed to use these, and if you do, an explicit catch is necessary.
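For illustration, the failure mode and the fix look roughly like this:

```
// A promise that is expected to reject, e.g. signalling a connection error.
const errorP: Promise<never> = Promise.reject(new Error('connection aborted'));

// Anti-pattern: .then() without a rejection handler derives a new promise whose
// rejection nobody handles, which triggers an 'unhandledRejection'.
errorP.then(() => {
  /* never runs */
});

// An explicit catch on the chain keeps the rejection handled.
errorP
  .then(() => {
    /* never runs */
  })
  .catch(() => {
    /* expected error; log or ignore */
  });
```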

When aborting a connection, all resources of the UTP must be destroyed to avoid underlying side effects. Also destroy the TLS socket and any registered handlers.

So our plan for the rest of today is to consolidate our manual tests into automatic integration tests:

These 2 tests should be done as part of MatrixAI/Polykey#441 PR. The PR may have existing code that doesn't make sense to be used, in that case they should be sliced out, and pasted into the PR for archiving until we can do a proper multi-node simulation. Then MatrixAI/Polykey#441 PR should be merged into staging.

For tomorrow, we will complete the manual testing involving home to office. @tegefaulkes ensure that your home network is 1 NAT, not double NAT in this case. The VM network must be a bridge. Or better yet, use the laptop, not your VM to do this. We'll confirm both cases are working.

At this point manual testing will be done. We will provide evidence of all of this in MatrixAI/Polykey#487.


The next step is finishing the automation of our multinode setup. This means:

  1. setting up the recovery code pool
  2. fixing the EC2 idempotency code in the infrastructure
  3. setting up multi node attributes and placement constraints
  4. fixing multi-node DNS resolution
  5. adding multi-node resolver into the node graph

With respect to random failures. This requires monitoring the situation on ECS, and then reproducing it locally by fuzzing a docker image locally. That can be a new bug.


There are some additional tasks that can be pushed into staging that we are discovering as we are doing the manual testing.


By Friday, we must update the CI deployment, and the CI should be fully passing (including any of the local NAT tests).

CMCDragonkai commented 1 year ago

In terms of investigating the random failures.

image

The above last message correlates to state change events, memory utilisation, and a CPU spike:

image

There are several connection forwards being repeatedly started, but there is no ENDING message? No failure message? No timeout message???

CMCDragonkai commented 1 year ago

So problems:

  1. Why is it repeating a connection forward every 20 seconds? I thought we changed the timeout; if we changed the timeout, it shouldn't take another 20 seconds to try to connect.
  2. There should always be "closing message" for any "starting message" whether successful or failure.

CMCDragonkai commented 1 year ago

When you have finished running your automated test, you must use testsUtils.processExit to wait for your clients to gracefully exit.

Subsequently on the seed node we are seeing this:

image

Right now there are no logs telling us WHAT is causing the stopping of the connection reverse. Most likely this is a legitimate stop. We should be adding this.logger.info to all the places that call ConnectionReverse.stop inside ConnectionReverse, so we can identify the "cause of the stop".

Furthermore, there is a ErrorConnectionEndTimeout. This error is expected due to UTP native having a flaky ending system. I encountered this before, so it shouldn't be a problem.

CMCDragonkai commented 1 year ago

There is a spurious IP address that exists in the NodeGraph:

image

It is not @tegefaulkes' home IP, nor my IP, nor the seed node IP. And the node was restarted at 3pm. So how can this IP exist?

Can you restart the agent completely fresh, start the test again, and see if you are getting these weird IPs?

tegefaulkes commented 1 year ago

Logs from the recent test.

``` INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Creating PolykeyAgent"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Setting umask to 077"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Setting node path to /tmp/polykey-test-5OC8ZH/polykeyB"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Creating PolykeyAgent"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Setting umask to 077"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Setting node path to /tmp/polykey-test-5OC8ZH/polykeyA"} INFO:testnet connection test:{"level":"INFO","key":"Status","msg":"Starting Status"} INFO:testnet connection test:{"level":"INFO","key":"Status","msg":"Starting Status"} INFO:testnet connection test:{"level":"INFO","key":"Status","msg":"Writing Status to /tmp/polykey-test-5OC8ZH/polykeyB/status.json"} INFO:testnet connection test:{"level":"INFO","key":"Status","msg":"Writing Status to /tmp/polykey-test-5OC8ZH/polykeyA/status.json"} INFO:testnet connection test:{"level":"INFO","key":"Status","msg":"Status is STARTING"} INFO:testnet connection test:{"level":"INFO","key":"Schema","msg":"Creating Schema"} INFO:testnet connection test:{"level":"INFO","key":"Status","msg":"Status is STARTING"} INFO:testnet connection test:{"level":"INFO","key":"Schema","msg":"Starting Schema"} INFO:testnet connection test:{"level":"INFO","key":"Schema","msg":"Setting state path to /tmp/polykey-test-5OC8ZH/polykeyB/state"} INFO:testnet connection test:{"level":"INFO","key":"Schema","msg":"Creating Schema"} INFO:testnet connection test:{"level":"INFO","key":"Schema","msg":"Starting Schema"} INFO:testnet connection test:{"level":"INFO","key":"Schema","msg":"Setting state path to /tmp/polykey-test-5OC8ZH/polykeyA/state"} INFO:testnet connection test:{"level":"INFO","key":"Schema","msg":"Started Schema"} INFO:testnet connection test:{"level":"INFO","key":"Schema","msg":"Created Schema"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Creating KeyManager"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Setting keys path to /tmp/polykey-test-5OC8ZH/polykeyB/state/keys"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Starting KeyManager"} INFO:testnet connection test:{"level":"INFO","key":"Schema","msg":"Started Schema"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Checking /tmp/polykey-test-5OC8ZH/polykeyB/state/keys/root.pub and /tmp/polykey-test-5OC8ZH/polykeyB/state/keys/root.key"} INFO:testnet connection test:{"level":"INFO","key":"Schema","msg":"Created Schema"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Creating KeyManager"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Setting keys path to /tmp/polykey-test-5OC8ZH/polykeyA/state/keys"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Using provided root key pair"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Starting KeyManager"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Checking /tmp/polykey-test-5OC8ZH/polykeyA/state/keys/root.pub and /tmp/polykey-test-5OC8ZH/polykeyA/state/keys/root.key"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Using provided root key pair"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Writing 
/tmp/polykey-test-5OC8ZH/polykeyA/state/keys/root.pub and /tmp/polykey-test-5OC8ZH/polykeyA/state/keys/root.key"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Checking /tmp/polykey-test-5OC8ZH/polykeyA/state/keys/root.crt"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Generating root certificate"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Writing /tmp/polykey-test-5OC8ZH/polykeyB/state/keys/root.pub and /tmp/polykey-test-5OC8ZH/polykeyB/state/keys/root.key"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Checking /tmp/polykey-test-5OC8ZH/polykeyB/state/keys/root.crt"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Generating root certificate"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Writing /tmp/polykey-test-5OC8ZH/polykeyB/state/keys/root.crt"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Checking /tmp/polykey-test-5OC8ZH/polykeyB/state/keys/db.key"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Generating keys db key"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Writing /tmp/polykey-test-5OC8ZH/polykeyA/state/keys/root.crt"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Checking /tmp/polykey-test-5OC8ZH/polykeyA/state/keys/db.key"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Generating keys db key"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Writing /tmp/polykey-test-5OC8ZH/polykeyB/state/keys/db.key"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Started KeyManager"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Created KeyManager"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Writing /tmp/polykey-test-5OC8ZH/polykeyA/state/keys/db.key"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Creating DB"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Started KeyManager"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Starting DB"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Setting DB path to /tmp/polykey-test-5OC8ZH/polykeyB/state/db"} INFO:testnet connection test:{"level":"INFO","key":"KeyManager","msg":"Created KeyManager"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Creating DB"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Starting DB"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Setting DB path to /tmp/polykey-test-5OC8ZH/polykeyA/state/db"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Started DB"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Created DB"} INFO:testnet connection test:{"level":"INFO","key":"IdentitiesManager","msg":"Creating IdentitiesManager"} INFO:testnet connection test:{"level":"INFO","key":"IdentitiesManager","msg":"Starting IdentitiesManager"} INFO:testnet connection test:{"level":"INFO","key":"IdentitiesManager","msg":"Started IdentitiesManager"} INFO:testnet connection test:{"level":"INFO","key":"IdentitiesManager","msg":"Created IdentitiesManager"} INFO:testnet connection test:{"level":"INFO","key":"Sigchain","msg":"Creating Sigchain"} INFO:testnet connection test:{"level":"INFO","key":"Sigchain","msg":"Starting Sigchain"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Started DB"} INFO:testnet 
connection test:{"level":"INFO","key":"DB","msg":"Created DB"} INFO:testnet connection test:{"level":"INFO","key":"IdentitiesManager","msg":"Creating IdentitiesManager"} INFO:testnet connection test:{"level":"INFO","key":"Sigchain","msg":"Started Sigchain"} INFO:testnet connection test:{"level":"INFO","key":"Sigchain","msg":"Created Sigchain"} INFO:testnet connection test:{"level":"INFO","key":"IdentitiesManager","msg":"Starting IdentitiesManager"} INFO:testnet connection test:{"level":"INFO","key":"IdentitiesManager","msg":"Started IdentitiesManager"} INFO:testnet connection test:{"level":"INFO","key":"IdentitiesManager","msg":"Created IdentitiesManager"} INFO:testnet connection test:{"level":"INFO","key":"ACL","msg":"Creating ACL"} INFO:testnet connection test:{"level":"INFO","key":"ACL","msg":"Starting ACL"} INFO:testnet connection test:{"level":"INFO","key":"Sigchain","msg":"Creating Sigchain"} INFO:testnet connection test:{"level":"INFO","key":"ACL","msg":"Started ACL"} INFO:testnet connection test:{"level":"INFO","key":"ACL","msg":"Created ACL"} INFO:testnet connection test:{"level":"INFO","key":"GestaltGraph","msg":"Creating GestaltGraph"} INFO:testnet connection test:{"level":"INFO","key":"Sigchain","msg":"Starting Sigchain"} INFO:testnet connection test:{"level":"INFO","key":"GestaltGraph","msg":"Starting GestaltGraph"} INFO:testnet connection test:{"level":"INFO","key":"GestaltGraph","msg":"Started GestaltGraph"} INFO:testnet connection test:{"level":"INFO","key":"GestaltGraph","msg":"Created GestaltGraph"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Creating Proxy"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Created Proxy"} INFO:testnet connection test:{"level":"INFO","key":"NodeGraph","msg":"Creating NodeGraph"} INFO:testnet connection test:{"level":"INFO","key":"NodeGraph","msg":"Starting NodeGraph"} INFO:testnet connection test:{"level":"INFO","key":"NodeGraph","msg":"Started NodeGraph"} INFO:testnet connection test:{"level":"INFO","key":"NodeGraph","msg":"Created NodeGraph"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Creating TaskManager"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Starting TaskManager in Lazy Mode"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Begin Tasks Repair"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Finish Tasks Repair"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Started TaskManager"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Created TaskManager"} INFO:testnet connection test:{"level":"INFO","key":"NodeManager","msg":"Starting NodeManager"} INFO:testnet connection test:{"level":"INFO","key":"NodeManager","msg":"Registering handler for setNode"} INFO:testnet connection test:{"level":"INFO","key":"NodeManager","msg":"Setting up refreshBucket tasks"} INFO:testnet connection test:{"level":"INFO","key":"Sigchain","msg":"Started Sigchain"} INFO:testnet connection test:{"level":"INFO","key":"Sigchain","msg":"Created Sigchain"} INFO:testnet connection test:{"level":"INFO","key":"ACL","msg":"Creating ACL"} INFO:testnet connection test:{"level":"INFO","key":"ACL","msg":"Starting ACL"} INFO:testnet connection test:{"level":"INFO","key":"ACL","msg":"Started ACL"} INFO:testnet connection test:{"level":"INFO","key":"ACL","msg":"Created ACL"} INFO:testnet connection test:{"level":"INFO","key":"GestaltGraph","msg":"Creating GestaltGraph"} 
INFO:testnet connection test:{"level":"INFO","key":"GestaltGraph","msg":"Starting GestaltGraph"} INFO:testnet connection test:{"level":"INFO","key":"GestaltGraph","msg":"Started GestaltGraph"} INFO:testnet connection test:{"level":"INFO","key":"GestaltGraph","msg":"Created GestaltGraph"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Creating Proxy"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Created Proxy"} INFO:testnet connection test:{"level":"INFO","key":"NodeGraph","msg":"Creating NodeGraph"} INFO:testnet connection test:{"level":"INFO","key":"NodeGraph","msg":"Starting NodeGraph"} INFO:testnet connection test:{"level":"INFO","key":"NodeGraph","msg":"Started NodeGraph"} INFO:testnet connection test:{"level":"INFO","key":"NodeGraph","msg":"Created NodeGraph"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Creating TaskManager"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Starting TaskManager in Lazy Mode"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Begin Tasks Repair"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Finish Tasks Repair"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Started TaskManager"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Created TaskManager"} INFO:testnet connection test:{"level":"INFO","key":"NodeManager","msg":"Starting NodeManager"} INFO:testnet connection test:{"level":"INFO","key":"NodeManager","msg":"Registering handler for setNode"} INFO:testnet connection test:{"level":"INFO","key":"NodeManager","msg":"Setting up refreshBucket tasks"} INFO:testnet connection test:{"level":"INFO","key":"NodeManager","msg":"Set up refreshBucket tasks"} INFO:testnet connection test:{"level":"INFO","key":"NodeManager","msg":"Started NodeManager"} INFO:testnet connection test:{"level":"INFO","key":"Discovery","msg":"Creating Discovery"} INFO:testnet connection test:{"level":"INFO","key":"Discovery","msg":"Starting Discovery"} INFO:testnet connection test:{"level":"INFO","key":"Discovery","msg":"Started Discovery"} INFO:testnet connection test:{"level":"INFO","key":"Discovery","msg":"Created Discovery"} INFO:testnet connection test:{"level":"INFO","key":"NotificationsManager","msg":"Creating NotificationsManager"} INFO:testnet connection test:{"level":"INFO","key":"NotificationsManager","msg":"Starting NotificationsManager"} INFO:testnet connection test:{"level":"INFO","key":"NotificationsManager","msg":"Started NotificationsManager"} INFO:testnet connection test:{"level":"INFO","key":"NotificationsManager","msg":"Created NotificationsManager"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Creating VaultManager"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Setting vaults path to /tmp/polykey-test-5OC8ZH/polykeyB/state/vaults"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Starting VaultManager"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Generating vaults key"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Creating DB"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Starting DB"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Setting DB path to /tmp/polykey-test-5OC8ZH/polykeyB/state/vaults/efs"} INFO:testnet connection test:{"level":"INFO","key":"NodeManager","msg":"Set up refreshBucket tasks"} INFO:testnet 
connection test:{"level":"INFO","key":"NodeManager","msg":"Started NodeManager"} INFO:testnet connection test:{"level":"INFO","key":"Discovery","msg":"Creating Discovery"} INFO:testnet connection test:{"level":"INFO","key":"Discovery","msg":"Starting Discovery"} INFO:testnet connection test:{"level":"INFO","key":"Discovery","msg":"Started Discovery"} INFO:testnet connection test:{"level":"INFO","key":"Discovery","msg":"Created Discovery"} INFO:testnet connection test:{"level":"INFO","key":"NotificationsManager","msg":"Creating NotificationsManager"} INFO:testnet connection test:{"level":"INFO","key":"NotificationsManager","msg":"Starting NotificationsManager"} INFO:testnet connection test:{"level":"INFO","key":"NotificationsManager","msg":"Started NotificationsManager"} INFO:testnet connection test:{"level":"INFO","key":"NotificationsManager","msg":"Created NotificationsManager"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Creating VaultManager"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Setting vaults path to /tmp/polykey-test-5OC8ZH/polykeyA/state/vaults"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Starting VaultManager"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Generating vaults key"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Creating DB"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Starting DB"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Setting DB path to /tmp/polykey-test-5OC8ZH/polykeyA/state/vaults/efs"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Started DB"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Created DB"} INFO:testnet connection test:{"level":"INFO","key":"INodeManager","msg":"Creating INodeManager"} INFO:testnet connection test:{"level":"INFO","key":"INodeManager","msg":"Starting INodeManager"} INFO:testnet connection test:{"level":"INFO","key":"INodeManager","msg":"Started INodeManager"} INFO:testnet connection test:{"level":"INFO","key":"INodeManager","msg":"Created INodeManager"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Started DB"} INFO:testnet connection test:{"level":"INFO","key":"DB","msg":"Created DB"} INFO:testnet connection test:{"level":"INFO","key":"INodeManager","msg":"Creating INodeManager"} INFO:testnet connection test:{"level":"INFO","key":"INodeManager","msg":"Starting INodeManager"} INFO:testnet connection test:{"level":"INFO","key":"INodeManager","msg":"Started INodeManager"} INFO:testnet connection test:{"level":"INFO","key":"INodeManager","msg":"Created INodeManager"} INFO:testnet connection test:{"level":"INFO","key":"EncryptedFileSystem","msg":"Starting EncryptedFS"} INFO:testnet connection test:{"level":"INFO","key":"EncryptedFileSystem","msg":"Started EncryptedFS"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Started VaultManager"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Created VaultManager"} INFO:testnet connection test:{"level":"INFO","key":"SessionManager","msg":"Creating SessionManager"} INFO:testnet connection test:{"level":"INFO","key":"SessionManager","msg":"Starting SessionManager"} INFO:testnet connection test:{"level":"INFO","key":"SessionManager","msg":"Generating sessions key"} INFO:testnet connection test:{"level":"INFO","key":"SessionManager","msg":"Started SessionManager"} INFO:testnet connection 
test:{"level":"INFO","key":"SessionManager","msg":"Created SessionManager"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Starting PolykeyAgent"} INFO:testnet connection test:{"level":"INFO","key":"GRPCServerClient","msg":"Starting GRPCServer on 127.0.0.1:0"} INFO:testnet connection test:{"level":"INFO","key":"EncryptedFileSystem","msg":"Starting EncryptedFS"} INFO:testnet connection test:{"level":"INFO","key":"EncryptedFileSystem","msg":"Started EncryptedFS"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Started VaultManager"} INFO:testnet connection test:{"level":"INFO","key":"VaultManager","msg":"Created VaultManager"} INFO:testnet connection test:{"level":"INFO","key":"SessionManager","msg":"Creating SessionManager"} INFO:testnet connection test:{"level":"INFO","key":"SessionManager","msg":"Starting SessionManager"} INFO:testnet connection test:{"level":"INFO","key":"SessionManager","msg":"Generating sessions key"} INFO:testnet connection test:{"level":"INFO","key":"SessionManager","msg":"Started SessionManager"} INFO:testnet connection test:{"level":"INFO","key":"SessionManager","msg":"Created SessionManager"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Starting PolykeyAgent"} INFO:testnet connection test:{"level":"INFO","key":"GRPCServerClient","msg":"Started GRPCServer on 127.0.0.1:42975"} INFO:testnet connection test:{"level":"INFO","key":"GRPCServerAgent","msg":"Starting GRPCServer on 127.0.0.1:0"} INFO:testnet connection test:{"level":"INFO","key":"GRPCServerAgent","msg":"Started GRPCServer on 127.0.0.1:41157"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Starting Forward Proxy from 127.0.0.1:0 to 0.0.0.0:0 and Reverse Proxy from 0.0.0.0:0 to 127.0.0.1:41157"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Started Forward Proxy from 127.0.0.1:41895 to 0.0.0.0:50776 and Reverse Proxy from 0.0.0.0:50776 to 127.0.0.1:41157"} INFO:testnet connection test:{"level":"INFO","key":"NodeConnectionManager","msg":"Starting NodeConnectionManager"} INFO:testnet connection test:{"level":"INFO","key":"GRPCServerClient","msg":"Starting GRPCServer on 127.0.0.1:0"} INFO:testnet connection test:{"level":"INFO","key":"NodeConnectionManager","msg":"Started NodeConnectionManager"} INFO:testnet connection test:{"level":"INFO","key":"NodeManager","msg":"Syncing nodeGraph"} INFO:testnet connection test:{"level":"INFO","key":"GRPCServerClient","msg":"Started GRPCServer on 127.0.0.1:44103"} INFO:testnet connection test:{"level":"INFO","key":"GRPCServerAgent","msg":"Starting GRPCServer on 127.0.0.1:0"} INFO:testnet connection test:{"level":"INFO","key":"GRPCServerAgent","msg":"Started GRPCServer on 127.0.0.1:32849"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Starting Forward Proxy from 127.0.0.1:0 to 0.0.0.0:0 and Reverse Proxy from 0.0.0.0:0 to 127.0.0.1:32849"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Started Forward Proxy from 127.0.0.1:42959 to 0.0.0.0:58593 and Reverse Proxy from 0.0.0.0:58593 to 127.0.0.1:32849"} INFO:testnet connection test:{"level":"INFO","key":"NodeConnectionManager","msg":"Starting NodeConnectionManager"} INFO:testnet connection test:{"level":"INFO","key":"NodeConnectionManager","msg":"Started NodeConnectionManager"} INFO:testnet connection test:{"level":"INFO","key":"NodeManager","msg":"Syncing nodeGraph"} INFO:testnet connection test:{"level":"INFO","key":"ConnectionForward 3.106.178.29:1314","msg":"Starting 
Connection Forward"} INFO:testnet connection test:{"level":"INFO","key":"ConnectionForward 3.106.178.29:1314","msg":"Starting Connection Forward"} INFO:testnet connection test:{"level":"INFO","key":"ConnectionForward 3.106.178.29:1314","msg":"Started Connection Forward"} INFO:testnet connection test:{"level":"INFO","key":"NodeConnection 3.106.178.29:1314","msg":"Creating NodeConnection"} INFO:testnet connection test:{"level":"INFO","key":"clientFactory","msg":"Creating GRPCClientAgent connecting to 3.106.178.29:1314"} INFO:testnet connection test:{"level":"INFO","key":"ConnectionForward 3.106.178.29:1314","msg":"Started Connection Forward"} INFO:testnet connection test:{"level":"INFO","key":"NodeConnection 3.106.178.29:1314","msg":"Creating NodeConnection"} INFO:testnet connection test:{"level":"INFO","key":"clientFactory","msg":"Creating GRPCClientAgent connecting to 3.106.178.29:1314"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Handling CONNECT to 3.106.178.29:1314"} INFO:testnet connection test:{"level":"INFO","key":"ConnectionForward 3.106.178.29:1314","msg":"Composing Connection Forward"} INFO:testnet connection test:{"level":"INFO","key":"ConnectionForward 3.106.178.29:1314","msg":"Composed Connection Forward"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Handled CONNECT to 3.106.178.29:1314"} INFO:testnet connection test:{"level":"INFO","key":"clientFactory","msg":"Created GRPCClientAgent connecting to 3.106.178.29:1314"} INFO:testnet connection test:{"level":"INFO","key":"NodeConnection 3.106.178.29:1314","msg":"Created NodeConnection"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Handling CONNECT to 3.106.178.29:1314"} INFO:testnet connection test:{"level":"INFO","key":"ConnectionForward 3.106.178.29:1314","msg":"Composing Connection Forward"} INFO:testnet connection test:{"level":"INFO","key":"ConnectionForward 3.106.178.29:1314","msg":"Composed Connection Forward"} INFO:testnet connection test:{"level":"INFO","key":"Proxy","msg":"Handled CONNECT to 3.106.178.29:1314"} INFO:testnet connection test:{"level":"INFO","key":"clientFactory","msg":"Created GRPCClientAgent connecting to 3.106.178.29:1314"} INFO:testnet connection test:{"level":"INFO","key":"NodeConnection 3.106.178.29:1314","msg":"Created NodeConnection"} INFO:testnet connection test:{"level":"INFO","key":"scheduler","msg":"Starting Scheduling Loop"} INFO:testnet connection test:{"level":"INFO","key":"scheduler","msg":"Starting Scheduling Loop"} INFO:testnet connection test:{"level":"INFO","key":"scheduler","msg":"Started Scheduling Loop"} INFO:testnet connection test:{"level":"INFO","key":"scheduler","msg":"Started Scheduling Loop"} INFO:testnet connection test:{"level":"INFO","key":"queue","msg":"Starting Queueing Loop"} INFO:testnet connection test:{"level":"INFO","key":"queue","msg":"Started Queueing Loop"} INFO:testnet connection test:{"level":"INFO","key":"queue","msg":"Starting Queueing Loop"} INFO:testnet connection test:{"level":"INFO","key":"queue","msg":"Started Queueing Loop"} INFO:testnet connection test:{"level":"INFO","key":"Status","msg":"Finish Status STARTING"} INFO:testnet connection test:{"level":"INFO","key":"Status","msg":"Writing Status to /tmp/polykey-test-5OC8ZH/polykeyA/status.json"} INFO:testnet connection test:{"level":"INFO","key":"Status","msg":"Finish Status STARTING"} INFO:testnet connection test:{"level":"INFO","key":"Status","msg":"Writing Status to /tmp/polykey-test-5OC8ZH/polykeyB/status.json"} INFO:testnet connection 
test:{"level":"INFO","key":"Status","msg":"Status is LIVE"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Started PolykeyAgent"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Created PolykeyAgent"} INFO:testnet connection test:{"level":"INFO","key":"Status","msg":"Status is LIVE"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Started PolykeyAgent"} INFO:testnet connection test:{"level":"INFO","key":"PolykeyAgent","msg":"Created PolykeyAgent"} INFO:testnet connection test:{"pid":993645,"nodeId":"vqvjb790egq76hpd2k74tsckpgkp9rpjrh5s8ecvun5q548v4ccu0","clientHost":"127.0.0.1","clientPort":44103,"agentHost":"127.0.0.1","agentPort":32849,"proxyHost":"0.0.0.0","proxyPort":58593,"forwardHost":"127.0.0.1","forwardPort":42959} INFO:testnet connection test:{"pid":993647,"nodeId":"vueouai046p1g0kej0hrlk9i2m635skiktptj0line40iqs7fijtg","clientHost":"127.0.0.1","clientPort":42975,"agentHost":"127.0.0.1","agentPort":41157,"proxyHost":"0.0.0.0","proxyPort":50776,"forwardHost":"127.0.0.1","forwardPort":41895} console.log vqvjb790egq76hpd2k74tsckpgkp9rpjrh5s8ecvun5q548v4ccu0 at Object. (tests/integration/testnet/testnetConnection.test.ts:149:15) console.log vueouai046p1g0kej0hrlk9i2m635skiktptj0line40iqs7fijtg at Object. (tests/integration/testnet/testnetConnection.test.ts:150:15) // Pinging NodeB from NodeA is failing. Error: expect(received).toBe(expected) // Object.is equality Expected: 0 Received: 1 ```

CMCDragonkai commented 1 year ago

Next problem, in your 2 nodes test to the seed node: after we restarted the seed node, the spurious IP is no longer there. We are seeing a successful composition of the 2 connections from the 2 nodes:

image

At this point there is also no ErrorConnectionEndTimeout either.

So connections to the seed node work.

However there are no logs for the signalling behaviour that the seed node is meant to do for the 2 nodes.

@tegefaulkes you must add in some new logs for the signalling behaviour in the seed node, probably by putting it into the GRPC service handler that performs this task. This handler should be called sendSignalingMessage. We should also be observing the response data in your tests as well. You want to log out what is occurring in the client side....

So the client side should be checking the response of the sendSignalingMessage call. If the response is negative, that should mean the seed node rejected this attempt to signal. Log this out as well.

CMCDragonkai commented 1 year ago

In the decentralised NAT discussion MatrixAI/Polykey#365, we realised that we needed to maintain some connection liveness for the NAT busting to work. That was something like 6 random connections.

In this situation with centralised seed nodes, we just need to maintain connections to the seed nodes without TTLs.

image

This means the node connection TTLs must not apply to the nodes designated as the seed nodes. Once we start a node, and it connects to the testnet, it maintains that connection to the testnet forever!!! Figure out the distinction between node connection and proxy connection.

Node connection TTL is based on RPC traffic. Proxy connection TTL is based on keep alive. So what we really need is to maintain the proxy connection, which should happen via the keep alive. This will ensure that the external ports in the NodeGraph are always eventually consistent and should converge quickly.

The signaling process just sets up the bidirectional hole punching. As soon as A1 and A2 are sending packets to each other, the signaling is not involved anymore. They just need to send packets to each other and the hole punching to work (up to and including endpoint-independent proxy-restricted NAT).

So @tegefaulkes you need to differentiate the logs between A1 and A2, give them different prefixes, and then observe for the concurrent connection forwards to each other.

I think we should also change the default logger format for PK to output keys and not key. This will be more useful for us.

CMCDragonkai commented 1 year ago

image

It's important to remember that we are sending UDP packets. In such a case, port preservation is not required, and hole punching should work.

Of course symmetric NAT will still be unpunchable. But that's not going to be the problem here with a home router.

CMCDragonkai commented 1 year ago

As soon as the 2 nodes to testnet test is working (that means the signaling process is working), I suspect the home to office manual test should also work. So this is the next priority after you've merged that PR.

CMCDragonkai commented 1 year ago

Reference from libp2p's implementation that they only did recently: https://blog.ipfs.tech/2022-01-20-libp2p-hole-punching/

tegefaulkes commented 1 year ago

MatrixAI/Polykey#441 has been merged.

CMCDragonkai commented 1 year ago

I've been running the agent since yesterday, over 24 hours. No crash atm. The task manager and related things have been working fine. So we'll push back the analysis of why it's crashing up on ECS to the last set of bugs to look into.

CMCDragonkai commented 1 year ago

Some things to change for the NodeConnectionManager.relaySignallingMessage:

  1. The method should be taking the sourceAddress. This is the address that is to be acquired by the connectionInfoGetter at the GRPC handler level.
  2. This method should take the sourceAddress along with the existing message and send a new signal message to the next node.
  3. The GRPC handler for this relaying operation should be checking if the message is directed towards itself. If it isn't, it acquires the source address and passes it to the NodeConnectionManager.relaySignallingMessage. If it is, then it uses the message's source address properties, and if that doesn't exist, it will throw an exception which should bubble back to the sender. Once it has the message's source address properties, it will hole punch back there.

The reason we use connectionInfoGetter instead of taking it from the NG is because updating the NG with the new source address for a given incoming connection may be asynchronous. So by using the connectionInfoGetter, we are always using the most up-to-date information to relay it.
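A rough sketch of that handler flow — every name here (the relaySignallingMessage signature, connectionInfoGetter, holePunch, the message fields) is a hypothetical stand-in, not the actual GRPC API:

```
interface SignallingMessage {
  targetNodeId: string;
  sourceHost?: string;
  sourcePort?: number;
}

declare const localNodeId: string;
declare const nodeConnectionManager: {
  relaySignallingMessage(
    sourceAddress: { host: string; port: number },
    message: SignallingMessage,
  ): Promise<void>;
};
declare function holePunch(host: string, port: number): Promise<void>;
declare function connectionInfoGetter(): { remoteHost: string; remotePort: number };

async function handleSignallingMessage(message: SignallingMessage): Promise<void> {
  if (message.targetNodeId !== localNodeId) {
    // Not directed at us: attach the freshest source address and relay onwards.
    const info = connectionInfoGetter();
    await nodeConnectionManager.relaySignallingMessage(
      { host: info.remoteHost, port: info.remotePort },
      message,
    );
    return;
  }
  // Directed at us: punch back to the source address carried in the message.
  if (message.sourceHost == null || message.sourcePort == null) {
    // Bubbles back to the sender as an error response.
    throw new Error('Signalling message missing source address');
  }
  await holePunch(message.sourceHost, message.sourcePort);
}
```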

At the same time, the node connection TTL should be disabled or set to a very high amount of time to eliminate this factor in our testing. Subsequently the node connection TTL should be disabled for node connections to seed nodes, or at the very least, the underlying proxy connection to the seed nodes should be maintained forever.

CMCDragonkai commented 1 year ago

As long as we see N1 have a ConnectionForward to N2 and simultaneously N2 have a ConnectionReverse to N1, then we can at the very least say that the signalling operation is successful.

Whether the packets actually go through or not will depend on the CGNAT in our office not acting like a symmetric NAT. So that's a second layer check.

To confirm this target, @tegefaulkes, get evidence of the packets from both ends N1 and N2 in your local Wireshark, and also correlate with the expected logs of ConnectionForward on N1 and ConnectionReverse on N2.

tegefaulkes commented 1 year ago

Ok, so I've gotten everything set up and run the test. Here is a log of the test from the local perspective.

``` console.log Starting Agent1 at Object. (tests/integration/testnet/testnetConnection.test.ts:242:13) console.log Starting Agent2 at Object. (tests/integration/testnet/testnetConnection.test.ts:259:13) TCT.A2:Creating PolykeyAgent TCT.A2:Setting umask to 077 TCT.A2:Setting node path to /tmp/polykey-test-lYd3n2/node2 TCT.A2.Status:Starting Status TCT.A2.Status:Writing Status to /tmp/polykey-test-lYd3n2/node2/status.json TCT.A2.Status:Status is STARTING TCT.A2.Schema:Creating Schema TCT.A2.Schema:Starting Schema TCT.A2.Schema:Setting state path to /tmp/polykey-test-lYd3n2/node2/state TCT.A2.Schema:Started Schema TCT.A2.Schema:Created Schema TCT.A2.KeyManager:Creating KeyManager TCT.A2.KeyManager:Setting keys path to /tmp/polykey-test-lYd3n2/node2/state/keys TCT.A2.KeyManager:Starting KeyManager TCT.A2.KeyManager:Checking /tmp/polykey-test-lYd3n2/node2/state/keys/root.pub and /tmp/polykey-test-lYd3n2/node2/state/keys/root.key TCT.A2.KeyManager:Using provided root key pair TCT.A2.KeyManager:Writing /tmp/polykey-test-lYd3n2/node2/state/keys/root.pub and /tmp/polykey-test-lYd3n2/node2/state/keys/root.key TCT.A2.KeyManager:Checking /tmp/polykey-test-lYd3n2/node2/state/keys/root.crt TCT.A2.KeyManager:Generating root certificate TCT.A2.KeyManager:Writing /tmp/polykey-test-lYd3n2/node2/state/keys/root.crt TCT.A2.KeyManager:Checking /tmp/polykey-test-lYd3n2/node2/state/keys/db.key TCT.A2.KeyManager:Generating keys db key TCT.A2.KeyManager:Writing /tmp/polykey-test-lYd3n2/node2/state/keys/db.key TCT.A2.KeyManager:Started KeyManager TCT.A2.KeyManager:Created KeyManager TCT.A2.DB:Creating DB TCT.A2.DB:Starting DB TCT.A2.DB:Setting DB path to /tmp/polykey-test-lYd3n2/node2/state/db TCT.A2.DB:Started DB TCT.A2.DB:Created DB TCT.A2.IdentitiesManager:Creating IdentitiesManager TCT.A2.IdentitiesManager:Starting IdentitiesManager TCT.A2.IdentitiesManager:Started IdentitiesManager TCT.A2.IdentitiesManager:Created IdentitiesManager TCT.A2.Sigchain:Creating Sigchain TCT.A2.Sigchain:Starting Sigchain TCT.A2.Sigchain:Started Sigchain TCT.A2.Sigchain:Created Sigchain TCT.A2.ACL:Creating ACL TCT.A2.ACL:Starting ACL TCT.A2.ACL:Started ACL TCT.A2.ACL:Created ACL TCT.A2.GestaltGraph:Creating GestaltGraph TCT.A2.GestaltGraph:Starting GestaltGraph TCT.A2.GestaltGraph:Started GestaltGraph TCT.A2.GestaltGraph:Created GestaltGraph TCT.A2.Proxy:Creating Proxy TCT.A2.Proxy:Created Proxy TCT.A2.NodeGraph:Creating NodeGraph TCT.A2.NodeGraph:Starting NodeGraph TCT.A2.NodeGraph:Started NodeGraph TCT.A2.NodeGraph:Created NodeGraph TCT.A2:Creating TaskManager TCT.A2:Starting TaskManager in Lazy Mode TCT.A2:Begin Tasks Repair TCT.A2:Finish Tasks Repair TCT.A2:Started TaskManager TCT.A2:Created TaskManager TCT.A2.NodeManager:Starting NodeManager TCT.A2.NodeManager:Registering handler for setNode TCT.A2.NodeManager:Setting up refreshBucket tasks TCT.A2.NodeManager:Set up refreshBucket tasks TCT.A2.NodeManager:Started NodeManager TCT.A2.Discovery:Creating Discovery TCT.A2.Discovery:Starting Discovery TCT.A2.Discovery:Started Discovery TCT.A2.Discovery:Created Discovery TCT.A2.NotificationsManager:Creating NotificationsManager TCT.A2.NotificationsManager:Starting NotificationsManager TCT.A2.NotificationsManager:Started NotificationsManager TCT.A2.NotificationsManager:Created NotificationsManager TCT.A2.VaultManager:Creating VaultManager TCT.A2.VaultManager:Setting vaults path to /tmp/polykey-test-lYd3n2/node2/state/vaults TCT.A2.VaultManager:Starting VaultManager TCT.A2.VaultManager:Generating vaults key 
TCT.A2.VaultManager.EncryptedFileSystem.DB:Creating DB TCT.A2.VaultManager.EncryptedFileSystem.DB:Starting DB TCT.A2.VaultManager.EncryptedFileSystem.DB:Setting DB path to /tmp/polykey-test-lYd3n2/node2/state/vaults/efs TCT.A2.VaultManager.EncryptedFileSystem.DB:Started DB TCT.A2.VaultManager.EncryptedFileSystem.DB:Created DB TCT.A2.VaultManager.EncryptedFileSystem.INodeManager:Creating INodeManager TCT.A2.VaultManager.EncryptedFileSystem.INodeManager:Starting INodeManager TCT.A2.VaultManager.EncryptedFileSystem.INodeManager:Started INodeManager TCT.A2.VaultManager.EncryptedFileSystem.INodeManager:Created INodeManager TCT.A2.VaultManager.EncryptedFileSystem:Starting EncryptedFS TCT.A2.VaultManager.EncryptedFileSystem:Started EncryptedFS TCT.A2.VaultManager:Started VaultManager TCT.A2.VaultManager:Created VaultManager TCT.A2.SessionManager:Creating SessionManager TCT.A2.SessionManager:Starting SessionManager TCT.A2.SessionManager:Generating sessions key TCT.A2.SessionManager:Started SessionManager TCT.A2.SessionManager:Created SessionManager TCT.A2:Starting PolykeyAgent TCT.A2.GRPCServerClient:Starting GRPCServer on 127.0.0.1:0 TCT.A2.GRPCServerClient:Started GRPCServer on 127.0.0.1:36013 TCT.A2.GRPCServerAgent:Starting GRPCServer on 127.0.0.1:0 TCT.A2.GRPCServerAgent:Started GRPCServer on 127.0.0.1:44277 TCT.A2.Proxy:Starting Forward Proxy from 127.0.0.1:0 to 0.0.0.0:55552 and Reverse Proxy from 0.0.0.0:55552 to 127.0.0.1:44277 TCT.A2.Proxy:Started Forward Proxy from 127.0.0.1:44119 to 0.0.0.0:55552 and Reverse Proxy from 0.0.0.0:55552 to 127.0.0.1:44277 TCT.A2.NodeConnectionManager:Starting NodeConnectionManager TCT.A2.NodeConnectionManager:Started NodeConnectionManager TCT.A2.NodeManager:Syncing nodeGraph TCT.A2.Proxy.ConnectionForward 3.106.178.29:1314:Starting Connection Forward TCT.A2.Proxy.ConnectionForward 3.106.178.29:1314:Started Connection Forward TCT.A2.NodeConnectionManager.NodeConnection 3.106.178.29:1314:Creating NodeConnection TCT.A2.NodeConnectionManager.NodeConnection 3.106.178.29:1314.clientFactory:Creating GRPCClientAgent connecting to 3.106.178.29:1314 TCT.A2.Proxy:Handling CONNECT to 3.106.178.29:1314 TCT.A2.Proxy.ConnectionForward 3.106.178.29:1314:Composing Connection Forward TCT.A2.Proxy.ConnectionForward 3.106.178.29:1314:Composed Connection Forward TCT.A2.Proxy:Handled CONNECT to 3.106.178.29:1314 TCT.A2.NodeConnectionManager.NodeConnection 3.106.178.29:1314.clientFactory:Created GRPCClientAgent connecting to 3.106.178.29:1314 TCT.A2.NodeConnectionManager.NodeConnection 3.106.178.29:1314:Created NodeConnection TCT.A2.scheduler:Starting Scheduling Loop TCT.A2.scheduler:Started Scheduling Loop TCT.A2.queue:Starting Queueing Loop TCT.A2.queue:Started Queueing Loop TCT.A2.Status:Finish Status STARTING TCT.A2.Status:Writing Status to /tmp/polykey-test-lYd3n2/node2/status.json TCT.A2.Status:Status is LIVE TCT.A2:Started PolykeyAgent TCT.A2:Created PolykeyAgent TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Starting Connection Forward TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Failed to start Connection Forward: aborted TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Starting Connection Forward TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Receives tlsSocket close console.log vueouai046p1g0kej0hrlk9i2m635skiktptj0line40iqs7fijtg 0.0.0.0 55551 at Object. (tests/integration/testnet/testnetConnection.test.ts:287:15) console.log vsid6u2smvjjsmm31lrckvsp6fu125fhhajv4795ibgpkj1p2fhmg 0.0.0.0 55552 at Object. 
(tests/integration/testnet/testnetConnection.test.ts:292:15) console.log Attempting ping at Object. (tests/integration/testnet/testnetConnection.test.ts:297:15) TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Failed to start Connection Forward: UTP_ETIMEDOUT TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Starting Connection Forward TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Receives tlsSocket close TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Failed to start Connection Forward: UTP_ETIMEDOUT TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Starting Connection Forward TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Receives tlsSocket close TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Failed to start Connection Forward: UTP_ETIMEDOUT TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Starting Connection Forward TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Receives tlsSocket close TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Failed to start Connection Forward: UTP_ETIMEDOUT TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Starting Connection Forward TCT.A2.Proxy.ConnectionForward 120.17.171.29:4863:Failed to start Connection Forward: aborted console.error ErrorNodeGraphNodeIdNotFound: at constructor_.pingNode (/home/faulkes/matrixCode/polykey/js-polykey/src/nodes/NodeManager.ts:238:13) at processTicksAndRejections (node:internal/process/task_queues:96:5) { timestamp: 2022-10-28T03:33:40.669Z, data: {}, cause: undefined, exitCode: 67 } 300 | ); 301 | } catch (e) { > 302 | console.error(e); | ^ 303 | throw e; 304 | } finally { 305 | logger.setLevel(LogLevel.WARN); at Object. (tests/integration/testnet/testnetConnection.test.ts:302:15) console.log cleaning up at Object. (tests/integration/testnet/testnetConnection.test.ts:306:15) TCT.A1.Proxy.ConnectionForward 3.106.178.29:1314:Forward Error: ErrorConnectionEndTimeout TCT.A2.Proxy.ConnectionForward 3.106.178.29:1314:Forward Error: ErrorConnectionEndTimeout ErrorNodeGraphNodeIdNotFound: at constructor_.pingNode (/home/faulkes/matrixCode/polykey/js-polykey/src/nodes/NodeManager.ts:238:13) at processTicksAndRejections (node:internal/process/task_queues:96:5) ``` Seed node log ```   | 2022-10-28T14:32:10.696+11:00 | {"level":"INFO","key":"Proxy","msg":"Handling connection from 120.17.171.29:4863"}   | 2022-10-28T14:32:10.723+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4863","msg":"Starting Connection Reverse"}   | 2022-10-28T14:32:11.049+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4863","msg":"Started Connection Reverse"}   | 2022-10-28T14:32:11.049+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4863","msg":"Composing Connection Reverse"}   | 2022-10-28T14:32:11.569+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4863","msg":"Composed Connection Reverse"}   | 2022-10-28T14:32:11.571+11:00 | {"level":"INFO","key":"PolykeyAgent","msg":"Reverse connection adding vueouai046p1g0kej0hrlk9i2m635skiktptj0line40iqs7fijtg:120.17.171.29:4863 to NodeGraph"}   | 2022-10-28T14:32:11.725+11:00 | {"level":"INFO","key":"Proxy","msg":"Handled connection from 120.17.171.29:4863"}   | 2022-10-28T14:32:13.567+11:00 | {"level":"INFO","key":"Proxy","msg":"Handling connection from 120.17.171.29:4608"}   | 2022-10-28T14:32:13.568+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4608","msg":"Starting Connection Reverse"}   | 2022-10-28T14:32:13.660+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4608","msg":"Started Connection Reverse"}   | 
2022-10-28T14:32:13.660+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4608","msg":"Composing Connection Reverse"}   | 2022-10-28T14:32:13.845+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4608","msg":"Composed Connection Reverse"}   | 2022-10-28T14:32:13.845+11:00 | {"level":"INFO","key":"PolykeyAgent","msg":"Reverse connection adding vsid6u2smvjjsmm31lrckvsp6fu125fhhajv4795ibgpkj1p2fhmg:120.17.171.29:4608 to NodeGraph"}   | 2022-10-28T14:32:13.886+11:00 | {"level":"INFO","key":"Proxy","msg":"Handled connection from 120.17.171.29:4608"}   | 2022-10-28T14:33:40.899+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4863","msg":"Stopping Connection Reverse"}   | 2022-10-28T14:33:40.943+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4863","msg":"Stopped Connection Reverse"}   | 2022-10-28T14:33:42.130+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4608","msg":"Stopping Connection Reverse"}   | 2022-10-28T14:33:42.189+11:00 | {"level":"INFO","key":"ConnectionReverse 120.17.171.29:4608","msg":"Stopped Connection Reverse"} ```

Based on these results I can tell the following:

  1. Both nodes are connecting to the seed node and their details are being added.
  2. A2 is pinging A1 with the correct details.
  3. The signalling messages are never sent.

Looking at how pingNode is implemented, it works in two stages:

  1. It does a find-node lookup for the node it's after.
  2. Using the details from that lookup, it tries to ping the node; this is the stage where the hole punching is done.

Here it is failing at stage 1, with the error ErrorNodeGraphNodeIdNotFound.
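For reference, here is a minimal TypeScript sketch of that two-stage shape. The names (findNode, holePunchPing, NodeAddress) are illustrative assumptions, not the actual NodeManager API:

```ts
// Illustrative sketch only: `findNode`, `holePunchPing` and the NodeAddress
// shape are assumptions, not the real NodeManager API.
type NodeId = string;
type NodeAddress = { host: string; port: number };

class PingSketch {
  constructor(
    private findNode: (nodeId: NodeId) => Promise<NodeAddress | undefined>,
    private holePunchPing: (address: NodeAddress) => Promise<boolean>,
  ) {}

  async pingNode(nodeId: NodeId): Promise<boolean> {
    // Stage 1: resolve the target's address via the node graph lookup.
    const address = await this.findNode(nodeId);
    if (address == null) {
      // This is where the failure above surfaces as ErrorNodeGraphNodeIdNotFound.
      throw new Error('ErrorNodeGraphNodeIdNotFound');
    }
    // Stage 2: attempt the ping; hole punching (ICE/signalling) happens here.
    return this.holePunchPing(address);
  }
}
```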

CMCDragonkai commented 1 year ago

So the reason is that node information is not added to the NodeGraph until a connection is made. But in this case ICE must be done... and if "the signalling messages are never sent", then it seems ICE is not being done.

The solution may be to refactor ICE so there's only one place in the codebase that performs it, probably in the NodeConnectionManager. Centralise it into a single method there, so that any time there's an attempt to connect to a node we should see an ICE attempt.
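A rough sketch of what centralising ICE into a single NodeConnectionManager method might look like; all names here (connectWithIce, directConnect, sendSignallingMessage) are hypothetical, not the existing API:

```ts
// Hypothetical sketch: a single entry point that always performs ICE
// (direct attempt + signalling via seed nodes) before handing back a connection.
type NodeId = string;
type NodeAddress = { host: string; port: number };

interface Connection {
  destroy(): Promise<void>;
}

class NodeConnectionManagerSketch {
  constructor(
    private directConnect: (address: NodeAddress) => Promise<Connection>,
    private sendSignallingMessage: (target: NodeId) => Promise<void>,
  ) {}

  // All connection attempts funnel through here, so ICE is never skipped.
  async connectWithIce(nodeId: NodeId, address: NodeAddress): Promise<Connection> {
    // Kick off signalling (relay via seed nodes) concurrently with the direct attempt.
    const signalling = this.sendSignallingMessage(nodeId).catch(() => {
      // Signalling failures are non-fatal; the direct attempt may still succeed.
    });
    const connection = await this.directConnect(address);
    await signalling;
    return connection;
  }
}
```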

tegefaulkes commented 1 year ago

There was a bug with the pingNode implementation. When checking if the target was a seed node it was always returning true. As a result it was never sending the signalling packets.
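As an illustration of the shape of that bug (not the actual Polykey code), the guard may have looked something like this:

```ts
// Illustrative only: the actual Polykey code differs, but the bug had this shape.
const seedNodeIds = new Set<string>();

// Buggy version: short-circuits to true, so signalling was always skipped.
function isSeedNodeBuggy(_targetNodeId: string): boolean {
  return true;
}

// Fixed version: only returns true when the target really is a seed node.
function isSeedNode(targetNodeId: string): boolean {
  return seedNodeIds.has(targetNodeId);
}

// Signalling is only skipped when the target is genuinely a seed node.
function shouldSendSignalling(targetNodeId: string): boolean {
  return !isSeedNode(targetNodeId);
}
```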

With that fixed I have some progress. I can see the signalling packets being sent now, and they contain the correct information. Both nodes are sending their ping packets to the right destinations to establish the connection.

824 15:46:28.650004791  192.168.1.103   120.17.171.29   UDP 64  55551 → 4608 Len=20
825 15:46:28.651062640  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
880 15:46:30.215503835  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
915 15:46:31.006846790  192.168.1.103   120.17.171.29   UDP 64  55551 → 4608 Len=20
916 15:46:31.008337907  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1064    15:46:31.824906773  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1082    15:46:31.905726197  192.168.1.103   120.17.171.29   UDP 64  55552 → 4863 Len=20
1083    15:46:31.907199992  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1085    15:46:32.008758907  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1093    15:46:32.825800648  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1097    15:46:32.908174815  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1098    15:46:33.008749977  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1142    15:46:34.009075572  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1163    15:46:34.066630442  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1182    15:46:34.207902946  192.168.1.103   120.17.171.29   UDP 64  55552 → 4863 Len=20
1183    15:46:34.208480616  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1216    15:46:34.422260656  192.168.1.103   120.17.171.29   UDP 64  55551 → 4608 Len=20
1282    15:46:34.846769367  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1294    15:46:35.008680790  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1295    15:46:35.067271965  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1296    15:46:35.208519267  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1310    15:46:35.847162119  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1313    15:46:36.008929042  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1341    15:46:36.208701200  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1350    15:46:36.262666178  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1396    15:46:37.008736366  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1404    15:46:37.022072223  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1407    15:46:37.209452251  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1408    15:46:37.262923128  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1409    15:46:37.406779322  192.168.1.103   120.17.171.29   UDP 64  55552 → 4863 Len=20
1425    15:46:38.009426768  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1426    15:46:38.021560816  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1427    15:46:38.210308059  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1470    15:46:39.009779812  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1482    15:46:39.210128935  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1506    15:46:39.363960266  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1530    15:46:40.010325170  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1531    15:46:40.209512607  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1532    15:46:40.363503296  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1534    15:46:40.463420734  192.168.1.103   120.17.171.29   UDP 64  55551 → 4608 Len=20
1561    15:46:41.010280077  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1562    15:46:41.209574563  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1606    15:46:42.010311873  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1607    15:46:42.209508551  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1624    15:46:43.009986789  192.168.1.103   120.17.171.29   UDP 48  55551 → 4608 Len=4
1625    15:46:43.210297702  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4
1627    15:46:43.439491371  192.168.1.103   120.17.171.29   UDP 64  55552 → 4863 Len=20
1743    15:46:44.210336018  192.168.1.103   120.17.171.29   UDP 48  55552 → 4863 Len=4

So it seems we're doing everything correctly here but the packets are still not getting through. It could be a problem with having both nodes on the local network. If the CGNAT doesn't support hairpinning (routing traffic between two hosts behind it via their external addresses), that would prevent a connection.

CMCDragonkai commented 1 year ago

MatrixAI/Polykey#488 leads us to refactor the TTLs and expiry for NodeConnectionManager and NodeGraph.

See: https://github.com/MatrixAI/Polykey/issues/488#issuecomment-1296542122

But basically until MatrixAI/Polykey#365 is possible, it is necessary to special case the seed nodes:

  1. They must always maintain their connection (specifically the proxy connection). Since closing the node connection's GRPC also closes the proxy connection, the node connection TTL must be disabled for these connections. This is necessary to keep their NAT external port mapping alive (ensuring some address stability), and also to ensure that the seed nodes can contact them back for signalling hole punch messages if they are behind a NAT.
  2. Their NodeId is not allowed to be removed from the NodeGraph. They must always be kept even if they aren't responding. Network entry may need to repeatedly try to maintain connections with these nodes, which means a loop that regularly pings the seed node list is necessary.

MatrixAI/Polykey#365 generalises this process to a random set of X number of nodes in the entire network.
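A sketch of how those two special cases could be expressed, assuming hypothetical policy helpers rather than the real NodeConnectionManager/NodeGraph internals:

```ts
// Hypothetical sketch of the seed node special cases; not the actual Polykey internals.
type NodeId = string;

class SeedNodePolicy {
  constructor(private seedNodeIds: Set<NodeId>) {}

  // Special case 1: seed node connections are never TTLed,
  // so their proxy connections (and NAT mappings) stay alive.
  connectionTtl(nodeId: NodeId, defaultTtl: number): number {
    return this.seedNodeIds.has(nodeId) ? Infinity : defaultTtl;
  }

  // Special case 2: seed nodes can never be evicted from the NodeGraph,
  // even when they are not responding.
  canEvictFromNodeGraph(nodeId: NodeId): boolean {
    return !this.seedNodeIds.has(nodeId);
  }
}
```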

CMCDragonkai commented 1 year ago

I've added task 17 as well. @tegefaulkes, tomorrow (Tuesday) you want to finish MatrixAI/Polykey#483 and MatrixAI/Polykey#487, and address task 17 too.

CMCDragonkai commented 1 year ago

The final manual test should involve the 2 nodes on the testnet, ensuring that they discover each other on network entry, and it should cover the NAT to CGNAT scenario.

CMCDragonkai commented 1 year ago

The current architecture of multiple seed nodes doesn't actually shard any work between them, for a couple of reasons:

  1. Without decentralised NAT hole punching, every node connects to every seed node, so there's no sharding of proxy connections between seed nodes. This is also necessary because a node has to maintain a connection to the seed node for its public node address to stay accurate, due to NAT mapping timeouts.
  2. Furthermore, during any connection between nodes, the ICE procedure sends a relay signalling message to all seed nodes, and all seed nodes relay that message to the target node. The target node then receives multiple relay messages and coalesces them via connection locking, so there's no sharding of relaying operations between seed nodes either. This can be changed with client-side round-robin/random load balancing of relay messages: since all seed nodes are expected to have a connection to all other nodes, any seed node is capable of doing the signal relay operation.

For sharding connections, this has to wait until MatrixAI/Polykey#365. For sharding signal relaying, we can do it now.
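A minimal sketch of that client-side load balancing, assuming a hypothetical relayVia callback rather than the real signalling API:

```ts
// Hypothetical sketch: instead of broadcasting the relay signalling message to
// every seed node, pick one at random per connection attempt.
type NodeId = string;

async function sendRelayMessage(
  seedNodeIds: NodeId[],
  target: NodeId,
  relayVia: (seedNodeId: NodeId, target: NodeId) => Promise<void>,
): Promise<void> {
  if (seedNodeIds.length === 0) throw new Error('No seed nodes available');
  // Any seed node is expected to hold a connection to the target,
  // so a single randomly chosen seed node can perform the relay.
  const chosen = seedNodeIds[Math.floor(Math.random() * seedNodeIds.length)];
  await relayVia(chosen, target);
}
```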

CMCDragonkai commented 1 year ago

There is special casing now for the seed nodes:

  1. When connecting to seed nodes, no relay messages are sent.
  2. Seed nodes never get removed from the node graph. (This means that if the seed nodes change, a software version update is required.)
  3. Seed node connections never get TTLed (and thus their proxy connections are maintained).

During network entry it's important to retry seed node connections too, but this can be done over time. This means node graph syncing has to be a background operation that repeats connection attempts to all of the seed nodes. It can retry every 1 second initially, with an exponential backoff that doubles up to 20 seconds.

@tegefaulkes
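A sketch of that retry loop, assuming a hypothetical connectToSeedNode function; the real sync node graph logic will differ:

```ts
// Illustrative background retry loop: start at 1 second, double up to a 20 second cap.
async function syncNodeGraphRetry(
  seedNodeIds: string[],
  connectToSeedNode: (nodeId: string) => Promise<boolean>,
  signal: AbortSignal,
): Promise<void> {
  let delayMs = 1_000;
  const maxDelayMs = 20_000;
  const pending = new Set(seedNodeIds);
  while (pending.size > 0 && !signal.aborted) {
    for (const nodeId of [...pending]) {
      // Drop a seed node from the pending set once a connection succeeds.
      if (await connectToSeedNode(nodeId).catch(() => false)) {
        pending.delete(nodeId);
      }
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    delayMs = Math.min(delayMs * 2, maxDelayMs);
  }
}
```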

CMCDragonkai commented 1 year ago

Those special cases will get removed when MatrixAI/Polykey#365 is done.

CMCDragonkai commented 1 year ago

I've closed MatrixAI/Polykey#487. Copying over the conclusion here.

This can be closed now. We know a couple of things:

  1. It is not possible to connect to a node on the same network without MatrixAI/js-mdns#1. Thus any connection test from the same network is bound to fail.
  2. Local NAT simulation tests are working now again according to @tegefaulkes in MatrixAI/Polykey#474.
  3. There are still problems with the testnet nodes failing when automated testnet connection tests terminate/finish the agent process.
  4. The network is still flaky and causes timeout errors as per MatrixAI/Polykey#474.
  5. We know that NAT to CGNAT works, and that seed nodes can contact each other.

A final test is required involving NAT to CGNAT and the 2 seed nodes together; in total 4 nodes should be tested. However, with the amount of failures we're seeing, we're going to be blocked on this until we really simplify our networking and RPC code.

So the priorities are now:

  1. Testnet nodes should not crash when connecting nodes terminate their connections.
  2. Testnet nodes should not randomly crash.
  3. Do MatrixAI/js-mdns#1 so that we can connect to nodes within the same network.
  4. Complete a 4-node manual test with 2 seed nodes and 2 connecting nodes.