Seems like our tests aren't simulating long-running behaviour. This is unfortunate, because certain bugs only became apparent after running the nodes on the testnet for longer than 1 hr.
We might need to improve our modelling tooling to help with correctness, as well as apply monitoring/remote debugging tools to our testnet to enable observation of memory and CPU cycles. Node.js provides this via its debugging port. This could be enabled on our testnet nodes, possibly bootstrapping off the client port, and related to MatrixAI/Polykey#412.
I'm also noticing that the re-connection attempts discussed in MatrixAI/Polykey#413 and MatrixAI/Polykey#415 don't appear when I use the same setup locally (two local agents where the one that gets started first is set as a seed node for the second one), so even if we had tests to simulate long-running behaviour they may not have picked up these issues since they may be limited to the testnet environment.
Also, unlike our NAT tests which use conditional testing with `describeIf` and `testIf`, these tests are conditional on a certain stage in our pipeline. We may continue to use this technique rather than explicitly excluding the tests in the jest config (or using explicit group tagging https://stackoverflow.com/questions/50171932/run-jest-test-suites-in-groups and https://morioh.com/p/33c2bd031589).
This approach means you use `describeIf` as well, but then depend on a condition that is only available via environment variables. This would be similar to @tegefaulkes' work with docker integration tests, which relies on a special test command environment variable.
We could re-use `NODE_ENV` or create our own env variable that is then funnelled into the `jest.config.js` as global parameters.
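As a rough illustration, a minimal sketch of what such conditional helpers could look like, assuming a hypothetical `PK_TEST_STAGE` environment variable (the real condition would be whatever gets funnelled through the `jest.config.js` globals):

```ts
// Sketch only: conditional test helpers keyed off an environment variable.
// `PK_TEST_STAGE` is a hypothetical name, not the project's actual variable.
const describeIf = (condition: boolean) =>
  condition ? describe : describe.skip;
const testIf = (condition: boolean) => (condition ? test : test.skip);

const isIntegration = process.env.PK_TEST_STAGE === 'integration';

describeIf(isIntegration)('testnet connections', () => {
  testIf(isIntegration)('connects to a seed node', async () => {
    // ... test body that reaches out to testnet.polykey.io
  });
});
```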
The only difference in the testnet environment is running in a docker container. If the docker container is producing different behaviour, that has to be illuminated by the docker integration testing MatrixAI/Polykey#407.
Caveat: when using `testnet.polykey.io` you are using NLBs as well... so that adds extra complexity. But this is why we are sticking to the public IPs first.
@emmacasolin you can always run your own docker container locally and set that up as your local "seed node" and have it continuously run while you write tests against it.
Just make sure to feed it all the required env variables or parameters, and mount the necessary namespaces. Meet with @tegefaulkes about this; he's already doing this.
If we continue down the path of reusing `describeIf`, it would be optimal to move our imports to be asynchronous under the `describe`. That would be similar to our bin commands where we use dynamic imports. The problem is, `describe` doesn't support asynchronous callbacks. A long term solution is to use top-level await when it becomes available in jest: https://github.com/facebook/jest/issues/2235#issuecomment-585195125
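Until then, a possible workaround is to keep the `describe` callback synchronous and do the dynamic import inside an async `beforeAll`. A minimal sketch, with an illustrative module path:

```ts
// Sketch only: `describe` callbacks cannot be async, so heavy modules can be
// dynamically imported inside an async `beforeAll` instead, similar to the
// dynamic imports in the bin commands. The module path here is illustrative.
describe('testnet connection', () => {
  let PolykeyAgent: any;
  beforeAll(async () => {
    // Dynamic import keeps the module out of the synchronous describe phase
    ({ default: PolykeyAgent } = await import('../../src/PolykeyAgent'));
  });
  test('module is loaded', () => {
    expect(PolykeyAgent).toBeDefined();
  });
});
```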
> I'm also noticing that the re-connection attempts discussed in MatrixAI/Polykey#413 and MatrixAI/Polykey#415 don't appear when I use the same setup locally (two local agents where the one that gets started first is set as a seed node for the second one), so even if we had tests to simulate long-running behaviour they may not have picked up these issues since they may be limited to the testnet environment.
This is actually incorrect. I left my local setup going in the background for several hours and both agents had attempted multiple node connections to the other agent when I checked back (so many that they were cut off). At one point when I checked one of the agents was displaying these logs (rapidly) and the other was silent but now both of them are silent. This might be from the refresh buckets queue if this is something that happens every hour? I'm going to try reducing the time between refreshes and adding timestamps to the logs.
As a side note, I think these logs appear to be infinite and constant on the testnet because it's a lot slower than my local machine, so it's only able to attempt a connection every 20 seconds, and since it takes so long to get through them it's refreshing them again by the time it's finished.
I think I know what the main issue causing our "infinite loop" is. The timeout for opening a forward proxy connection is 20 seconds, so if we try to connect to a node that is offline it blocks the refresh buckets queue for 20 seconds. The refresh timer for the refresh buckets queue is an hour (i.e. buckets are added to the queue every hour). 1 hour / 20 seconds is 180 nodes per hour if we try to connect to an offline node for every bucket (which is the case if we only have one node in our node graph and it's offline). Since there are 256 buckets, this means we won't get through all of the buckets within the hour, and buckets will begin to be added to the queue again at the same rate that they're removed. So the queue will have 256-180=76 buckets in it forever (until the node in our node graph comes back online).
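A back-of-envelope sketch of that arithmetic (numbers taken from the paragraph above):

```ts
// Back-of-envelope for the numbers above: a 20s connection timeout blocks the
// queue per offline bucket, while buckets are re-queued every hour.
const connectTimeoutMs = 20_000; // forward proxy connection timeout
const refreshIntervalMs = 60 * 60 * 1000; // buckets re-added to the queue hourly
const totalBuckets = 256;

const processedPerInterval = refreshIntervalMs / connectTimeoutMs; // 180
const steadyStateBacklog = totalBuckets - processedPerInterval; // 76
console.log({ processedPerInterval, steadyStateBacklog });
```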
I'm not sure if this blocks the entire event loop as well, in which case this is definitely a problem.
The refresh bucket and ping node queues work asynchronously in the background. I don't think it will block the event loop.
The excessive contacting of nodes as part of the refresh bucket queue is not ideal. I think we do need to optimise this but the problem is, how? We need a way to determine if we are not gaining any new information and just stop refreshing buckets for a while. But right now the real problem is that we're attempting to contact the same offline node over and over again. Right now two things come to mind to address this.
Just a note that more aggressive removal of nodes from the node graph, such as removing a node if we see it's offline, would fix this. However, this will lead to us removing nodes that may be only temporarily offline. Or worse, if a node's network goes down it will clean out its NodeGraph.
Please create a new PR to tackle this issue; you may wish to incorporate the sub-issues within this epic too.
This takes over from MatrixAI/Polykey#159. The last few comments are useful https://github.com/MatrixAI/Polykey/issues/159#issuecomment-1180056681 regarding any re-deployment of testnet.
@emmacasolin please check MatrixAI/Polykey#148 in relation to these tests, what needs to be done for that issue.
These tests should go into their own subdirectory tests/testnet and should not be run with the other tests. They should be disabled in our jest config and should only run when explicitly called (which will happen during the integration stage of our pipelines).
This is because these tests call out to the external network. Our check stage unit tests should not require external network access for running those tests, as in those tests should pass even when offline.
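A minimal sketch of how this could look in the jest config, using the standard `testPathIgnorePatterns` option (the rest of the project's actual config is omitted); the integration stage would then invoke the testnet suite explicitly, e.g. via a separate config that does not ignore it:

```js
// jest.config.js — sketch only; the project's real config is omitted here.
module.exports = {
  // Keep the check-stage unit tests offline-safe by never picking up the
  // testnet suite in a default run.
  testPathIgnorePatterns: ['/tests/testnet/'],
};
```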
In MatrixAI/Polykey#435 I'm proposing the use of directories to represent "allow lists".
Because we now have groups of tests that we want to run during "check stage" (even cross-platform check stage) and groups of tests that we want to run as part of "integration stage", these tests are part of the integration stage, as they test the integration with `testnet.polykey.io`.
Our initial group should just be `tests/integration` to indicate integration tests. In there, for this epic, we should have `tests/integration/testnet`.
During integration testing (where it is testing each platform), it will also on top of this test against the testnet as well.
So for example @tegefaulkes during the docker integration tests, it will not only be testing `tests/integration/docker`, but also `tests/integration/testnet`. While for windows it would be `tests/integration/windows` and `tests/integration/testnet`.
We also have nix integration testing, which should be testing `tests/integration/nix` and `tests/integration/testnet`.
Right now integration testing would mostly reuse tests from `tests/bin` (and until MatrixAI/Polykey#435 is resolved, it cannot really change to testing `tests/integration/testnet`).
Once this is set up, we can evaluate whether all the regular unit tests including `tests/bin` should be moved down one directory to `tests/check`.
Now these tests may still fail, so you need to write stubs for all the preconditions and postconditions, taking into account MatrixAI/Polykey#403, and also the changes we will be making in MatrixAI/Polykey#329.
The PR for the issues within this epic should be targeting the `staging` branch, but should be cherry-picking changes that are occurring in MatrixAI/Polykey#419, as that's where the initial new DB will be applied and many concurrency issues resolved. @tegefaulkes and I will be focusing on MatrixAI/Polykey#419 while @emmacasolin will be working on this issue.
Finally MatrixAI/Polykey#434 and MatrixAI/Polykey#432 should be done first and merged into staging before attempting this PR.
Focus on making a start on all the test cases even if they will be failing for now, fixes should be pushed into staging, or assigned to MatrixAI/Polykey#419.
I added MatrixAI/Polykey#388 as a subtask of this.
As discussed just now.
With the use of domains containing multiple `A` records, e.g. our testnet `testnet.polykey.io`, it seems evident that the mapping of `NodeId`s to IPs is a many-to-many relationship.
Given a node graph with the following mappings, we can end up with 4 cases that express these relationships.
```
NodeGraph<NodeID, Host> {
  // C1 The same node ID on different IPs (NOT POSSIBLE)
  NID2 -> IP1,
  NID2 -> IP2,
  // C2 Multiple NIDs on the same IP
  NID5 -> IP3,
  NID6 -> IP3,
  // C3 The same node ID on different hostnames (NOT POSSIBLE)
  NID1 -> HOSTNAME1,
  NID1 -> HOSTNAME2,
  // C4 Multiple NIDs on the same host name
  NID3 -> HOSTNAME3,
  NID4 -> HOSTNAME3,
  // It's possible to have unions of all 4 cases
  NID7 -> IP4,
  NID7 -> IP5,
  NID7 -> HOSTNAME1,
  NID8 -> IP4,
  NID8 -> HOSTNAME2,
  NID9 -> IP4,
  NID9 -> HOSTNAME1,
  NID9 -> HOSTNAME2
};
```
The NG isn't aware of the gestalt graph, and neither is it aware of the certificate chain relationship. So it's not aware of both axes. Both axes will be dealt with by other systems.
The gestalt graph relationship is dealt with at the "social network level".
The certificate chain relationship is dealt with at the "TLS level".
The node graph deals with just NID -> HOST resolution. But it can be complex here.
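As a rough illustration of the shape of that resolution (these are sketch types, not the actual Polykey type definitions):

```ts
// Sketch types only: one NodeId can map to many addresses, and the same
// address can appear under many NodeIds.
type NodeId = string;
type Host = string; // IP address
type Hostname = string; // DNS name, e.g. testnet.polykey.io
type Port = number;

type NodeAddress = {
  host: Host | Hostname;
  port: Port;
};

// NID -> many addresses; addresses are not unique across NIDs.
type NodeGraphView = Map<NodeId, Array<NodeAddress>>;
```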
Connection Establishment means we connect and Certificate verification passes
C1: one node to many addresses.
C2: many nodes to one address.
C3: one node to many host names.
C4: many nodes to one host name.
To fully support all of this we need to apply 2 things. For the user to attempt a connection to multiple nodes [NID7, NID8, NID9] with a set of IPs/connections:
```
--seed-nodes="NID1@testnet.polykey.io:1314;NID2@testnet.polykey.io:1314"
```

```ts
// NETWORK ENTRY
Promise.allSettled([
  nodeManager.getConnection(nid1),
  nodeManager.getConnection(nid2),
]);
```
Ultimately the idea is that if we go looking for a node we can find that specific node. Each node is unique and what we're after when connecting to nodes is reasonably specific to that node. Kademlia and NAT punch-through depends on this.
The problem with using NLBs here is that we end up with a situation where we have multiple nodes on the same IP and port with no way to discerningly connect to one or the other intentionally. This is not so much a problem when entering the network, where we only care about contacting any node that we can trust. For any other interaction we depend on information and state unique to the node we're looking for.
For the infrastructure, the go-to structure is that we set up an EC2 instance that runs a single node. To make this work with ECS we need to apply a few things, one of which is the user data field. This is a small script that sets a variable for the ECS network in a config.

New epic created at MatrixAI/Polykey#485 added to this one.
I'm going to create a new issue to track the Polykey infrastructure changes. Just so I don't bloat this issue with the discussion.
There are 3 additional things to be accomplished here:
- `BucketIndex/NodeId/Host/Port -> null`. MatrixAI/Polykey#482

Manual testing starts this week. Using Tailscale to help test home NAT routers.
Upon the merge of https://gitlab.com/MatrixAI/Engineering/Polykey/Polykey-Infrastructure/-/merge_requests/2, we will begin testing.
Let's write down a manual test specification. Upon each setup, please provide a screenshot accordingly. For command executions, use asciinema.
Test spec moved to MatrixAI/Polykey#487.
Cover all our bases by having unique exit codes for:
Log out the exit codes, and replicate the random failure by fuzz testing a local seed node while simulating conditions on the EC2, by running a local docker container image.
Identifying the correct exit code should be able to tell us what is happening.
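A minimal sketch of what such exit code handlers could look like (the specific codes and failure classes are assumptions); note that a SIGKILL from the kernel OOM killer cannot be caught in-process, which is why the OS-level check mentioned below still matters:

```ts
// Sketch only: distinct exit codes per failure class so the ECS/docker logs can
// tell us how the agent died. The specific codes are assumptions.
process.on('uncaughtException', (err) => {
  console.error('uncaught exception', err);
  process.exit(70);
});
process.on('unhandledRejection', (reason) => {
  console.error('unhandled rejection', reason);
  process.exit(71);
});
process.on('SIGTERM', () => {
  // e.g. the orchestrator asking us to stop; a SIGKILL (such as from the OOM
  // killer) cannot be caught here at all.
  console.error('terminated by SIGTERM');
  process.exit(143);
});
```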
ALSO check that we are not being killed by something else like the OOM killer on the operating system.
Otherwise if we cannot get anything useful, we will use strace on the entire nodejs process and run it there, and try and trigger a crash.
Alternatively we use https://rr-project.org/ and see https://fitzgeraldnick.com/2015/11/02/back-to-the-futurre.html
Not all, but most of the NAT tests are passing again.
So it appears upon deploying the new image, we do get a successful connection between office (double NAT) to the seed node. No connection timeouts. So at this point we can assume that connections from home (single NAT) to seed node should also work.
@tegefaulkes is merging the fixes to aborting the connections into the staging branch. The default timeouts have also been changed. Summary list:
- 2000
- 50
- 1000
- 20000
- 60000
Aborting with `0` should now immediately abort the connection without starting anything for UTP.
When doing `errorP.then()` this can cause unhandled rejections. We are not supposed to use these, and if you do, an explicit `catch` is necessary.
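A minimal sketch of the difference (names are placeholders):

```ts
// Sketch only: a promise that only ever rejects will surface as an unhandled
// rejection if it is only observed with `.then()`; attach an explicit catch.
const errorP = new Promise<never>((_resolve, reject) => {
  setTimeout(() => reject(new Error('connection error')), 1000);
});

// errorP.then(...) alone would leave the rejection unhandled.
errorP.catch((e) => {
  console.warn(`connection error observed: ${e.message}`);
});
```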
When aborting a connection, all resources of the UTP must be destroyed to avoid underlying side effects. Also destroy the TLS socket and any registered handlers.
So our plan is for the rest of today is to consolidate our manual tests into automatic integration tests:
These 2 tests should be done as part of the MatrixAI/Polykey#441 PR. The PR may have existing code that doesn't make sense to be used; in that case it should be sliced out and pasted into the PR for archiving until we can do a proper multi-node simulation. Then the MatrixAI/Polykey#441 PR should be merged into staging.
For tomorrow, we will complete the manual testing involving home to office. @tegefaulkes ensure that your home network is 1 NAT, not double NAT in this case. The VM network must be a bridge. Or better yet, use the laptop, not your VM to do this. We'll confirm both cases are working.
At this point manual testing will be done. We will provide evidence of all of this in the MatrixAI/Polykey#487.
The next step is finishing the automation of our multinode setup. This means:
With respect to random failures. This requires monitoring the situation on ECS, and then reproducing it locally by fuzzing a docker image locally. That can be a new bug.
There are some additional tasks that can be pushed into staging that we are discovering as we are doing the manual testing.
By Friday, we must update the CI deployment, and the CI should be fully passing (including any of the local NAT tests).
In terms of investigating the random failures.
The above last message correlates to state change events, memory utilisation, and a CPU spike:
There are several connection forwards being repeatedly started, but there is no ENDING message? No failure message? No timeout message???
So problems:
When you have finished running your automated test, you must use `testsUtils.processExit` to wait for your clients to gracefully exit.
Subsequently on the seed node we are seeing this:
Right now there are no logs telling us WHAT is causing the stopping of the connection reverse. Most likely this is a legitimate stop. We should be adding `this.logger.info` into all the places that call `ConnectionReverse.stop` inside `ConnectionReverse`, so we can identify the "cause of the stop".
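A minimal sketch of what threading a "cause" through `stop()` could look like (names and structure are assumptions for illustration, not the actual implementation):

```ts
// Sketch only: every internal call site of ConnectionReverse.stop identifies
// itself in the logs via a cause string.
class ConnectionReverse {
  constructor(protected logger: { info: (msg: string) => void }) {}

  public async stop(cause: string = 'unspecified'): Promise<void> {
    this.logger.info(`Stopping Connection Reverse (cause: ${cause})`);
    // ... destroy UTP resources, the TLS socket, and any registered handlers
  }

  protected async handleEndTimeout(): Promise<void> {
    await this.stop('end timeout');
  }

  protected async handleClientError(err: Error): Promise<void> {
    await this.stop(`client error: ${err.message}`);
  }
}
```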
Furthermore, there is an `ErrorConnectionEndTimeout`. This error is expected due to UTP native having a flaky ending system. I encountered this before, so it shouldn't be a problem.
There is a spurious IP address that exists in the NodeGraph:
It is not @tegefaulkes' home IP, nor my IP, nor the seed node IP. And the node was restarted at 3pm. So how can this IP exist?
Can you restart the agent completely fresh. And start the test again, and see if you are getting these weird IPs.
Logs from the recent test.
Next problem, in your 2 nodes test to the seed node. After we have restarted the seed node, the spurious IP is no longer there. We are seeing a successful composition of the 2 connections from the 2 nodes:
At this point there is also no `ErrorConnectionEndTimeout` either.
So connections to the seed node work.
However there are no logs for the signalling behaviour that the seed node is meant to do for the 2 nodes.
@tegefaulkes you must add in some new logs for the signalling behaviour in the seed node, probably by putting it into the GRPC service handler that performs this task. This handler should be called `sendSignalingMessage`. We should also be observing the response data in your tests as well. You want to log out what is occurring on the client side....
So the client side should be checking the response of the `sendSignalingMessage` call. If the response is negative, that should mean the seed node rejects this attempt to signal. Log this out as well.
In the decentralised NAT discussion MatrixAI/Polykey#365, we realised that we needed to maintain some connection liveness for the NAT busting to work. That was something like 6 random connections.
In this situation with a centralised seed nodes, we just need to maintain connections to the seed nodes without TTLs.
This means the node connection TTLs must not apply to the nodes designated as the seed nodes. Once we start a node, and it connects to the testnet, it maintains that connection to the testnet forever!!! Figure out the distinction between node connection and proxy connection.
Node connection TTL is based on RPC traffic. Proxy connection TTL is based on keep-alive. So what we really need is to maintain the proxy connection, which should happen via the keep-alive. This will ensure that the external ports in the NodeGraph are always eventually consistent and should converge quickly.
The signaling process just sets up the bidirectional hole punching. As soon as A1 and A2 are sending packets to each other, the signaling is not involved anymore. They just need to send packets to each other and the hole punching to work (up to and including endpoint-independent proxy-restricted NAT).
So @tegefaulkes you need to differentiate the logs between A1 and A2, give them different prefixes, and then observe for the concurrent connection forwards to each other.
I think we should also change the default logger format for PK to output `keys` and not `key`. This will be more useful for us.
It's important to remember that we are sending UDP packets. In such a case, port preservation is not required, and hole punching should work.
Of course symmetric NAT will still be unpunchable. But that's not going to be the problem here with a home router.
As soon as the 2 nodes to testnet is working (that means the signaling process is working), then I suspect the home to office manual test should also work. So this is the next priority after you've merged that PR.
Reference from libp2p's implementation that they only did recently: https://blog.ipfs.tech/2022-01-20-libp2p-hole-punching/
MatrixAI/Polykey#441 has been merged.
I've been running the agent since yesterday, over 24 hours. No crash atm. The task manager and related things have been working fine. So we'll push back the analysis of why it's crashing on ECS to the last set of bugs to look into.
Some things to change for `NodeConnectionManager.relaySignallingMessage`:

- `sourceAddress`. This is the address that is to be acquired by the `connectionInfoGetter` at the GRPC handler level.
- The `sourceAddress` goes along with the existing message when sending a new signal message to the next node.
- `NodeConnectionManager.relaySignallingMessage` checks whether this node is the target. If it is, then it uses the message's source address properties, and if that doesn't exist, it will throw an exception which should bubble back to the sender. Once it has the message's source address properties, it will hole punch back there.

The reason we use `connectionInfoGetter` instead of taking it from the NG is that updating the NG with the new source address for a given incoming connection may be asynchronous. So by using the `connectionInfoGetter`, we are always using the most up-to-date information to relay it.
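A rough sketch of that relay flow, with all names and signatures being assumptions rather than the actual Polykey API:

```ts
// Sketch only: the handler stamps the live source address onto relayed
// messages, and the final hop uses that address to hole punch back.
type Address = { host: string; port: number };
type SignalMessage = {
  targetNodeId: string;
  sourceNodeId: string;
  sourceAddress?: Address;
};

async function relaySignallingMessage(
  msg: SignalMessage,
  // Live source address of the incoming connection, from the GRPC handler level
  connectionInfoGetter: () => Address | undefined,
  sendToNextNode: (msg: SignalMessage) => Promise<void>,
  holePunchBack: (address: Address) => Promise<void>,
  isSelf: (nodeId: string) => boolean,
): Promise<void> {
  if (isSelf(msg.targetNodeId)) {
    // Final hop: punch back to the source address carried in the message
    if (msg.sourceAddress == null) {
      // Bubbles back to the sender
      throw new Error('signal message missing source address');
    }
    await holePunchBack(msg.sourceAddress);
  } else {
    // Relay hop: stamp the most up-to-date source address onto the message
    await sendToNextNode({
      ...msg,
      sourceAddress: connectionInfoGetter() ?? msg.sourceAddress,
    });
  }
}
```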
At the same time, the node connection TTL should be disabled or set to very high amount of time to eliminate this factor in our testing. Subsequently the node connection TTL should be disabled for node connections to seed nodes, or at the very least, the underlying proxy connection should be maintained to the seed nodes forever.
As long as we see N1 have a `ConnectionForward` to N2 and simultaneously N2 have a `ConnectionReverse` to N1, then we can at the very least say that the signalling operation is successful.
Whether the packets actually go through or not will depend on the CGNAT in our office not acting like a symmetric NAT. So that's a second layer check.
To confirm this, @tegefaulkes get evidence of the packets from both ends N1 and N2 on your local wireshark, and also correlate with the expected logs of `ConnectionForward` on N1 and `ConnectionReverse` on N2.
Ok, so I've gotten everything set up and run the test. Here is a log of the test from the local perspective.
Based on these results I can tell the following
Looking at how the ping node is implemented, it works in 2 stages.
Here it is failing at stage 1, with the error `ErrorNodeGraphNodeIdNotFound`.
So the reason is that node information is not added to the `NodeGraph` until a connection is made. But in this case, ICE must be done... and it seems ICE is not being done if "The signalling messages are never sent".
The solution may be to factor the ICE so there's only 1 place in the codebase, probably the `NodeConnectionManager`, to do the ICE. Centralise it to that single method there. So that any time there's an attempt to connect to a node, we should see an ICE attempt.
There was a bug with the pingNode implementation. When checking if the target was a seed node it was always returning true. As a result it was never sending the signalling packets.
With that fixed I have made some progress. I can see the signalling packets being sent now and they contain the correct information. Both nodes are sending their ping packets to establish the connection to the right destinations.
824 15:46:28.650004791 192.168.1.103 120.17.171.29 UDP 64 55551 → 4608 Len=20
825 15:46:28.651062640 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
880 15:46:30.215503835 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
915 15:46:31.006846790 192.168.1.103 120.17.171.29 UDP 64 55551 → 4608 Len=20
916 15:46:31.008337907 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1064 15:46:31.824906773 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1082 15:46:31.905726197 192.168.1.103 120.17.171.29 UDP 64 55552 → 4863 Len=20
1083 15:46:31.907199992 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1085 15:46:32.008758907 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1093 15:46:32.825800648 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1097 15:46:32.908174815 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1098 15:46:33.008749977 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1142 15:46:34.009075572 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1163 15:46:34.066630442 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1182 15:46:34.207902946 192.168.1.103 120.17.171.29 UDP 64 55552 → 4863 Len=20
1183 15:46:34.208480616 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1216 15:46:34.422260656 192.168.1.103 120.17.171.29 UDP 64 55551 → 4608 Len=20
1282 15:46:34.846769367 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1294 15:46:35.008680790 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1295 15:46:35.067271965 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1296 15:46:35.208519267 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1310 15:46:35.847162119 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1313 15:46:36.008929042 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1341 15:46:36.208701200 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1350 15:46:36.262666178 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1396 15:46:37.008736366 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1404 15:46:37.022072223 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1407 15:46:37.209452251 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1408 15:46:37.262923128 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1409 15:46:37.406779322 192.168.1.103 120.17.171.29 UDP 64 55552 → 4863 Len=20
1425 15:46:38.009426768 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1426 15:46:38.021560816 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1427 15:46:38.210308059 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1470 15:46:39.009779812 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1482 15:46:39.210128935 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1506 15:46:39.363960266 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1530 15:46:40.010325170 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1531 15:46:40.209512607 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1532 15:46:40.363503296 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1534 15:46:40.463420734 192.168.1.103 120.17.171.29 UDP 64 55551 → 4608 Len=20
1561 15:46:41.010280077 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1562 15:46:41.209574563 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1606 15:46:42.010311873 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1607 15:46:42.209508551 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1624 15:46:43.009986789 192.168.1.103 120.17.171.29 UDP 48 55551 → 4608 Len=4
1625 15:46:43.210297702 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
1627 15:46:43.439491371 192.168.1.103 120.17.171.29 UDP 64 55552 → 4863 Len=20
1743 15:46:44.210336018 192.168.1.103 120.17.171.29 UDP 48 55552 → 4863 Len=4
So it seems we're doing everything correctly here but the packets are still not getting through. It could be a problem with having both nodes on the local network. If the CGNAT doesn't do something like hairpinning, then that would prevent a connection.
MatrixAI/Polykey#488 leads us to refactor the TTLs and expiry for `NodeConnectionManager` and `NodeGraph`.
See: https://github.com/MatrixAI/Polykey/issues/488#issuecomment-1296542122
But basically until MatrixAI/Polykey#365 is possible, it is necessary to special case the seed nodes:
- A seed node's `NodeId` is not allowed to be removed from the NodeGraph. Seed nodes must always be preferred to exist even if they aren't responding. Network entry may need to repeatedly try to maintain connection with these nodes. That means a loop that regularly attempts pings on the seed node list is necessary.

MatrixAI/Polykey#365 generalises this process to a random set of X number of nodes in the entire network.
I've added task 17. as well. @tegefaulkes tomorrow Tuesday, you want to finish MatrixAI/Polykey#483 and MatrixAI/Polykey#487 and address task 17. too.
The final manual test should involve the 2 nodes on the testnet, ensure that they are discovering each other on network entry, and do the NAT to CGNAT.
The current architecture of multi-seed nodes doesn't actually shard any work between the nodes. This is due to a couple of reasons:
For sharding connections, this has to wait until MatrixAI/Polykey#365. For sharding signal relaying, we can do that now.
There is special casing now for the seed nodes:
During network entry, it's important to retry seed node connections too. But this could be done over time. This means that during sync node graph, there has to be a background operation that repeats connection attempts to all the seed nodes. This can be done with a timeout that tries to connect to the seed nodes every 1 second, doubling exponentially up to 20 seconds.
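A minimal sketch of that background retry loop (`pingSeedNode` stands in for whatever the node manager actually exposes):

```ts
// Sketch only: background retry of seed node connections, starting at 1 second
// between attempts and doubling up to a 20 second cap.
async function maintainSeedNodeConnections(
  seedNodeIds: Array<string>,
  pingSeedNode: (nodeId: string) => Promise<boolean>,
): Promise<void> {
  let delayMs = 1_000;
  while (true) {
    const results = await Promise.allSettled(
      seedNodeIds.map((id) => pingSeedNode(id)),
    );
    const allUp = results.every((r) => r.status === 'fulfilled' && r.value);
    // Reset the back-off once every seed node is reachable again
    delayMs = allUp ? 1_000 : Math.min(delayMs * 2, 20_000);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```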
@tegefaulkes
Those special cases will get removed when MatrixAI/Polykey#365 is done.
I've closed MatrixAI/Polykey#487. Copying over the conclusion from here.
This can be closed now. We know a couple of things:
- It is not possible to connect to a node on the same network without MatrixAI/js-mdns#1. Thus any connection tests from the same network are bound to fail.
- Local NAT simulation tests are working now again according to @tegefaulkes in MatrixAI/Polykey#474.
- There are still problems with the testnet nodes failing when automated testnet connection tests terminate/finish the agent process.
- Network is still flaky and causes timeout errors as per MatrixAI/Polykey#474.
- We know that NAT to CGNAT works. And seed nodes can contact each other.
A final test is required involving NAT to CGNAT and the 2 seed nodes together. In total 4 nodes should be tested. However, with the amount of failures, we're going to be blocked on this until we really simplify our networking and RPC code.
So the priorities are now:
Specification
We need a suite of tests to cover interacting with a deployed agent, which we can do using `testnet.polykey.io`. These tests need to cover various different connection scenarios.

These tests should go into their own subdirectory `tests/testnet` and should not be run with the other tests. They should be disabled in our jest config and should only run when explicitly called (which will happen during the integration stage of our pipelines).

Required tests:
- `tests/testnet/testnetConnection.test.ts`
- `tests/testnet/testnetPing.test.ts`
- `tests/testnet/testnetNAT.test.ts`
Additional context

- `testnet.polykey.io`

Tasks

- `testnet.polykey.io`
- ~~`--log='/regex/'`.~~ - not relevant to integration testing; the `js-logger` does support REGEX filtering, but the PK CLI currently doesn't have this option.
- ~~`agent status` command needs to display useful information like the polykey version and other useful statistics like active connections, number of node graph entries etc etc.~~
- ~~`--client-host` needs to support host names.~~ - this is pending a change to being able to use `PolykeyClient` to connect to a host name, which would require using the DNS-SD SRV records. It still needs to be specced out how this would work, because in some cases you want to connect to a SINGLE node, in other cases you are "discovering" a node to connect to, but it's not relevant to this epic.
- `src/config.ts` - from MatrixAI/Polykey#488

Emergent bugs
- `ConnectionReverse.start()` between `Starting Connection Reverse` and `Started Connection Reverse`. Start by testing locally.