Integration Tests for `testnet.polykey.com`

emmacasolin commented 2 years ago

Specification

We need a suite of tests to cover interacting with a deployed agent, which we can do using testnet.polykey.io. These tests need to cover several connection scenarios:

Connecting to a deployed agent as a seed node during startup
Pinging a deployed agent
Using the deployed agent as a signaller (and eventually a relay once this is implemented)
Tests for when the deployed agent is contacting an agent behind a NAT (https://github.com/MatrixAI/js-polykey/issues/159)
Any bugs that are discovered during the above tasks

These tests should go into their own subdirectory tests/testnet and should not be run with the other tests. They should be disabled in our jest config and should only run when explicitly called (which will happen during the integration stage of our pipelines).
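One way to keep tests/testnet out of the default run is to ignore that directory unless an opt-in switch is set. The following is a minimal sketch, not the project's actual config: `buildJestConfig` and its `runTestnet` flag are hypothetical names, and only `testMatch` and `testPathIgnorePatterns` are real Jest options.

```typescript
// Subset of Jest options used in this sketch.
interface JestConfigSketch {
  testMatch: string[];
  testPathIgnorePatterns: string[];
}

// `runTestnet` is a hypothetical opt-in switch, e.g. set only by the
// integration stage of the pipeline.
function buildJestConfig(runTestnet: boolean): JestConfigSketch {
  if (runTestnet) {
    // Explicit call: run only the testnet suite.
    return {
      testMatch: ['<rootDir>/tests/testnet/**/*.test.ts'],
      testPathIgnorePatterns: ['/node_modules/'],
    };
  }
  // Default run: every test except the testnet suite.
  return {
    testMatch: ['<rootDir>/tests/**/*.test.ts'],
    testPathIgnorePatterns: ['/node_modules/', '<rootDir>/tests/testnet/'],
  };
}
```

A `jest.config` file could export `buildJestConfig(false)` by default, with the pipeline selecting the other branch explicitly.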

Required tests:

tests/testnet/testnetConnection.test.ts
- Can connect to the testnet
- Within a reasonable amount of time
- Without errors/shutting down the local agent
- Without errors/shutting down the testnet
- Can disconnect from the testnet
- Within a reasonable amount of time
- Without errors/shutting down the local agent
- Without errors/shutting down the testnet
- Can reconnect to the testnet
- Able to handle different node ids (testnet is a cluster of nodes)
tests/testnet/testnetPing.test.ts
- Can ping the testnet
- Able to handle different node ids (testnet is a cluster of nodes)
- Can ping another node via the testnet (signaling)
- Can ping another node via the testnet (relay)
- Can attempt to ping another node that doesn't exist
- Without shutting down the testnet
tests/testnet/testnetNAT.test.ts
- Can ping a node that is behind endpoint-independent NAT via the testnet
- From a node that is not behind a NAT (DMZ)
- From a node that is behind endpoint-independent NAT
- From a node that is behind endpoint-dependent NAT
- Can ping a node that is behind endpoint-dependent NAT via the testnet
- From a node that is not behind a NAT (DMZ)
- From a node that is behind endpoint-independent NAT
- From a node that is behind endpoint-dependent NAT
Should also incorporate tests from https://github.com/MatrixAI/Polykey/pull/326#issuecomment-1041190329
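The NAT cases listed for tests/testnet/testnetNAT.test.ts form a 2 x 3 matrix of target and source network types. As a sketch, the matrix could be generated rather than written out by hand; the type and function names here are illustrative only and are not taken from the Polykey codebase.

```typescript
// Network positions named in the required tests.
type NatKind = 'DMZ' | 'endpoint-independent NAT' | 'endpoint-dependent NAT';

interface NatCase {
  source: NatKind;
  target: NatKind;
  title: string;
}

// Targets are always behind some NAT; sources may also be in the DMZ.
const natTargets: NatKind[] = ['endpoint-independent NAT', 'endpoint-dependent NAT'];
const natSources: NatKind[] = ['DMZ', 'endpoint-independent NAT', 'endpoint-dependent NAT'];

// Enumerate every (target, source) combination in the order listed above.
function natCases(): NatCase[] {
  const cases: NatCase[] = [];
  for (const target of natTargets) {
    for (const source of natSources) {
      cases.push({
        source,
        target,
        title: `can ping a node behind ${target} via the testnet from ${source}`,
      });
    }
  }
  return cases;
}
```

Each entry could then feed a Jest `test.each` table, so adding a new network type only means extending one of the two arrays.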

Additional context

https://github.com/MatrixAI/js-polykey/issues/159 - NAT traversal tests against testnet.polykey.io
https://github.com/MatrixAI/js-polykey/issues/414 - Issue for failed connections to/from a deployed agent
https://github.com/MatrixAI/js-polykey/issues/415 - Issue for continuous re-connection bug when connecting to deployed seed node
https://github.com/MatrixAI/js-polykey/issues/413 - Issue for potential infinite loop when connecting to deployed seed node

Tasks

[x] 1. Attempt connections to the deployed seed node and create issues for all bugs discovered (and resolve them)
[x] 2. Create tests for simple connections to testnet.polykey.io
- 1 node connected to testnet.polykey.io and maintains connection
- 2 nodes connected to testnet.polykey.io and can ping each other (they will have the same IP but different ports)
~[ ] 3. Create tests for edge cases and previous bugs~ - most edge cases will go to the simulation suite
~[ ] 4. Create tests for connecting to a deployed seed node from behind a NAT~ - this can only be done as part of a simulation suite, since we don't control host firewalls
[x] 5. Finish off all diagrams as part of NAT testing MatrixAI/Polykey#388
[x] 6. Add new INFO level logs for situations where connections are going into stopping. This is next to the debug logs.
~[ ] 7. Add a logging debug filter to command line arguments to take a regular expression. Should be global option like --log='/regex/'.~ - not relevant to integration testing, the js-logger does support REGEX filtering, but the PK CLI currently doesn't have this option.
~[ ] 8. Use https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerOverride.html to be able to easily try different debugging levels and filters.~ - cannot use container overrides in services, can be done as part of tasks though, will have to just redeploy service with different task definition each time
~[ ] 9. agent status command needs to display useful information like the polykey version and other useful statistics like active connections, number of node graph entries etc etc.~
~[ ] 10. --client-host needs to support host names.~ - this is pending a change to being able to use PolykeyClient to connect to a host name - which would require using the DNS-SD SRV records. This still needs to be specced out how this would work because in some cases you want to connect to a SINGLE Node, in other cases you are "discovering" a node to connect to, but it's not relevant to this epic.
[x] 11. EC2 setup with idempotency
[x] 12. Multi Node Setup on AWS
[x] 13. Recovery Code Pool on AWS
[x] 14. Multi-Host DNS resolution
[x] 15. Multi Node Resolver
[x] 16. NodeGraph KeyPath to lift Host and Port to the key path
[x] 17. Put the trusted testnet seed nodes into the src/config.ts - from MatrixAI/Polykey#488

Emergent bugs

[x] 1. Seed node is failing to establish a reverse connection back to the connecting node. The problem is somewhere in ConnectionReverse.start() between Starting Connection Reverse and Started Connection Reverse. Start by testing locally.
[x] 2. Fixing the Random Process Termination
- https://github.com/MatrixAI/Polykey-CLI/issues/71
[x] 3. Verify that new timeouts are being respected
- https://github.com/MatrixAI/Polykey-CLI/issues/71

CMCDragonkai commented 11 months ago

As per MatrixAI/Polykey#551, we now have a successful connection to a stable running testnet in testnet.polykey.com.

It also makes sense to have a test suite for integrating with testnet.polykey.com here in the Polykey repo, but one that runs separately from the CI's main pipeline, so it doesn't block anything, because the testnet can be a bit flaky. However, as we go ahead it should become more stable.

We should start writing some simple tests that can be run separately from the npm run test script. One way is grouping; another is just by directory. If we go by directory, it would be important to move all of the unit tests into their own subdirectory.

amydevs commented 9 months ago

I've changed the docker integration tests temporarily to simply run the image with docker run. This is so that we can get a github release with the working binary executables, otherwise integration:docker will fail on some tests.

What needs to be done:

The problem at hand is that the tests are binding the agent socket to the localhost interface. This will need to be changed to a wildcard interface that supports IPv4 (::, 0.0.0.0, etc.).

The tests are timing out because the node ChildProcess is unable to kill the agent when the program that docker is running has already crashed. The process crashes in these tests because it attempts to connect to the testnet while bound to a localhost interface, so the agent cannot send packets to any globally routable IP address. Hence, when given the globally routable IP address of a testnet node, the send throws EINVAL: a globally routable address is an invalid argument to a send on a socket that is bound to a localhost interface.
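A pre-flight check in the test setup could surface this misconfiguration with a clear message before any packet is sent. The sketch below is an assumption about how such a guard might look, not existing Polykey code: all three function names are hypothetical, and the check is purely textual (it does not cover forms such as IPv4-mapped IPv6 addresses).

```typescript
// True for "localhost", ::1 and anything in 127.0.0.0/8 (textual check only).
function isLoopbackHost(host: string): boolean {
  const h = host.trim().toLowerCase();
  return h === 'localhost' || h === '::1' || /^127\.\d{1,3}\.\d{1,3}\.\d{1,3}$/.test(h);
}

// True for the wildcard addresses mentioned above.
function isWildcardHost(host: string): boolean {
  const h = host.trim();
  return h === '0.0.0.0' || h === '::';
}

// Hypothetical guard: fail early instead of hitting EINVAL on the first send.
function assertCanReachTestnet(bindHost: string): void {
  if (isLoopbackHost(bindHost)) {
    throw new Error(
      `Agent is bound to loopback host "${bindHost}"; ` +
        'bind to a wildcard address such as 0.0.0.0 or :: to reach the testnet',
    );
  }
}
```

For example, `assertCanReachTestnet('127.0.0.1')` throws, while `assertCanReachTestnet('::')` returns normally.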

Notes about container network behaviour:

[Diagram: Untitled-2023-10-23-0424 excalidraw(6)]

CMCDragonkai commented 9 months ago

Moving this to Polykey-CLI since integration tests of this sort can only be done as a "process".

Although lighter integration tests should still be in PK the library.

CMCDragonkai commented 9 months ago

@tegefaulkes this issue can be closed once:

In the CI job for PK-CLI, we introduce polling calls to Polykey-Network-Status to ask whether the currently distributed/released image version has been deployed to testnet.polykey.com, and only then run the integration tests.
This means, @amydevs, you want to expose the currently deployed version on the API of PK-N-S.
If the version isn't deployed in sufficient time, that would block the rest of the CI, and that's ok. The pipeline will time out and we would restart the pipeline afterwards.
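The polling step could be sketched as a generic loop with an interval and an overall timeout. Because the Polykey-Network-Status API is not specified in this thread, the actual query is injected as a `check` callback; `pollUntil`, `PollOptions` and the injectable `sleep`/`now` hooks are hypothetical names used only for this illustration.

```typescript
interface PollOptions {
  intervalMs: number; // delay between checks
  timeoutMs: number; // total budget before giving up
  // Injectable for testing; default to real timers and the real clock.
  sleep?: (ms: number) => Promise<void>;
  now?: () => number;
}

// Resolves true as soon as `check` succeeds, or false once the next
// attempt would fall outside the timeout budget.
async function pollUntil(
  check: () => Promise<boolean>,
  opts: PollOptions,
): Promise<boolean> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const now = opts.now ?? (() => Date.now());
  const deadline = now() + opts.timeoutMs;
  while (true) {
    if (await check()) return true;
    if (now() + opts.intervalMs > deadline) return false;
    await sleep(opts.intervalMs);
  }
}
```

In a CI job, a `false` result would translate into a failed step, which matches the "pipeline times out and is restarted" behaviour described above.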

This can then re-enable integration tests in our integration jobs after deployment. And those nodes should be connecting to the testnet and doing all the tests.

To do this, the docker integration tests need to bind to a wildcard address to avoid problems with connecting to the internet, i.e. the testnet.

That means for now, this issue is blocked on @amydevs completing #599.

Also I've removed the 2 subissues relating to NAT testing, because those would need to be done in the PKI - not here.

CMCDragonkai commented 9 months ago

This epic is almost ready to be closed. @tegefaulkes focus on getting the integration tests cleaned up and working. And work with @amydevs to get the API call to testnet.polykey.com/api to be able to know what the current version is. Work out a spec for what the API should return, and how you would know. Also work out the timeout, i.e. what counts as sufficient time for deployment.

CMCDragonkai commented 9 months ago

To clarify, all tests prior to integration tests should be configured to not connect to any network at all, unless it's simulating a local network within the tests.

tegefaulkes commented 9 months ago

We're going to streamline how the integration tests work. This is going to be done with the following changes.

All the standard tests will not attempt connections to any network. They should explicitly be started with no seed nodes.
The integration tests will be separated from the standard tests.
- Remove usage of the testif utility from the standard tests.
- Integration tests will be in a separate folder structure from the standard tests. Integration tests will focus on connecting to the testnet.
Integration tests need to wait for the testnet to be updated. To this end the following changes are to be made.
- The CLI CI job for triggering the testnet seed nodes using the new image will be removed. This will be handled by the testnet infrastructure.
- A job will be created to wait for the seed nodes to switch to the new version. This will be done by polling a Polykey dashboard endpoint that will either report when all seed nodes have been updated, or list the versions of all the seed nodes. This will need to be specced out.
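If the dashboard endpoint ends up listing the version of every seed node, the waiting job only needs a predicate over that list. Since the endpoint has not been specced yet, the response shape below (a map from node ID to version string) is purely an assumption, as are the names `SeedNodeVersions` and `allSeedNodesUpdated`.

```typescript
// Assumed response shape: seed node ID -> reported version string.
type SeedNodeVersions = Record<string, string>;

// True only when at least one seed node is listed and every listed
// node reports the expected version.
function allSeedNodesUpdated(versions: SeedNodeVersions, expected: string): boolean {
  const reported = Object.keys(versions).map((id) => versions[id]);
  // An empty listing is treated as "not updated" rather than vacuously true.
  return reported.length > 0 && reported.every((v) => v === expected);
}
```

Treating an empty listing as "not updated" is a deliberate choice here: a dashboard that has lost track of the seed nodes should not let the integration tests proceed.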

I'm going to make a new issue to track this work and add it to this epic.

CMCDragonkai commented 9 months ago

There are some ideas for tests coming from the OP spec:

Functionally disconnecting from the testnet seed nodes and reconnecting to it.
Using tc or firewall rules to break the connection to a particular node, and then seeing how PK reacts to that, and also re-enabling a few seconds later.
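For the firewall variant, a test helper could produce a matching pair of commands to break and later restore traffic to one node. This is a sketch under several assumptions: it uses plain `iptables` rules on the OUTPUT chain (so only outgoing packets are dropped), it needs root on a Linux host, and `iptablesDropCommands` is a hypothetical helper, not part of any existing test utility. The IP address in the usage note is from a documentation range.

```typescript
interface BlockCommands {
  block: string[]; // argv that appends the DROP rule
  unblock: string[]; // argv that deletes the same rule again
}

// Build argv arrays rather than shell strings so nothing is interpolated
// into a shell; only plain dotted IPv4 addresses are accepted.
function iptablesDropCommands(nodeIp: string): BlockCommands {
  if (!/^\d{1,3}(\.\d{1,3}){3}$/.test(nodeIp)) {
    throw new Error(`Not a plain IPv4 address: ${nodeIp}`);
  }
  const rule = ['OUTPUT', '-d', nodeIp, '-j', 'DROP'];
  return {
    block: ['iptables', '-A', ...rule],
    unblock: ['iptables', '-D', ...rule],
  };
}
```

A test would run `iptablesDropCommands('203.0.113.7').block`, observe how PK reacts, wait a few seconds, then run the `unblock` command and check that the connection recovers.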

CMCDragonkai commented 9 months ago

Also our current simulated NAT tests have been disabled for some time:

»» ~/Projects/Polykey-CLI/tests/nat
 ♖ tree .                                                                                                 (staging) pts/7 9:53:42
.
├── DMZ.test.ts
├── endpointDependentNAT.test.ts
├── endpointIndependentNAT.test.ts
└── utils.ts

1 directory, 4 files

These tests can be adapted to a Polykey Infrastructure to test it at scale. It might be more "maintainable" if we do it via AWS rather than simulating it locally, which has a lot of constraints on the platform.

tegefaulkes commented 8 months ago

This is done now except for 1 minor change that still needs to be done. I'll be creating an issue for that as I can't deal with it now.
