MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0

Manual Testing of `testnet.polykey.io` #487

Closed CMCDragonkai closed 1 year ago

CMCDragonkai commented 2 years ago

Signalling Triad:

[Image: signalling triad]

Tasks

Setup

  1. Set up testnet.polykey.io as the default testnet seed node configuration in source code. Put it in ./src/config.ts.
  2. CloudWatch no longer has information from NLB. Therefore we do not have connection visualisation at this point.
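A minimal sketch of what a default testnet seed node entry could look like. The type names and the `formatSeedNodes` helper are hypothetical, and the actual shape of ./src/config.ts is not shown in this issue; only the node ID, hostname, port and the `nodeId@host:port` form come from the commands later in this thread.

```typescript
// Hypothetical types: the real config structure may differ.
type SeedNodeAddress = { host: string; port: number };
type SeedNodes = Record<string, SeedNodeAddress>;

// Node ID, hostname and port as used with --seed-nodes in this thread.
const testnetSeedNodes: SeedNodes = {
  v1mnaq2ppfrbfk5le1i7j68p5sodh3904v12lp4u04pflq1gumks0: {
    host: 'testnet.polykey.io',
    port: 1314,
  },
};

// Render each entry in the `nodeId@host:port` form seen on the command line.
function formatSeedNodes(seedNodes: SeedNodes): string[] {
  return Object.entries(seedNodes).map(
    ([nodeId, { host, port }]) => `${nodeId}@${host}:${port}`,
  );
}
```

Keeping the defaults keyed by node ID means the same record can be used both for lookups and for producing the command-line form.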

Client Service

  1. [x] Start Polykey locally and run an agent status command. Pass it --client-port 1315 and set --client-host to the static IP address.
  2. [x] Success is if we see an agent status report.

Office Node to Seed Node

The office is a carrier-grade NATted network. Therefore it has to contact the seed node and maintain a connection.

  1. [x] Start Polykey locally with npm run polykey -- agent start. Observe a successful command execution.
  2. [x] Observe the agent logs and check if a connection has been made.
  3. [x] Observe the CloudWatch logs of the remote agent and check if a connection has been made.
  4. [x] Success is if the DNS resolution resolves to the IP address, and the NodeGraph has been set up.
  5. [x] Adjust automated test for tests/integration in MatrixAI/Polykey#441 to suit.

Home Node to Seed Node

Home is a regular NATted network. Therefore it has to contact the seed node and maintain a connection.

  1. [x] Start Polykey locally. Observe a successful command execution.
  2. [x] Observe agent logs.
  3. [x] Observe CloudWatch logs.
  4. [x] Success is if the DNS resolution resolves to the IP address, and the NodeGraph has been set up.

Note that the automated tests for connection startup to seed nodes are one-way only. There's no way to simulate 2 nodes in the tests atm, so we will only do the tests required above in tests/testnet/testnetConnection.test.ts.

Office Node to Home Node

  1. [x] Start Polykey in office system, and Polykey on home system
  2. [x] Observe agent logs for signaling operations
  3. [x] Observe that the hole punching works

CGNAT to CGNAT - Office (CGNAT) to Home (CGNAT)

This would resolve MatrixAI/Polykey#383. This would be necessary for mobile networks.

This is low priority atm, so we will put this to MatrixAI/Polykey#383.

Multiple Office/Home Nodes to Seed Node

We should be able to start 2 nodes on different ports on the same machine to connect to the same testnet. This should be integrated into MatrixAI/Polykey#441.

  1. [ ] Repeat Office Node to Seed Node for 2 nodes on different ports.
  2. [ ] Integrate the change into MatrixAI/Polykey#441.

It turns out this is actually more complex. It's not possible to do this without the router performing hairpinning. And most routers don't have hairpinning. See: https://github.com/tailscale/tailscale/issues/188 and https://github.com/MatrixAI/Polykey/issues/487#issuecomment-1294742114

tegefaulkes commented 2 years ago

We can get the status of the seed node. I think the status should show the current version of Polykey. We can take that from the package.json right?

```
[nix-shell:~/matrixcode/polykey/js-polykey]$ npm run polykey -- agent status --node-id v1mnaq2ppfrbfk5le1i7j68p5sodh3904v12lp4u04pflq1gumks0 --client-host 3.106.178.29 --client-port 1315 -v

> polykey@1.0.1-alpha.0 polykey
> ts-node src/bin/polykey.ts "agent" "status" "--node-id" "v1mnaq2ppfrbfk5le1i7j68p5sodh3904v12lp4u04pflq1gumks0" "--client-host" "3.106.178.29" "--client-port" "1315" "-v"

INFO:PolykeyClient:Creating PolykeyClient
INFO:Session:Creating Session
INFO:Session:Setting session token path to /home/faulkes/.local/share/polykey/token
INFO:Session:Starting Session
INFO:Session:Started Session
INFO:Session:Created Session
INFO:GRPCClientClient:Creating GRPCClientClient connecting to 3.106.178.29:1315
INFO:GRPCClientClient:Created GRPCClientClient connecting to 3.106.178.29:1315
INFO:PolykeyClient:Starting PolykeyClient
INFO:PolykeyClient:Started PolykeyClient
INFO:PolykeyClient:Created PolykeyClient
✔ Please enter the password … ******
INFO:PolykeyClient:Stopping PolykeyClient
INFO:GRPCClientClient:Destroying GRPCClientClient connected to 3.106.178.29:1315
INFO:GRPCClientClient:Destroyed GRPCClientClient connected to 3.106.178.29:1315
INFO:Session:Stopping Session
INFO:Session:Stopped Session
INFO:PolykeyClient:Stopped PolykeyClient
status LIVE
pid 1
nodeId v1mnaq2ppfrbfk5le1i7j68p5sodh3904v12lp4u04pflq1gumks0
clientHost 0.0.0.0
clientPort 1315
proxyHost 0.0.0.0
proxyPort 1314
agentHost 127.0.0.1
agentPort 42793
forwardHost 127.0.0.1
forwardPort 39185
rootPublicKeyPem -----BEGIN PUBLIC KEY-----
MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIICCgKCAgEAvGV2S76OPXIW3aap6j2o
lH6BxsJchhUKcIA+kxttXTN/AaXue+8qDp+mpqagHvk5aiZyJ6eGEmDWyqTUle+f
uESb8A3CkWy+neUDFm6k++Psyvy4Lsblhil3lm3PqgfFG0vjQ3DKUVfn9dq2Bl4c
+63ArqJioD6Q78+hdxPCbKwP8yn0tFUfw3YzzaXEBBUTLbsW4B5VzNA5jtgqPF3N
Dvcl0RqMFEij38IHWgYyoe1jxIWwuQJJ6q/Wxl42t3iLFMrsugz8n8AAFscsvkj8
4Pik52Rk8Si+/6xh+Qq3GG58ov5HECNH/+ckZdhOdJYYcd5SzzS7L6mGu87ng6M+
coiI/HbkiX6CveCiu6MMJbK3Co3IqY4Pcx2OGzm17JY3maidWYQg8TrSwPSKypN+
V4/vCKVnRIxRccFFdEnykxmSwVcsRypv85+PZEX2ofj5Bw5EoumRsN9bdxCZrqAD
vPJcjmsdCI39rhgmx7btY3w68T5JzGxig3qSpERL5DpVpUDvl5s5AJYWEgjiF9Bv
bKZ5LwHerz6SY6QK/vQbwrZKYnYhR7ZXMkZKj98yQVmKiIlgAzyIcdCxO9oSDYdH
/1kylZwYJnfS1osMGHXLKaXu2yKhto9qVjw/t/hlanBvR4yrENSzDrurL1a2d0uA
Ra8EBSILNxX9xJY3M3M6/u8CAwEAAQ==
-----END PUBLIC KEY-----
rootCertPem -----BEGIN CERTIFICATE-----
MIIIKDCCBhCgAwIBAgIFFmZlgpIwDQYJKoZIhvcNAQELBQAwQDE+MDwGA1UEAxM1
djFtbmFxMnBwZnJiZms1bGUxaTdqNjhwNXNvZGgzOTA0djEybHA0dTA0cGZscTFn
dW1rczAwHhcNMjIxMDI1MDAzODExWhcNMjMxMDI1MDAzODExWjBAMT4wPAYDVQQD
EzV2MW1uYXEycHBmcmJmazVsZTFpN2o2OHA1c29kaDM5MDR2MTJscDR1MDRwZmxx
MWd1bWtzMDCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBALxldku+jj1y
Ft2mqeo9qJR+gcbCXIYVCnCAPpMbbV0zfwGl7nvvKg6fpqamoB75OWomcienhhJg
1sqk1JXvn7hEm/ANwpFsvp3lAxZupPvj7Mr8uC7G5YYpd5Ztz6oHxRtL40NwylFX
5/XatgZeHPutwK6iYqA+kO/PoXcTwmysD/Mp9LRVH8N2M82lxAQVEy27FuAeVczQ
OY7YKjxdzQ73JdEajBRIo9/CB1oGMqHtY8SFsLkCSeqv1sZeNrd4ixTK7LoM/J/A
ABbHLL5I/OD4pOdkZPEovv+sYfkKtxhufKL+RxAjR//nJGXYTnSWGHHeUs80uy+p
hrvO54OjPnKIiPx25Il+gr3gorujDCWytwqNyKmOD3Mdjhs5teyWN5monVmEIPE6
0sD0isqTfleP7wilZ0SMUXHBRXRJ8pMZksFXLEcqb/Ofj2RF9qH4+QcORKLpkbDf
W3cQma6gA7zyXI5rHQiN/a4YJse27WN8OvE+ScxsYoN6kqRES+Q6VaVA75ebOQCW
FhII4hfQb2ymeS8B3q8+kmOkCv70G8K2SmJ2IUe2VzJGSo/fMkFZioiJYAM8iHHQ
sTvaEg2HR/9ZMpWcGCZ30taLDBh1yyml7tsiobaPalY8P7f4ZWpwb0eMqxDUsw67
qy9WtndLgEWvBAUiCzcV/cSWNzNzOv7vAgMBAAGjggMnMIIDIzAMBgNVHRMEBTAD
AQH/MAsGA1UdDwQEAwIC9DA7BgNVHSUENDAyBggrBgEFBQcDAQYIKwYBBQUHAwIG
CCsGAQUFBwMDBggrBgEFBQcDBAYIKwYBBQUHAwgwEQYJYIZIAYb4QgEBBAQDAgD3
MFgGA1UdEQRRME+CNXYxbW5hcTJwcGZyYmZrNWxlMWk3ajY4cDVzb2RoMzkwNHYx
MmxwNHUwNHBmbHExZ3Vta3MwhwR/AAABhxAAAAAAAAAAAAAAAAAAAAABMB0GA1Ud
DgQWBBSYGAJFOw0Nh2uS6/OdwQUbk2yz/TAhBgsrBgEEAYO+TwICAQEB/wQPVg0x
LjAuMS1hbHBoYS4wMIICGAYLKwYBBAGDvk8CAgIBAf8EggIERIICAI4zZGHhU4Ff
vxrVdHRR3LQejwMMZ69ETzJaIaRknWCPBScgaZpt74FtUXGYygzXZsqc5VszuDmc
tMi5W5E2jrG2h76eH9kp3WTC+BJC/kwZ3YMk1KdG+XasbUYkVdJBTNZ4cB60APE3
x+R5wImYcZw2sVlcQOuJbIhVdEkkR0JN8IbBBvXXgno03hQyFlsz79Z2jNc0a1vS
NdBOPjDrQ+unm7EYhG8C6s2jPqaPF1vPmfoH5mPulNGjEuRJ/403LjvQBaIM/l/G
9olCc5ZJxqyWFyu4VZ3vkSmtJtNXdE4r24RnRt7a3b+uzq55hT9/8ai1PrZAofWK
NwO1baqptSRbimLKceB/HJMfbQcpQn7cYkDNY+cKuGZfvaawBKTg2X4a2A/B0tRF
bohlQmLHOfg8clXETDTb1fI2YoNzMXhlyrqAyLwMxwsBwEpnw6rrv1nDJdz96hAZ
IKZgrVlQqf+0dBA+F93norRTL1uChhbmCQgbj6MiVT/hl7X9odvpCQMF/oI5WlGJ
rvNEE1xNMuBV02OJ9M66VVCsUoS6zc9xQ2McZkVe/SDdQPO4e77uOSn5gdgnJUVL
3iDFHujos/PDFT2WX0KdOkN3M9CoFpn9zN32Aomp+RC6kgmknZf3jq5Z6eU5Uzkp
2oomASDDvfBQSBQrHDNUK8HQ/kmJKut2MA0GCSqGSIb3DQEBCwUAA4ICAQChe0vw
S9wQpuMTC+KVgV0Bcb0Fd5mt6H4hvBeHR0d3v1322vEYs55XAq0UI1XfDc/8KjWH
IOg1MczcgRCpLXVBxxPRT7F8Fj2Qz6XpWpcvuDqDdBHOAt+/ou4Xy8ZB85G1wD5R
x5AcH6UGrjWaXtBnnf18ZYnd5snlxmnbWo/mOznb4pmtVHbAl9d1jab51iRNFbdF
tcAlZ6w6zjxsgLPm9Q4oUXpXjz1uEhmjLkf0QroBZuakFqVOihm6mtA7paon54A3
PmlKkaXT3lktgxniLZ3i9CJTrs84MkswQASiB7l7xh15fsLZnb7+kWnfZ0kyJJNV
yUV9gF9VKhJfJIbjeRF7oEBdicwGC9NGN1a1TJihs0djSV50605I6jhPOK9GZGcz
luABe1HwKuNRwlM4O+I7CRgYGbX6T+ee2xXUDuCtP7u+GeniYAtcepVCcNpd5mRx
QUavJdRCgLZ8jQby1gjaRect7FS8uAujHciqJweX6Z4xUzYSv1OAmagNKftVXwed
A/AYNwxyfldskoCmUrZXkOHlVKLZhCaYFEGvJBSCGkNvrFXnZMvawrzHN4Bf2fAM
NQ84CIcHBL0sSQYDU7lxtN++AZ79sM3Sdt5mJSLM9hA+zpFA7dYm7R2vpSC6NSMv
L0yTxwuI3Jv36EPMOnIHF0CTZIwyGJ57+z1c1w==
-----END CERTIFICATE-----
```
tegefaulkes commented 2 years ago

Incidentally, the client commands don't support a hostname for the client host?

```
[nix-shell:~/matrixcode/polykey/js-polykey]$ npm run polykey -- agent status --node-id v1mnaq2ppfrbfk5le1i7j68p5sodh3904v12lp4u04pflq1gumks0 --client-host testnet.polykey.io --client-port 1314

> polykey@1.0.1-alpha.0 polykey
> ts-node src/bin/polykey.ts "agent" "status" "--node-id" "v1mnaq2ppfrbfk5le1i7j68p5sodh3904v12lp4u04pflq1gumks0" "--client-host" "testnet.polykey.io" "--client-port" "1314"

error: option '-ch, --client-host ' argument 'testnet.polykey.io' is invalid. Host must be an IPv4 or IPv6 address

Usage: polykey agent status [options]

Get the Status of the Polykey Agent

Options:
  -np, --node-path      Path to Node State (default: "/home/faulkes/.local/share/polykey", env: PK_NODE_PATH)
  -pf, --password-file  Path to Password
  -f, --format          Output Format (choices: "human", "json", default: "human")
  -v, --verbose         Log Verbose Messages (default: 0)
  -ni, --node-id        (env: PK_NODE_ID)
  -ch, --client-host    Client Host Address (env: PK_CLIENT_HOST)
  -cp, --client-port    Client Port (env: PK_CLIENT_PORT)
  -h, --help            display help for command
```
CMCDragonkai commented 2 years ago

Can you create a task to support hostnames for client commands? Specifically, when we "connect to" a host we should support hostnames, but when we are "listening", hostnames should not be allowed.
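The connect/listen distinction could be sketched roughly as below. This is not the existing CLI parser: `parseHost`, `HostMode` and the helper checks are hypothetical names, and the IPv6 check is deliberately loose to keep the example self-contained (in Node, `net.isIP` would be the more robust choice).

```typescript
// Hypothetical sketch: hostnames are accepted only when connecting.
type HostMode = 'connect' | 'listen';

function isIPv4(value: string): boolean {
  const parts = value.split('.');
  if (parts.length !== 4) return false;
  return parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
}

// Loose check only: contains a colon and nothing but hex digits, colons, dots.
function isIPv6(value: string): boolean {
  return value.includes(':') && /^[0-9a-fA-F:.]+$/.test(value);
}

function isHostname(value: string): boolean {
  if (value.length === 0 || value.length > 253) return false;
  const labels = value.split('.');
  // An all-numeric final label looks like a malformed IPv4 address.
  if (/^\d+$/.test(labels[labels.length - 1] ?? '')) return false;
  const label = /^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?$/;
  return labels.every((l) => label.test(l));
}

function parseHost(value: string, mode: HostMode): string {
  if (isIPv4(value) || isIPv6(value)) return value;
  if (mode === 'connect' && isHostname(value)) return value;
  const allowed =
    mode === 'connect'
      ? 'an IPv4 address, IPv6 address or hostname'
      : 'an IPv4 or IPv6 address';
  throw new Error(`Host must be ${allowed}`);
}
```

With this split, `parseHost('testnet.polykey.io', 'connect')` is accepted while the same value with `'listen'` throws, which matches the behaviour requested above.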

CMCDragonkai commented 2 years ago

The status can show the current PK version. This is already available as config.sourceVersion. Add this as a new task.

tegefaulkes commented 2 years ago

Starting the agent with the seed node has a problem. It's failing to fully connect to the seed node. We can see the connection in the seed node's logs.

Also note, failing to connect to the seed node here is causing a crash. The seed node still works.

```
[nix-shell:~/matrixcode/polykey/js-polykey]$ npm run polykey -- agent start --node-path tmp/PK5 -v --seed-nodes v1mnaq2ppfrbfk5le1i7j68p5sodh3904v12lp4u04pflq1gumks0@testnet.polykey.io:1314

> polykey@1.0.1-alpha.0 polykey
> ts-node src/bin/polykey.ts "agent" "start" "--node-path" "tmp/PK5" "-v" "--seed-nodes" "v1mnaq2ppfrbfk5le1i7j68p5sodh3904v12lp4u04pflq1gumks0@testnet.polykey.io:1314"

✔ Please enter the password … ********
INFO:PolykeyAgent:Creating PolykeyAgent
...
INFO:Proxy:Starting Forward Proxy from 127.0.0.1:0 to 0.0.0.0:0 and Reverse Proxy from 0.0.0.0:0 to 127.0.0.1:44333
INFO:Proxy:Started Forward Proxy from 127.0.0.1:46073 to 0.0.0.0:54060 and Reverse Proxy from 0.0.0.0:54060 to 127.0.0.1:44333
INFO:NodeConnectionManager:Starting NodeConnectionManager
WARN:NodeManager:Duplicate refreshBucket task was found for bucket 255, cancelling
INFO:NodeConnectionManager:Started NodeConnectionManager
INFO:NodeManager:Syncing nodeGraph
INFO:ConnectionForward 3.106.178.29:1314:Starting Connection Forward
INFO:NodeConnection 3.106.178.29:1314:Creating NodeConnection
INFO:clientFactory:Creating GRPCClientAgent connecting to 3.106.178.29:1314
INFO:Proxy:Handling CONNECT to 3.106.178.29:1314
INFO:NodeConnection 3.106.178.29:1314:Destroying NodeConnection
INFO:NodeConnection 3.106.178.29:1314:Destroyed NodeConnection
ErrorNodeConnectionTimeout: Polykey error
  exitCode 69
  timestamp Tue Oct 25 2022 12:45:10 GMT+1100 (Australian Eastern Daylight Time)
  cause: ErrorGRPCClientTimeout: Client connection timed out
    exitCode 69
    timestamp Tue Oct 25 2022 12:45:10 GMT+1100 (Australian Eastern Daylight Time)
    cause: ErrorGRPCClientTimeout: Client connection timed out
      exitCode 69
      timestamp Tue Oct 25 2022 12:45:10 GMT+1100 (Australian Eastern Daylight Time)
```

The seed node's logs.

| time | log |
| -- | -- |
| 2022-10-25T12:44:51.162+11:00 | {"level":"INFO","key":"Proxy","msg":"Handling connection from 138.199.33.227:3391"} |
| 2022-10-25T12:44:51.194+11:00 | {"level":"INFO","key":"ConnectionReverse 138.199.33.227:3391","msg":"Starting Connection Reverse"} |
| 2022-10-25T12:45:11.184+11:00 | {"level":"WARN","key":"Proxy","msg":"Failed connection from 138.199.33.227:3391 - ErrorConnectionStartTimeout"} |
| 2022-10-25T12:45:11.184+11:00 | {"level":"INFO","key":"Proxy","msg":"Handled connection from 138.199.33.227:3391"} |
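One way to keep a seed node timeout from crashing the agent is to catch each connection failure and record it instead of letting it propagate. The sketch below only illustrates that idea; `syncSeedNodes`, `ConnectFn` and `SyncResult` are hypothetical names and not the actual NodeManager API.

```typescript
// Hypothetical sketch: tolerate individual seed node connection failures.
type SeedNode = { nodeId: string; host: string; port: number };
type ConnectFn = (seed: SeedNode) => Promise<void>;
type SyncResult = { connected: string[]; failed: string[] };

async function syncSeedNodes(
  seeds: SeedNode[],
  connect: ConnectFn,
  warn: (message: string) => void = () => {},
): Promise<SyncResult> {
  const result: SyncResult = { connected: [], failed: [] };
  await Promise.all(
    seeds.map(async (seed) => {
      try {
        await connect(seed);
        result.connected.push(seed.nodeId);
      } catch (e) {
        // Log and carry on: a timeout here should not end the process.
        const reason = e instanceof Error ? e.message : String(e);
        warn(`Failed to connect to seed node ${seed.host}:${seed.port}: ${reason}`);
        result.failed.push(seed.nodeId);
      }
    }),
  );
  return result;
}
```

The caller can then decide what to do when `failed` is non-empty, for example retrying later, rather than exiting with the timeout error.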
CMCDragonkai commented 2 years ago

Seed node connection appears to be working now. So we are proceeding to consolidate our manual tests into the automated tests in PR MatrixAI/Polykey#441.

tegefaulkes commented 2 years ago

I'm getting the same problem as before where the seed node is failing to connect back to me. I've narrowed it down to the VM I'm using, since running on the laptop works.

CMCDragonkai commented 2 years ago

Is your VM using a NATted network or a bridge?

tegefaulkes commented 2 years ago

The default switch is using a NAT. I also configured it to share the host's physical adapter and that had the same problem. It might just be a quirk with VMs or the Windows hypervisor.

I can test it with the VM I set up on my NAS. AFAIK it has its own IP address, so it's a separate machine as far as my home router cares.

CMCDragonkai commented 2 years ago

Don't bother right now. If it works on the laptop from your home, let's proceed to office to home and double node to testnet.

CMCDragonkai commented 2 years ago

From the office with 2 nodes, we are seeing the signalling operation complete. However upon attempting to hole punch each other (with the same IP address but different ports), while we see these packets get sent out, we are not receiving any of these packets.

This is true even after opening the firewall on the local systems. One would have to assume that the CGNAT in the office is blocking these packets.

Home, however, is not behind a CGNAT, so we are now switching to testing with 2 nodes from home. We also used tailscale to connect to the home system, and tailscale status shows a direct connection. This confirms that hole punching can work between office and home, which should imply that the home NAT is a normal NAT and not a CGNAT.


Some additional clarification on CGNAT is required:

Basically hairpinning is technically a solution to the problem of internal communication within the same CGNAT. But the reason why CGNAT will not allow hole punching is the NAT translation.

The IP and port observed by the seed node are not the same IP and port used within the CGNAT. If the CGNAT could rewrite the IP and port to be as if the packet came from the outside IP/port, then it would work, but this is not what CGNATs do, with or without hairpinning.

This means we expect this test to not work in our CGNAT situation, and we can only do this with relaying.
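To make the translation problem concrete, here is a toy model of a NAT's mapping table. It is an assumption-laden illustration, not a description of the office CGNAT: `ToyNat` is hypothetical, the addresses other than the seed node's are documentation placeholders, and "endpoint-dependent mapping" (often called symmetric NAT) is just one well-known behaviour under which the address learned via a seed node is useless to a peer.

```typescript
// Toy model only: how a NAT's mapping policy affects what remote hosts observe.
type Endpoint = { host: string; port: number };

class ToyNat {
  private readonly publicHost: string;
  private readonly endpointDependent: boolean;
  private readonly mappings = new Map<string, number>();
  private nextPort = 40000;

  constructor(publicHost: string, endpointDependent: boolean) {
    this.publicHost = publicHost;
    this.endpointDependent = endpointDependent;
  }

  // Returns the external endpoint that `dst` sees for traffic from `src`.
  observedBy(src: Endpoint, dst: Endpoint): Endpoint {
    const key = this.endpointDependent
      ? `${src.host}:${src.port}->${dst.host}:${dst.port}`
      : `${src.host}:${src.port}`;
    let port = this.mappings.get(key);
    if (port === undefined) {
      port = this.nextPort++;
      this.mappings.set(key, port);
    }
    return { host: this.publicHost, port };
  }
}

const internalNode: Endpoint = { host: '10.0.0.5', port: 1314 };
const seedNode: Endpoint = { host: '3.106.178.29', port: 1314 };
const peerNode: Endpoint = { host: '198.51.100.20', port: 1314 };

// Endpoint-dependent mapping: the peer sees a different port than the seed did.
const symmetricNat = new ToyNat('203.0.113.7', true);
const seenBySeed = symmetricNat.observedBy(internalNode, seedNode);
const seenByPeer = symmetricNat.observedBy(internalNode, peerNode);

// Endpoint-independent mapping: both observe the same external port.
const coneNat = new ToyNat('203.0.113.7', false);
const coneSeenBySeed = coneNat.observedBy(internalNode, seedNode);
const coneSeenByPeer = coneNat.observedBy(internalNode, peerNode);
```

In the endpoint-dependent case, a peer that sends punch packets to the port reported by the seed node is targeting a mapping that does not exist for it, so relaying is the fallback.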

@tegefaulkes I think at this point you just have to merge in the fixes to the signalling mechanism.

And we will need to test on the home network.

CMCDragonkai commented 2 years ago

I noticed on tailscale, if we try to connect to one of the IPs, it recognises that the system is on the same local network, and instead of going through the larger network, it just uses the direct IP:

100.117.43.4    matrix-win-1         tagged-devices windows active; direct 192.168.1.100:41641, tx 140520 rx 776152

Where 192.168.1.100 is the local IP.

This means it's not even doing hole punching or anything, it's a direct connection.


This is pretty smart. I wonder if it's possible for the seed node to also acquire this information and send this back so that way it's possible for us to know that a local IP should be used instead.

CMCDragonkai commented 2 years ago

Ok home to office connection succeeded. That's NAT to CGNAT.

It's possible even CGNAT to CGNAT could work, but we don't know this until we have 2 CGNATs.

It turns out that it doesn't work if we are in the same network. This is because lots of routers lack hairpinning support including both the office router and the home router.

Without hairpinning, one cannot do hole punching between hosts on the same subnet. This is why our double node to seed node test didn't work. It didn't work in the office, and it didn't work at home either.
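A rough way to detect this situation ahead of time: if the seed node observes two nodes with the same external IP, they are most likely behind the same NAT, and punching via that external address would depend on hairpinning. The names below (`ObservedAddress`, `chooseStrategy`) are hypothetical and the check is a heuristic, not something the current signalling code is known to do.

```typescript
// Hypothetical heuristic based on addresses as observed by the seed node.
type ObservedAddress = { nodeId: string; host: string; port: number };
type Strategy = 'hole-punch' | 'local-or-relay';

// Same external IP (different ports) suggests both nodes share one NAT.
function likelySameNat(a: ObservedAddress, b: ObservedAddress): boolean {
  return a.host === b.host;
}

function chooseStrategy(a: ObservedAddress, b: ObservedAddress): Strategy {
  // Without hairpinning, punching via the shared external IP will not work,
  // so fall back to a local address (if known) or a relay.
  return likelySameNat(a, b) ? 'local-or-relay' : 'hole-punch';
}
```

For the double node test above, both nodes would be observed with the same IP and different ports, so `chooseStrategy` would return `'local-or-relay'`.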

This comment https://news.ycombinator.com/item?id=8229792 illustrates some reasons why. Furthermore when comparing to tailscale, they switch to using the local IPs instead of the remote IPs without actually doing any hole punching. Both entail requiring additional logic on the node graph/node connection manager that can take local address data and prioritise it over remote address data. Basically MatrixAI/js-mdns#1 is needed.

In the comment there appear to be several solutions:

  1. Use PCP or PMP to configure the router - this again can lack support
  2. Use local discovery "LAN locator beacons" - multicast... etc
  3. Gather all the subnet information and send it to the seed node. The seed node can then send back all possible IPs, and have the client nodes attempt all of them.
  4. Relay as a last resort.

With 3., we have the problem that network information can be leaked. If we are expecting decentralised NAT MatrixAI/Polykey#365, then this basically provides internal network information to random third parties, which is not a good situation.

For now, we will prefer solution 2. So if at any point we end up trying to contact a particular node, and it has a local address, it should avoid sending a signalling message at all... I'm not sure if that's possible to know whether a particular node address is local or not. We may tag these addresses as "local" if they were acquired through a LAN discovery process.
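The tagging idea could look something like the sketch below, where addresses learned through LAN discovery are marked "local", tried first, and never trigger a signalling message. All names here (`NodeAddress`, `orderAddresses`, `needsSignalling`) are hypothetical; how the node graph would actually store this is still open.

```typescript
// Hypothetical sketch: prefer LAN-discovered addresses and skip signalling for them.
type AddressSource = 'local' | 'remote';
type NodeAddress = { host: string; port: number; source: AddressSource };

// Local addresses first; relative order within each group is preserved.
function orderAddresses(addresses: NodeAddress[]): NodeAddress[] {
  const rank = (a: NodeAddress): number => (a.source === 'local' ? 0 : 1);
  return [...addresses].sort((a, b) => rank(a) - rank(b));
}

// Only remote addresses need a signalling message via the seed node.
function needsSignalling(address: NodeAddress): boolean {
  return address.source !== 'local';
}
```

For example, given a remote address observed by the seed node and a local 192.168.1.100 address found through discovery, `orderAddresses` puts the local one first and `needsSignalling` returns `false` for it.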

CMCDragonkai commented 2 years ago

Still need the post on the logs and full connection after you're ready @tegefaulkes.

CMCDragonkai commented 2 years ago

It's important to see that both seed nodes are capable of discovering each other, as well as new nodes when they enter the cluster.

CMCDragonkai commented 1 year ago

This can be closed now. We know a couple of things:

  1. It is not possible to connect to a node on the same network without MatrixAI/js-mdns#1. Thus any connection tests from the same network are bound to fail.
  2. Local NAT simulation tests are working now again according to @tegefaulkes in MatrixAI/Polykey#474.
  3. There are still problems with the testnet nodes failing when automated testnet connection tests terminate/finish the agent process.
  4. Network is still flaky and causes timeout errors as per MatrixAI/Polykey#474.
  5. We know that NAT to CGNAT works. And seed nodes can contact each other.

A final test is required involving NAT to CGNAT and the 2 seed nodes together. In total 4 nodes should be tested. However, with the number of failures we're seeing, we're going to be blocked on this until we really simplify our networking and RPC code.

CMCDragonkai commented 1 year ago

More sophisticated NAT simulation testing will go to Polykey-Simulation repo.