Closed: CMCDragonkai closed this issue 7 months ago
https://github.com/MatrixAI/Polykey/pull/600 was merged incorrectly. https://github.com/MatrixAI/Polykey/pull/601 is being done by @addievo to fix up the additional options of privateKey, privateKeyPath and recoveryCode. Also they would need to be exposed on PolykeyAgent.start too. See review comments in https://github.com/MatrixAI/Polykey/pull/600
I finished up and merged https://github.com/MatrixAI/Polykey/pull/601. Having a look over the CLI now.
The PolykeyClient is going to be create/destroy only; I think it's pretty clear that this can work.
The change to the GitHubProvider to using graphql for getting connected identities is now causing occasional failures in IdentitiesManager.test.ts. I've seen it on CI and also locally.
FAIL tests/identities/IdentitiesManager.test.ts (10.559 s)
IdentitiesManager
✓ IdentitiesManager readiness (154 ms)
✓ get, set and unset tokens (with seed=1486144507) (1350 ms)
✓ start and stop preserves state (with seed=1486144507) (966 ms)
✓ fresh start deletes all state (with seed=1486144507) (922 ms)
✓ register and unregister providers (103 ms)
✓ using TestProvider (with seed=1486144507) (3075 ms)
✕ handleClaimIdentity (with seed=1486144507) (2837 ms)
● IdentitiesManager › handleClaimIdentity (with seed=1486144507)
Property failed after 1 tests
{ seed: 1486144507, path: "0:0:0:0:0:0:0:0:0:0:0:0:0:3:0:1:1:0:3:0:0:0:0:1:2:1:1:1:2:1", endOnFailure: true }
Counterexample: ["",{"accessToken":" ","refreshToken":"","accessTokenExpiresIn":1698180425,"refreshTokenExpiresIn":0}]
Shrunk 29 time(s)
Got ErrorProviderUnauthenticated: Access token expired
70 | ) {
71 | if (!providerToken.refreshToken) {
> 72 | throw new identitiesErrors.ErrorProviderUnauthenticated(
| ^
73 | 'Access token expired',
74 | );
75 | }
at TestProvider.checkToken (src/identities/Provider.ts:72:15)
at TestProvider.checkToken [as publishClaim] (tests/identities/TestProvider.ts:152:16)
at src/identities/IdentitiesManager.ts:272:39
at constructor_.addClaim (src/sigchain/Sigchain.ts:449:20)
at withF (node_modules/@matrixai/resources/src/utils.ts:24:12)
at constructor_.handleClaimIdentity (src/identities/IdentitiesManager.ts:261:5)
at numRuns (tests/identities/IdentitiesManager.test.ts:361:7)
at AsyncProperty.run (node_modules/fast-check/lib/check/property/AsyncProperty.generic.js:49:28)
Hint: Enable verbose mode in order to have the list of all failing values encountered during the run
at buildError (node_modules/fast-check/lib/check/runner/utils/RunDetailsFormatter.js:131:15)
at asyncThrowIfFailed (node_modules/fast-check/lib/check/runner/utils/RunDetailsFormatter.js:148:11)
@amydevs this is failing because we are relying on the falsiness of the values in the condition rather than matching == null:
public async checkToken(
  providerToken: ProviderToken,
  identityId?: IdentityId,
): Promise<ProviderToken> {
  const now = Math.floor(Date.now() / 1000);
  // this will mean that accessTokenExpiresIn = 0 will be false
  if (
    providerToken.accessTokenExpiresIn &&
    providerToken.accessTokenExpiresIn >= now
  ) {
    // this will mean that refreshToken = '' will throw
    if (!providerToken.refreshToken) {
      throw new identitiesErrors.ErrorProviderUnauthenticated(
        'Access token expired',
      );
    }
    // this will mean that refreshTokenExpiresIn = 0 does not throw
    if (
      providerToken.refreshTokenExpiresIn &&
      providerToken.refreshTokenExpiresIn >= now
    ) {
      throw new identitiesErrors.ErrorProviderUnauthenticated(
        'Refresh token expired',
      );
    }
    return await this.refreshToken(providerToken, identityId);
  }
  return providerToken;
}
This would mean that a value of 0 for the ExpiresIn fields means that they never expire. Is this behaviour intended? If so I'll keep that as it is.
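For reference, a minimal standalone illustration of the difference being discussed (sketch only, not Polykey code). With the fast-check counterexample above ({"accessToken":" ","refreshToken":"", ...}) this is exactly the distinction that flips the behaviour:

// Sketch only: how falsiness and explicit null checks treat 0 and '' differently.
const expiresIn: number | undefined = 0;
const refreshToken: string | undefined = '';

// Falsy checks: 0 and '' are treated the same as undefined ("not set").
console.log(Boolean(expiresIn));     // false, so 0 reads as "never expires"
console.log(Boolean(refreshToken));  // false, so '' reads as "no refresh token"

// Explicit null checks: 0 and '' count as real, present values.
console.log(expiresIn != null);      // true, so 0 would be compared against now
console.log(refreshToken != null);   // true, so '' would be treated as a token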
I think you found out that 0 means infinity.
We've got about 2 weeks, so let's focus hard on getting stability and bugs fixed.
A bunch of different issues have been thrown into this epic. We have to go over prioritisation of which is most important and delegate the tasks. Remember we've got a tight timeframe here.
Critical issues
Needed but functional without
Stretch
Stretch item 8 should be done to make sure timeouts are working.
Add an audit or logs domain to PK at first to track information about the network; we will need this as part of our metrics.
No issue for this, but it's really part of https://github.com/MatrixAI/Polykey/issues/599, and we will need that for metrics for the launch; we can't launch without some metrics.
We need to solve https://github.com/MatrixAI/Polykey/issues/598 to ensure that we don't get hit with a hug of death.
We need to solve our memory leaks - all nodes connect to all testnet nodes. We need randomised round-robin for connecting to the testnet.
We can do this synthetically, load test it.
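A minimal sketch of the randomised selection idea (illustrative only, not actual Polykey code): rather than dialing every seed node, shuffle the seed node list and connect to a small subset; a round-robin rotation could reuse the same shuffle.

// Illustrative sketch: pick up to `limit` seed nodes at random (Fisher-Yates shuffle)
// instead of connecting to every seed node at once.
function pickRandomSeedNodes<T>(seedNodes: readonly T[], limit: number): T[] {
  const shuffled = [...seedNodes];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return shuffled.slice(0, limit);
}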
https://github.com/MatrixAI/Polykey/issues/605 - regardless of how strong the NAT is, it should work either way.
I'm going to focus on fixing https://github.com/MatrixAI/js-timer/issues/15 really quick. It's causing some minor issues across the code bases.
Not sure how to move forward with https://github.com/MatrixAI/js-timer/issues/15. It's an easy fix assuming my assumptions about how it should work are correct. But js-timer has been updated to ESM, so I can't make a new release we can use, unless I can release it as 1.x versions.
Timer is fixed.
For the CLI beta launch to scale, because potentially many people may try PK and thus connect to the testnet, we need to avoid connecting to every seed node in the testnet at the same time. However this means hole punching won't work without a common seed node between source and target. To achieve this we need to revisit the decentralized hole punching problem in https://github.com/MatrixAI/Polykey/issues/365
At the same time due to https://github.com/MatrixAI/Polykey/issues/599 we would likely want to switch to using SRV records, so that A and AAAA records can be reserved for the testnet.polykey.com or mainnet.polykey.com dashboard.
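For illustration, resolving seed nodes over SRV could look something like the sketch below; the '_polykey._udp' service label is a placeholder assumption, not a decided name.

import { resolveSrv } from 'node:dns/promises';

// Sketch only: look up seed nodes via SRV records, leaving the bare A/AAAA
// records of testnet.polykey.com / mainnet.polykey.com free for the dashboard.
async function lookupSeedNodes(
  domain: string,
): Promise<Array<{ host: string; port: number }>> {
  const records = await resolveSrv(`_polykey._udp.${domain}`);
  return records.map((record) => ({ host: record.name, port: record.port }));
}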
Also for the CLI beta launch to work well, we might want to deploy onto mainnet, or default our network to testnet for now. We could even throw an error when attempting to connect to mainnet, which isn't configured at all at the moment. We should prefer a smoother environment... so switching to just mainnet is fine. This launch will be the first version of mainnet then.
Task assignments - for the 2 weeks to launch.
@amydevs
Firstly, fix up the timeouts for RPC, and then apply them to both the agent service and the GitHub auth timeout issue.
@tegefaulkes
@okneigres
@addievo
@CMCDragonkai review stuff!
Stretch:
currently the Polykey agent domain handlers do not use ctx at all, so server-side handlers will never time out. This will need to be done, and tests with timeouts will need to be written in that domain
@okneigres first requirement of PKE is just the testnet/mainnet pages. The second requirement of it running the seed cluster is less important, but would be nice to have as it makes it easier to do complex orchestration.
Almost forgot, but the major thing to solve here is an audit domain so that we can record high level events and report that to the dashboard.
@okneigres
I can start the work but @amydevs or @tegefaulkes needs to take over to do that. Could be relevant to MatrixAI/Polykey#386.
Solving https://github.com/MatrixAI/Polykey/issues/599 requires switching to using SRV records, and at an intermediate stage we won't have hardcoded trusted seed nodes in our config.ts in PK; we can add back in trusted seed nodes after we have CSR capability https://github.com/MatrixAI/Polykey/issues/154
@amydevs eta on MatrixAI/Polykey#50, MatrixAI/Polykey#53, MatrixAI/Polykey#49? They seem to be stacked one after another.
@tegefaulkes it's time to start addressing
Just make sure the majority of the memory leak is contained, we can still continue with a minor leak - lower growth rate, and put those as a later priority.
@CMCDragonkai should be all doable by today, 50 is already merged
10th - status update - bold things are critical
@amy
audit domain because that's where you're going to be understanding the high level events (that is summarised into metrics).
audit domain.
@tegefaulkes
@addievo
@CMCDragonkai
Remember everything is auto deployed on testnet. We need to do empirical testing on all of the problems now.
Just a note: while verification logic will handle cert chains in order of leaf -> root, the PEM format should be ordered root -> leaf. This will need to be checked. @amydevs
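As a sketch of that ordering (hypothetical helper, not actual code): if the in-memory chain is held leaf -> root for verification, the PEM serialisation would just reverse it.

// Hypothetical sketch: serialise a cert chain held in leaf -> root order
// into a PEM bundle ordered root -> leaf, per the note above.
// Assumes each entry is already a complete PEM block ending in a newline.
function certChainToPEM(certChainLeafToRoot: readonly string[]): string {
  return [...certChainLeafToRoot].reverse().join('');
}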
I have an Ubuntu VM I can access from the tailscale network now. I'm trying to test the Linux version of Polykey-CLI to test hole punching between networks.
Running the program results in the following error:
XYZ@pttest:~$ ./result
pkg/prelude/bootstrap.js:2255
return ancestor.dlopen.apply(process, args);
^
Error: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /tmp/pkg/c65c5bc459c9452b15357f0d369812780597139e731512beb6903949ed560ba3/@matrixai/db/prebuilds/linux-x64/node.napi.node)
at process.dlopen (pkg/prelude/bootstrap.js:2255:28)
at Module._extensions..node (node:internal/modules/cjs/loader:1338:18)
at Module.load (node:internal/modules/cjs/loader:1117:32)
at Module._load (node:internal/modules/cjs/loader:958:12)
at Module.require (node:internal/modules/cjs/loader:1141:19)
at Module.require (pkg/prelude/bootstrap.js:1851:31)
at require (node:internal/modules/cjs/helpers:110:18)
at load (/snapshot/Polykey-CLI/node_modules/@matrixai/db/node_modules/node-gyp-build/index.js:22:10)
at Object.<anonymous> (/snapshot/Polykey-CLI/node_modules/@matrixai/db/dist/native/rocksdb.js:8:46)
at Module._compile (node:internal/modules/cjs/loader:1254:14) {
code: 'ERR_DLOPEN_FAILED'
}
Node.js v18.15.0
It seems to be missing a dynamic library. I don't know if this is a problem with the building/bundling process. It's looking for it in /lib/ so I don't think it would be included in the package. Is this just something we need to install for Polykey to run?
See https://chat.openai.com/share/be6a1745-59e7-4e76-b5a4-2f7f02f7f3bc
Non-NixOS suck. That's why nobody uses it.
That is a C++ standard library required by the latest build of rocksdb.
Error: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /tmp/pkg/c65c5bc459c9452b15357f0d369812780597139e731512beb6903949ed560ba3/@matrixai/db/prebuilds/linux-x64/node.napi.node)
Most OS distributions would need to package manage such things, whenever the standard library of C++ is updated. RocksDB is a fast moving project most likely. We don't have resources to deal with people on old machines/VMs.
Do your tests on NixOS preferably.
@amydevs progress update:
I was halfway through the sensitive replacer for errors; it should be done, I just need to confirm edge cases with tests. Other than that, the first 2 commits of STDOUT standardization are mergeable; the 3rd commit is a WIP that I want to batch up a bunch of changes in, since they largely all have the same things that need to be fixed.
I was also still reading through GestaltGraph and Discovery to get a good idea on how to get GestaltID sharing working.
@amydevs will need you to start speccing out the audit domain.
We will need coordination between 3 areas:
There's an EventTarget leak on an abort signal somewhere.
(node:3176) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
(node:3176) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
(node:3176) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
(node:3176) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
(node:3176) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
I forgot that event targets have limits. We should probably set it to a higher number anyway in js-events because the default limit is quite low for what we are doing. But you think there's a leak? You might want to log out verbosely where all the registrations are; you can monkey patch the registration function.
For abort signals we chain them, so the amount of handlers is usually shallow. Even if we don't chain them we usually don't keep adding handlers to the signal. So yeah, I think I need to look into this.
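A minimal sketch of that kind of monkey patching for tracing where abort listeners get registered (debug-only, illustrative names, not Polykey code):

import { setMaxListeners } from 'node:events';

// Raise the warning threshold on a given signal while investigating
// (assumption: 20 is enough headroom for legitimate fan-out).
function raiseAbortListenerCap(signal: AbortSignal, limit = 20): void {
  setMaxListeners(limit, signal);
}

// Debug-only: log a stack trace every time an 'abort' listener is added,
// so the leaking registration site shows up in the output.
const originalAddEventListener = EventTarget.prototype.addEventListener;
EventTarget.prototype.addEventListener = function (
  this: EventTarget,
  type: string,
  listener: any,
  options?: any,
) {
  if (this instanceof AbortSignal && type === 'abort') {
    console.log('abort listener registered at:', new Error().stack);
  }
  return originalAddEventListener.call(this, type, listener, options);
};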
@tegefaulkes has closed off a lot of the nodes related issues. We should do another test between Austin and Sydney again to confirm.
Also this issue https://github.com/MatrixAI/Polykey-CLI/issues/47 was related to the same ipv4/ipv6 mapping issue. Was this solved too?
If all of this is done, you can help me with https://github.com/MatrixAI/Polykey/issues/365 and any other quality of life problems like you mentioned above.
Based on the current set of outstanding issues, there are some quality of life things to consider here.
@amydevs the vaults sharing with gestalt id will be put on the backburner for now.
@tegefaulkes some quick things to do:
Then onto decentralized hole punching - https://github.com/MatrixAI/Polykey/issues/365; this can be done in 2 separate pieces:
Completed two of the issues. MatrixAI/Polykey#52 is done, but I cut out a bunch of changes when rebasing it and I want to check whether they were needed or not. There is no mention of what these changes were for in the PR or issue, and AFAIK the formatting was fixed in a newer PR already.
I can look at the DB change but it's not critical, same for MatrixAI/js-encryptedfs#5. Either of them can be done later.
By Monday, https://github.com/MatrixAI/Polykey/issues/628 should be providing audit RPC calls that can be called by the dashboard service @okneigres. The CLI commands can be added in concurrently by @amydevs.
Just finished the meeting. Here is a general outline of what was discussed.
Brian
- connectionIdShared by string cmp is the one selected from the map when using connections (see the sketch after this list)
- All connections time out using the usual mechanism
- NodeGraph structure expansion.
- NodeManager to handle higher level connection logic and ICE.
Oleg
Amy
Addie
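As a sketch of the shared-connection selection by string comparison mentioned above (illustrative names, not actual NodeConnectionManager code): both peers can deterministically agree on the same connection by picking, say, the lexicographically smallest connection ID from the map.

// Illustrative sketch: both sides pick the same connection out of the map of
// active connections by comparing connection IDs as strings.
function selectSharedConnection<Conn>(
  connections: Map<string, Conn>,
): Conn | undefined {
  let selectedId: string | undefined;
  for (const connectionId of connections.keys()) {
    if (selectedId == null || connectionId < selectedId) {
      selectedId = connectionId;
    }
  }
  return selectedId == null ? undefined : connections.get(selectedId);
}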
If @okneigres gets to AWS deployment, please help him @tegefaulkes by pairing up. It can be easier since we have more experience doing Nix, Docker and AWS.
@amydevs switch to using online running algorithms. ChatGPT this for more information on calculating metrics. It does not require running over the entire dataset all the time. See things like running/moving averages. It can be streamed.
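For example, a running (online) mean only needs the count and the current mean, so samples can be streamed without keeping the whole dataset (sketch only, not Polykey code):

// Online mean: O(1) update per sample, no history retained.
class RunningMean {
  protected count = 0;
  protected mean = 0;

  public update(sample: number): number {
    this.count += 1;
    this.mean += (sample - this.mean) / this.count;
    return this.mean;
  }
}

// Usage: feed samples (e.g. connection counts) as they arrive.
const averageConnections = new RunningMean();
averageConnections.update(3);
averageConnections.update(5); // running mean is now 4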
@addievo the dashboard and website is going to be part of our launch party. It's going to be big and dazzling! We want it to be billboard level.
Monday, we get an update on the situation from https://github.com/MatrixAI/Polykey/issues/599 and https://github.com/MatrixAI/Polykey/pull/618.
I'll be drawing out the plan for the docs.
Making some notes for myself to follow up. @tegefaulkes
- 'fails to find an unknown node' due to problem with RPC client timeout. - https://github.com/MatrixAI/Polykey-CLI/issues/37
- NodeManager.
- vaults.test.ts > should return the vaults names and ids of the remote vault has an intermittent failure due to the output formatter.
- nodes getall, but there may be other places where this is broken. https://github.com/MatrixAI/Polykey-CLI/issues/78
- Polykey. I've also seen the tasks domain deadlocking when stopping once. Potential race condition here?
- NCM were reverted during the nodes refactoring. This needs to be fixed.
- AbortSignal is causing this but there may be other instances. This needs to be looked into and fixed. - https://github.com/MatrixAI/MatrixAI-Graph/issues/42
I need to create issues for problems that need to be solved later.
Ok, so the current main problem with the CLI is that the docker integration test is failing.
We've discovered that it's a combination of a few factors.
Polykey is deadlocking while stopping when there are active connections. So nodes, while testing, are connecting to the mainnet and then timing out the tests when stopping.
Moving forward we need to fix the following.
NodeConnectionManager's fault. I need to check if it's forcing connections to close properly or if we're draining connections properly. I just checked and it's the PolykeyAgent that is doing an unforced stop of the NodeConnectionManager.

addie@Addie-MB-Server:~$ polykey -np /home/addie/Node2 vaults clone vault2 vcnjo8bhfqhusvcd89u12getdqqiqbvf7avfed2u71udcgdl660fg --verbose
INFO:polykey.PolykeyClient:Creating PolykeyClient
INFO:polykey.PolykeyClient.Session:Creating Session
INFO:polykey.PolykeyClient.Session:Setting session token path to /home/addie/Node2/token
INFO:polykey.PolykeyClient.Session:Starting Session
INFO:polykey.PolykeyClient.Session:Started Session
INFO:polykey.PolykeyClient.Session:Created Session
INFO:polykey.PolykeyClient:Starting PolykeyClient
INFO:polykey.PolykeyClient.WebSocketClient:Create WebSocketClient to ::1:46397
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:Start WebSocketConnection
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:Started WebSocketConnection
INFO:polykey.PolykeyClient.WebSocketClient:Created WebSocketClient to wss://[::1]:46397
INFO:polykey.PolykeyClient:Started PolykeyClient
INFO:polykey.PolykeyClient:Created PolykeyClient
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0.WebSocketStream 0:Start WebSocketStream
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0.WebSocketStream 0:Started WebSocketStream
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0.WebSocketStream 0:Destroy WebSocketStream
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0.WebSocketStream 0:Destroyed WebSocketStream
INFO:polykey.PolykeyClient:Stopping PolykeyClient
INFO:polykey.PolykeyClient.WebSocketClient:Destroy WebSocketClient on wss://[::1]:46397
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:Stop WebSocketConnection
INFO:polykey.PolykeyClient.WebSocketClient:ErrorWebSocketConnectionLocal: WebSocket Connection local error - Locally closed with code 1000
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:ErrorWebSocketConnectionLocal: WebSocket Connection local error - Locally closed with code 1000
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:Stopped WebSocketConnection
INFO:polykey.PolykeyClient.Session:Stopping Session
INFO:polykey.PolykeyClient.Session:Stopped Session
INFO:polykey.PolykeyClient:Stopped PolykeyClient
ErrorRPCTimedOut: RPC has timed out
addie@Addie-MB-Server:~$ polykey -np /home/addie/Node2 vaults scan vcnjo8bhfqhusvcd89u12getdqqiqbvf7avfed2u71udcgdl660fg
vault2 z8WvCNBUwEsrXw9ySPXsuWH pull,clone
Another issue encountered during QA: when a new PK agent was bootstrapped and started, but not allowed to start gracefully and closed before the first start was completed, the agent could no longer be started, even after killing the agent or trying to reboot the system. The throw was:
polykey agent start --background
✔ Enter new password … *
✔ Confirm new password … *
ErrorPolykeyCLIAgentProcess: Agent process closed during fork
This was resolved after deleting the np and setting up a new node. This can be detrimental in prod as it renders all the vaults inaccessible.
The "closed during fork" error was observed on NixOS specifically and I am unable to reproduce it now, the cloning error is occurring across platforms on NixOS, Ubuntu and MacOS. I am on commit c8607b9aff37bf2187ce638ac4480ada8999faf6.
@tegefaulkes
The problem here is that these commands are failing to make the connection to do their thing. So the connection attempt via findNode will usually take longer than the 15 second RPC timeout.
So the real problem here is that the RPC is timing out during long running commands and resulting in a very undescriptive error. A failure to add a node should take as long as it wants and time out with a failure to ping or a more descriptive error. The same goes for any commands that make a connection.
I'm going to make an issue for this.
Specification
This epic focuses on the CLI beta launch targeting November 10 2023.
This follows from the 6th testnet deployment https://github.com/MatrixAI/Polykey/issues/551 and focuses on any UX issues that should be resolved before we launch. Plus any documentation related things and metrics for tracking how the launch went, as well as content we need to write to prepare for it.
We should also get our demo video fully polished as well.
One of the things we need to do is:
- Add an audit or logs domain to PK at first to track information about the network, we will need this as part of our metrics.
- testnet.polykey.com and mainnet.polykey.com so we can see what's going on and show everybody how the network is building up.

We are currently working through this list of issues. https://github.com/MatrixAI/Polykey-CLI/issues/40#issuecomment-1853222990
Additional context
Tasks