MatrixAI / Polykey-CLI

Polykey CLI - Open Source Decentralized Secret Sharing System for Zero Trust Workflows
https://polykey.com
GNU General Public License v3.0
6 stars 3 forks source link

CLI Beta Launch #40

Closed CMCDragonkai closed 5 months ago

CMCDragonkai commented 12 months ago

Specification

This epic focuses on the CLI beta launch targeting November 10 2023.

This follows from the 6th testnet deployment https://github.com/MatrixAI/Polykey/issues/551 and focuses on any UX issues that should be resolved before we launch. Plus any documentation related things and metrics for tracking how the launch went, as well as content we need to write to prepare for it.

We should also get our demo video fully polished as well.

One of the things we need to do is:

  1. Add a audit or logs domain to PK at first to track information about the network, we will need this as part of our metrics.
  2. Create a simple dashboard on testnet.polykey.com and mainnet.polykey.com so we can see what's going on and show everybody how the network is building up.

We are currently working through this list of issues. https://github.com/MatrixAI/Polykey-CLI/issues/40#issuecomment-1853222990

Additional context

Tasks

  1. Spec out the experience for the launch
  2. Schedule content releases
  3. Plan out all UX-related issues
CMCDragonkai commented 11 months ago

https://github.com/MatrixAI/Polykey/pull/600 was merged incorrectly, https://github.com/MatrixAI/Polykey/pull/601 is being done by @addievo to fix up the additional options of privateKey and privateKeyPath and recoveryCode. Also they would need to be exposed on PolykeyAgent.start too. See review comments in https://github.com/MatrixAI/Polykey/pull/600

tegefaulkes commented 11 months ago

I finished up and merged https://github.com/MatrixAI/Polykey/pull/601. Having a look over the CLI now.

CMCDragonkai commented 11 months ago

The PolykeyClient is going to be create/destroy only, I think it's pretty clear that can work.

CMCDragonkai commented 11 months ago

The change to the GitHubProvider to using graphql for getting connected identities is now causing occassional failures in the IdentitiesManager.test.ts. I've seen it on CI and also locally.

 FAIL  tests/identities/IdentitiesManager.test.ts (10.559 s)
  IdentitiesManager
    ✓ IdentitiesManager readiness (154 ms)
    ✓ get, set and unset tokens (with seed=1486144507) (1350 ms)
    ✓ start and stop preserves state (with seed=1486144507) (966 ms)
    ✓ fresh start deletes all state (with seed=1486144507) (922 ms)
    ✓ register and unregister providers (103 ms)
    ✓ using TestProvider (with seed=1486144507) (3075 ms)
    ✕ handleClaimIdentity (with seed=1486144507) (2837 ms)

  ● IdentitiesManager › handleClaimIdentity (with seed=1486144507)

    Property failed after 1 tests
    { seed: 1486144507, path: "0:0:0:0:0:0:0:0:0:0:0:0:0:3:0:1:1:0:3:0:0:0:0:1:2:1:1:1:2:1", endOnFailure: true }
    Counterexample: ["",{"accessToken":"          ","refreshToken":"","accessTokenExpiresIn":1698180425,"refreshTokenExpiresIn":0}]
    Shrunk 29 time(s)
    Got ErrorProviderUnauthenticated: Access token expired

      70 |     ) {
      71 |       if (!providerToken.refreshToken) {
    > 72 |         throw new identitiesErrors.ErrorProviderUnauthenticated(
         |               ^
      73 |           'Access token expired',
      74 |         );
      75 |       }

      at TestProvider.checkToken (src/identities/Provider.ts:72:15)
      at TestProvider.checkToken [as publishClaim] (tests/identities/TestProvider.ts:152:16)
      at src/identities/IdentitiesManager.ts:272:39
      at constructor_.addClaim (src/sigchain/Sigchain.ts:449:20)
      at withF (node_modules/@matrixai/resources/src/utils.ts:24:12)
      at constructor_.handleClaimIdentity (src/identities/IdentitiesManager.ts:261:5)
      at numRuns (tests/identities/IdentitiesManager.test.ts:361:7)
      at AsyncProperty.run (node_modules/fast-check/lib/check/property/AsyncProperty.generic.js:49:28)
      Hint: Enable verbose mode in order to have the list of all failing values encountered during the run
      at buildError (node_modules/fast-check/lib/check/runner/utils/RunDetailsFormatter.js:131:15)
      at asyncThrowIfFailed (node_modules/fast-check/lib/check/runner/utils/RunDetailsFormatter.js:148:11)

@amydevs

amydevs commented 11 months ago

The change to the GitHubProvider to using graphql for getting connected identities is now causing occassional failures in the IdentitiesManager.test.ts. I've seen it on CI and also locally.

 FAIL  tests/identities/IdentitiesManager.test.ts (10.559 s)
  IdentitiesManager
    ✓ IdentitiesManager readiness (154 ms)
    ✓ get, set and unset tokens (with seed=1486144507) (1350 ms)
    ✓ start and stop preserves state (with seed=1486144507) (966 ms)
    ✓ fresh start deletes all state (with seed=1486144507) (922 ms)
    ✓ register and unregister providers (103 ms)
    ✓ using TestProvider (with seed=1486144507) (3075 ms)
    ✕ handleClaimIdentity (with seed=1486144507) (2837 ms)

  ● IdentitiesManager › handleClaimIdentity (with seed=1486144507)

    Property failed after 1 tests
    { seed: 1486144507, path: "0:0:0:0:0:0:0:0:0:0:0:0:0:3:0:1:1:0:3:0:0:0:0:1:2:1:1:1:2:1", endOnFailure: true }
    Counterexample: ["",{"accessToken":"          ","refreshToken":"","accessTokenExpiresIn":1698180425,"refreshTokenExpiresIn":0}]
    Shrunk 29 time(s)
    Got ErrorProviderUnauthenticated: Access token expired

      70 |     ) {
      71 |       if (!providerToken.refreshToken) {
    > 72 |         throw new identitiesErrors.ErrorProviderUnauthenticated(
         |               ^
      73 |           'Access token expired',
      74 |         );
      75 |       }

      at TestProvider.checkToken (src/identities/Provider.ts:72:15)
      at TestProvider.checkToken [as publishClaim] (tests/identities/TestProvider.ts:152:16)
      at src/identities/IdentitiesManager.ts:272:39
      at constructor_.addClaim (src/sigchain/Sigchain.ts:449:20)
      at withF (node_modules/@matrixai/resources/src/utils.ts:24:12)
      at constructor_.handleClaimIdentity (src/identities/IdentitiesManager.ts:261:5)
      at numRuns (tests/identities/IdentitiesManager.test.ts:361:7)
      at AsyncProperty.run (node_modules/fast-check/lib/check/property/AsyncProperty.generic.js:49:28)
      Hint: Enable verbose mode in order to have the list of all failing values encountered during the run
      at buildError (node_modules/fast-check/lib/check/runner/utils/RunDetailsFormatter.js:131:15)
      at asyncThrowIfFailed (node_modules/fast-check/lib/check/runner/utils/RunDetailsFormatter.js:148:11)

@amydevs

this is failing because we are taking we are using the falsiness of the values in the statement rather than matching == null:

public async checkToken(
    providerToken: ProviderToken,
    identityId?: IdentityId,
  ): Promise<ProviderToken> {
    const now = Math.floor(Date.now() / 1000);
    // this will mean that accessTokenExpiresIn = 0 will be false
    if (
      providerToken.accessTokenExpiresIn &&
      providerToken.accessTokenExpiresIn >= now
    ) {
      // this will mean that refreshToken = '' will throw
      if (providerToken.refreshToken) {
        throw new identitiesErrors.ErrorProviderUnauthenticated(
          'Access token expired',
        );
      }
      // this will mean that refreshTokenExpiresIn = 0 does not throw
      if (
        providerToken.refreshTokenExpiresIn &&
        providerToken.refreshTokenExpiresIn >= now
      ) {
        throw new identitiesErrors.ErrorProviderUnauthenticated(
          'Refresh token expired',
        );
      }
      return await this.refreshToken(providerToken, identityId);
    }
    return providerToken;
  }

This would mean that a value of 0 for the ExpiresInmean that they never expire.

Is this behaviour intended? If so I'll keep that as it is.

CMCDragonkai commented 11 months ago

I think you found out that 0 means infinity.

CMCDragonkai commented 11 months ago

We got about 2 weeks, so let's focus hard on getting stability and bugs fixed.

CMCDragonkai commented 11 months ago

A bunch of different issues have been thrown into this epic. We have to go over prioritisation of which is most important and delegate the tasks. Remember we got a tight timeframe here.

tegefaulkes commented 11 months ago

Critical issues

  1. https://github.com/MatrixAI/Polykey/issues/592 Pretty major problem, results in nodes randomly crashing.
  2. https://github.com/MatrixAI/Polykey/issues/597 - An oddity and doesn't really break things, but should be looked into.
  3. https://github.com/MatrixAI/js-rpc/issues/47 https://github.com/MatrixAI/Polykey/issues/588 https://github.com/MatrixAI/Polykey/issues/589 - Critical from a usability standpoint. Otherwise not strictly broken.
  4. https://github.com/MatrixAI/Polykey-CLI/issues/44 - Critical from a usability stand point.
  5. https://github.com/MatrixAI/js-timer/issues/15 - macro task resource leak.

needed but functional without

  1. https://github.com/MatrixAI/Polykey/issues/598 - Performance problem, Something to address sooner than later but things like this take a fair amount of work to track down the source.
  2. https://github.com/MatrixAI/Polykey/issues/605 - Not super big issue, Right now it means more stricter nats can't be punched to. I have an idea for a fix.
  3. MatrixAI/js-encryptedfs#5 - Helps cud
  4. https://github.com/MatrixAI/Polykey/issues/572
  5. https://github.com/MatrixAI/Polykey/issues/593
  6. MatrixAI/Polykey#37
  7. https://github.com/MatrixAI/Polykey/issues/594, MatrixAI/Polykey#38
  8. https://github.com/MatrixAI/Polykey/issues/608
  9. MatrixAI/Polykey#22
  10. MatrixAI/Polykey#518 - Is this even still a bug? I think we would've seen it during testnet testing
  11. https://github.com/MatrixAI/Polykey/issues/564 - has been expanded since this issue was created. Just needs a look over.
  12. https://github.com/MatrixAI/js-rpc/issues/19 https://github.com/MatrixAI/js-rpc/issues/42

stretch

  1. MatrixAI/Polykey#36
  2. https://github.com/MatrixAI/Polykey/issues/599
  3. MatrixAI/Polykey#30
  4. https://github.com/MatrixAI/Polykey/issues/237
  5. https://github.com/MatrixAI/Polykey-CLI/issues/31
  6. MatrixAI/Polykey#6
  7. https://github.com/MatrixAI/Polykey-CLI/issues/71
CMCDragonkai commented 11 months ago

Stretch 8. should be done to make sure timeouts are working.

Add a audit or logs domain to PK at first to track information about the network, we will need this as part of our metrics.

No issue for this, but it's really part of https://github.com/MatrixAI/Polykey/issues/599, and we will need that for metrics for the launch, can't launch without some metrics.

CMCDragonkai commented 11 months ago

We need to solve that https://github.com/MatrixAI/Polykey/issues/598 to ensure that we don't get hit with a hug of death.

We need to solve our memory leaks - all nodes connect to all testnet nodes. We need randomised round-robin for connecting to the testnet.

We can do this synthetically, load test it.

CMCDragonkai commented 11 months ago

https://github.com/MatrixAI/Polykey/issues/605 - regardless of how strong the strong NAT - it should work either way.

tegefaulkes commented 11 months ago

I'm going to focus on fixing https://github.com/MatrixAI/js-timer/issues/15 really quick. It's causing some minor issues across the code bases.

tegefaulkes commented 11 months ago

Not sure how to move forward with https://github.com/MatrixAI/js-timer/issues/15, It's an easy fix assuming my assumptions about how it should work is correct. But js-timer has been updated to ESM. So I can't make a new release we can use. Unless I can release it as 1.x versions.

CMCDragonkai commented 11 months ago

Timer is fixed.

CMCDragonkai commented 11 months ago

For the CLI beta launch to scale, because potentially many people may try PK and thus connect to the testnet, we need to avoid connecting to every seed node in the testnet at the same time. However this means hole punching won't work without a common seed node between source and target. To achieve this we need to revisit the decentralized hole punching problem in https://github.com/MatrixAI/Polykey/issues/365

At the same time due to https://github.com/MatrixAI/Polykey/issues/599 we would likely want to switch to using SRV records, so that A and AAAA records can be reserved for the testnet.polykey.com or mainnet.polykey.com dashboard.

Also for the CLI beta launch to work well, we might want to deploy onto mainnet. OR default our network to testnet for now. One can even throw an error if attempting to connect to mainnet that isn't configured at all atm. We should prefer a smoother environment... so switching to just mainnet is fine. This launch will be the first version of mainnet then.

CMCDragonkai commented 11 months ago

Task assignments - for the 2 weeks to launch.

@amydevs

Firstly to fix up the timeouts for RPC, and then apply to both the agent service and the github auth timeout issue.

@tegefaulkes

@okneigres

@addievo

@CMCDragonkai review stuff!

Stretch:

amydevs commented 11 months ago

currently Polykey agents domain do not use ctx at all, so server-side handlers will never time out. This will need to be done, and tests will need to be written with timeouts in that domain

CMCDragonkai commented 11 months ago

@okneigres first requirement of PKE is just the testnet/mainnet pages. The second requirement of it running the seed cluster is less important, but would be nice to have as it makes it easier to do complex orchestration.

CMCDragonkai commented 11 months ago

Almost forgot, but the major thing to solve here is an audit domain so that we can record high level events and report that to the dashboard.

@okneigres

I can start the work but @amydevs or @tegefaulkes needs to take over to do that. Could be relevant to MatrixAI/Polykey#386.

CMCDragonkai commented 11 months ago

Solving https://github.com/MatrixAI/Polykey/issues/599 requires switching to using SRV records, and at an intermediate stage we won't have hardcoded trusted seed nodes in our config.ts in PK, and we can add back in trusted seed nodes after we have CSR capability https://github.com/MatrixAI/Polykey/issues/154

CMCDragonkai commented 11 months ago

@amydevs eta on MatrixAI/Polykey#50, MatrixAI/Polykey#53, MatrixAI/Polykey#49? They seem to be stacked one after another.

CMCDragonkai commented 11 months ago

@tegefaulkes it's time to start addressing

image

Just make sure the majority of the memory leak is contained, we can still continue with a minor leak - lower growth rate, and put those as a later priority.

amydevs commented 11 months ago

@amydevs eta on MatrixAI/Polykey#50, MatrixAI/Polykey#53, MatrixAI/Polykey#49? They seem to be stacked one after another.

@CMCDragonkai should be all doable by today, 50 is already merged

CMCDragonkai commented 11 months ago

10th - status update - bold things are critical

@amy

@tegefaulkes

@addievo

@CMCDragonkai

CMCDragonkai commented 11 months ago

Remember everything is auto deployed on testnet. WE need to do empirical testing on all of the problems now.

tegefaulkes commented 11 months ago

Just a note, while verification logic will handle cert chains in order of leaf -> root. The PEM format should be ordered in root -> leaf. This will need to be checked. @amydevs

tegefaulkes commented 11 months ago

I've I have an Ubuntu VM I can access from the tailscale network now. I'm trying to test the linux version of Polykey-CLI to test hole punching between networks.

Running the program results in the following error..

XYZ@pttest:~$ ./result
pkg/prelude/bootstrap.js:2255
    return ancestor.dlopen.apply(process, args);
                           ^

Error: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /tmp/pkg/c65c5bc459c9452b15357f0d369812780597139e731512beb6903949ed560ba3/@matrixai/db/prebuilds/linux-x64/node.napi.node)
    at process.dlopen (pkg/prelude/bootstrap.js:2255:28)
    at Module._extensions..node (node:internal/modules/cjs/loader:1338:18)
    at Module.load (node:internal/modules/cjs/loader:1117:32)
    at Module._load (node:internal/modules/cjs/loader:958:12)
    at Module.require (node:internal/modules/cjs/loader:1141:19)
    at Module.require (pkg/prelude/bootstrap.js:1851:31)
    at require (node:internal/modules/cjs/helpers:110:18)
    at load (/snapshot/Polykey-CLI/node_modules/@matrixai/db/node_modules/node-gyp-build/index.js:22:10)
    at Object.<anonymous> (/snapshot/Polykey-CLI/node_modules/@matrixai/db/dist/native/rocksdb.js:8:46)
    at Module._compile (node:internal/modules/cjs/loader:1254:14) {
  code: 'ERR_DLOPEN_FAILED'
}

Node.js v18.15.0

It seems to be missing a dynamic library. I don't know if this is a problem with the building/bundling process. It's looking for it in /lib/ so I don't think it would be included in the package.

Is this just something we need to install for Polykey to run?

CMCDragonkai commented 11 months ago

See https://chat.openai.com/share/be6a1745-59e7-4e76-b5a4-2f7f02f7f3bc

Non-NixOS suck. That's why nobody uses it.

CMCDragonkai commented 11 months ago

That is a C++ standard library required by the latest build of rocksdb.

Error: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /tmp/pkg/c65c5bc459c9452b15357f0d369812780597139e731512beb6903949ed560ba3/@matrixai/db/prebuilds/linux-x64/node.napi.node)

Most OS distributions would need to package manage such things, whenever the standard library of C++ is updated. RocksDB is a fast moving project most likely. We don't have resources to deal with people on old machines/VMs.

Do your tests on NixOS preferably.

CMCDragonkai commented 11 months ago

@amydevs progress update:

was half way through the sensitive replacer for errors, it should be done. Just need to confirm edgecases with tests. Other than that, the first 2 commits of STDOUT standardization are mergable, the 3rd commit is a WIP that I wan't to batch up a bunch of changes in, cos they are largely all have the same things that need to be fixed.

And was still reading through GestaltGraph and Discovery to get a good idea on how to get GestaltID sharing working

CMCDragonkai commented 11 months ago

@amydevs will need you to start speccing out the audit domain.

CMCDragonkai commented 11 months ago

image

We will need coordination between 3 areas:

  1. The dashboard testnet/mainnet service @okneigres
  2. The polykey.com/api CF worker that has to call the testnet/mainnet service to get the data as a static summary @addievo
  3. The audit domain of each PK agent that then provides that information @amydevs
tegefaulkes commented 11 months ago

There's an EventTarget leak on an abort signal somewhere.

(node:3176) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
(node:3176) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
(node:3176) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
(node:3176) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
(node:3176) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. Use events.setMaxListeners() to increase limit
CMCDragonkai commented 11 months ago

I forgot that event targets have limits. We should probably set it to a higher number anyway in js-events because the default limit is quite low for what we are doing. But you think there's a leak? You might want to log out verbosely where all the registrations are, can monkey patch the registration function.

tegefaulkes commented 11 months ago

For abort signals we chain them, so the amount of handlers is usually shallow. Even if we don't chain them we usually don't keep adding handlers to the signal. SO Yeah, I think I need to look into this.

CMCDragonkai commented 11 months ago

@tegefaulkes has closed off alot of the nodes related issues. We should do another test between Austin and Sydney again to confirm.

Also this issue https://github.com/MatrixAI/Polykey-CLI/issues/47 was related to the same ipv4/ipv6 mapping issue. Was this solved too?

If all of this is done, you can help me with https://github.com/MatrixAI/Polykey/issues/365 and any other quality of life problems like you mentioned above.

CMCDragonkai commented 11 months ago

Based on the current set of outstanding issues, there is some quality of life things to consider here.

@amydevs

The vaults sharing with gestalt id will be put on backburner for now.

@tegefaulkes some quick things to do:

Then onto decentralized hole punching - https://github.com/MatrixAI/Polykey/issues/365 this can be done in 2 separate pieces:

tegefaulkes commented 11 months ago

Completed two of the issues. MatrixAI/Polykey#52 is done but I cut out a bunch of changes when rebasing it and I want to make sure if it was needed or not. There are no mention of what these changes were for in the PR or issue and AFAIK the formatting was fixed in a newer RP already.

I can look at the DB change but it's not critical, same for MatrixAI/js-encryptedfs#5. Either of them can be done later.

CMCDragonkai commented 11 months ago

By Monday the https://github.com/MatrixAI/Polykey/issues/628 should be providing audit RPC calls that can be called by the dashboard service @okneigres. The CLI commands can be added in concurrently by @amydevs

tegefaulkes commented 11 months ago

Just finished the meeting. Here is a general outline of what was discussed.

Brian

  1. Working on the Decentralised hole punching a. NCM mostly done b. NCM now focuses on holding and making the connections c. there are 2 styles, direct connection on host and port, or hole punch with signalling d. The Connection map now can hold multiple connections per target node. e. The connection with the 'lowest' connectionIdShared by string cmp is the one selected from the map when using connections f. All connections time out using the usual mechanism
  2. Possibly blocked on NodeGraph structure expansion.
  3. Moving on to NodeManager to handle higher level connection logic and ICE.

Oleg

  1. He has an API for mapping IP to co-ords on the map
  2. We could use a basic vector graphic for the world map.
  3. Main things that are left are to put it all together and deploy.
  4. TODO a. Resolving some issues building with nix. b. implement some template for the page c. All the components exist but just needs to put everything together and deploy it.
  5. blockers a. No real blockers right now. SO long as building and deployment proceed smoothly.

Amy

  1. Audit domain, getting audit events are fully type safe
  2. But the RPC client can't really handle types right now unless you wrap the generic callers.
  3. Seems that the async generators @ready decorator doesn't properly work? need to look into that.
  4. Right now the getAuditEvents and getAuditEventsLongRunning are separate. I thin from an API perspective we want these to be the same command where getting ascending with no limit will extend to long-running and future events.
  5. metrics right now are implemented by iterating over the records, or in this case, counting between a range. It's not ideal for performance. I think we'd want to calculate running metrics on insertion. Something to look into.
  6. TODO a. Just need to get RPC types working with the system. b. Handle tracking of generators for if audit domain stops so it can throw. c. Also enable the long-running generators to just end with a signal but without error.
  7. blockers a. None at the moment.

Addie

  1. making progress, should be able to publish by EOD or tomorrow.
  2. Getting started on the map, can start frame-working the front-end with dummy data.
  3. TODO a. Working on and further fleshing out CSS for website. Done by today or tomorrow. b. Working on map visualisation. Should take a few hours
  4. Blockers a. Blocked by back-end API for the map data. But can move forward using dummy data.
CMCDragonkai commented 11 months ago

If @okneigres gets to AWS deployment please help him @tegefaulkes by pairing up. It can be easier since we have more experience doing nix and docker and AWS.

@amydevs switch to using online running algorithms. Chatgpt this for more information for calculating metric. It does not require running over the entire dataset all the time. See things like running/moving average. L can be streamed.

@addievo the dashboard and website is going to be part of our launch party. It's going to be big and dazzling! We want it to be billboard level.

CMCDragonkai commented 10 months ago

Monday, we get an update on the situation from https://github.com/MatrixAI/Polykey/issues/599 and https://github.com/MatrixAI/Polykey/pull/618.

I'll be drawing out the plan for the docs.

tegefaulkes commented 10 months ago

Making some notes for myself to follow up. @tegefaulkes

  1. ~find.test.ts has 1 test disabled 'fails to find an unknown node' due to problem with RPC client timeout.~ - https://github.com/MatrixAI/Polykey-CLI/issues/37
  2. Need to write a summary of what has changed with the nodes refactoring to the decentralized signalling issue.
  3. ~Remove the ID suffix from the handler names in NodeManager.~
  4. vaults.test.ts > should return the vaults names and ids of the remote vault has a intermittent failure due to the output formatter.
  5. ~we need some way to moderate number of connections to seed nodes. Currently network entry tries to connect to all seed nodes. In reality, they all should be added try the 3 closest or something~ https://github.com/MatrixAI/Polykey/issues/648
  6. ~retryAuthentication utility isn't being used properly with server streaming... I fixed this with nodes getall but there may be other places where this is broken.~ https://github.com/MatrixAI/Polykey-CLI/issues/78
  7. random failures in task domain tests in Polykey. I've also seen the tasks domain deadlocking when stopping once. Potential race condition here?
  8. ~I don't think the periodic network connections re-try is kicking in. Need to double check this. It should be doing something like a network entry lite periodically.~
  9. ~audit events for NCM were reverted during the nodes refactoring. This needs to be fixed.~
  10. ~When running a node in the foreground. We can sigInt it and it responds, but it never finishes stopping. This needs to be looked into.~
  11. ~We're getting memory leak warning about adding event listeners to event targets. An AbortSignal is causing this but there may be other instances. This needs to be looked into and fixed.~ - https://github.com/MatrixAI/MatrixAI-Graph/issues/42

I need to create issues for problems that need to be solved later.

tegefaulkes commented 10 months ago

Ok, so the current main problem with the CLI is that the docker intergration test is failing.

We've discovered that its a combination of a few factors.

  1. The bug where Polykey is deadlocking while stopping when there are active connections.
  2. Agents are connecting to the mainnet now while testing. This is happening since there are now deployed nodes on the mainnet.

So nodes while testing are connecting to the mainnet and then timing out the tests when stopping.

Moving forward we need to fix the following.

  1. Integration tests attempting connections to the network is intentional. But they should connect to the testnet instead. All CLi tests when starting an agent need to specify the testnet as the network.
  2. We need to fix the bug where the agent deadlocks while stopping. This is likely the NodeConnectionManager's fault. I need to check if it's forcing connections to close properly or if we're draining connections properly. I just checked and it's the PolykeyAgent is doing an unforced stop of the NodeConnectionmanager.
addievo commented 10 months ago
  1. Cloning is broken as of 8/12. Was working on 7/12. RPC times out when trying to clone
addie@Addie-MB-Server:~$ polykey -np /home/addie/Node2 vaults clone vault2 vcnjo8bhfqhusvcd89u12getdqqiqbvf7avfed2u71udcgdl660fg --verbose
INFO:polykey.PolykeyClient:Creating PolykeyClient
INFO:polykey.PolykeyClient.Session:Creating Session
INFO:polykey.PolykeyClient.Session:Setting session token path to /home/addie/Node2/token
INFO:polykey.PolykeyClient.Session:Starting Session
INFO:polykey.PolykeyClient.Session:Started Session
INFO:polykey.PolykeyClient.Session:Created Session
INFO:polykey.PolykeyClient:Starting PolykeyClient
INFO:polykey.PolykeyClient.WebSocketClient:Create WebSocketClient to ::1:46397
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:Start WebSocketConnection
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:Started WebSocketConnection
INFO:polykey.PolykeyClient.WebSocketClient:Created WebSocketClient to wss://[::1]:46397
INFO:polykey.PolykeyClient:Started PolykeyClient
INFO:polykey.PolykeyClient:Created PolykeyClient
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0.WebSocketStream 0:Start WebSocketStream
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0.WebSocketStream 0:Started WebSocketStream
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0.WebSocketStream 0:Destroy WebSocketStream
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0.WebSocketStream 0:Destroyed WebSocketStream
INFO:polykey.PolykeyClient:Stopping PolykeyClient
INFO:polykey.PolykeyClient.WebSocketClient:Destroy WebSocketClient on wss://[::1]:46397
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:Stop WebSocketConnection
INFO:polykey.PolykeyClient.WebSocketClient:ErrorWebSocketConnectionLocal: WebSocket Connection local error - Locally closed with code 1000
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:ErrorWebSocketConnectionLocal: WebSocket Connection local error - Locally closed with code 1000
INFO:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:Stopped WebSocketConnection
INFO:polykey.PolykeyClient.Session:Stopping Session
INFO:polykey.PolykeyClient.Session:Stopped Session
INFO:polykey.PolykeyClient:Stopped PolykeyClient
ErrorRPCTimedOut: RPC has timed out

The node has adequate permissions to clone

addie@Addie-MB-Server:~$ polykey -np /home/addie/Node2 vaults scan vcnjo8bhfqhusvcd89u12getdqqiqbvf7avfed2u71udcgdl660fg
vault2  z8WvCNBUwEsrXw9ySPXsuWH pull,clone
addievo commented 10 months ago

Another issue encountered during QA was when a new PK agent was bootstrapped and started but not allowed to start gracefully and closed before first start was completed, this resulted in the agent no longer being able to be started, even after killing the agent or trying to reboot the system, the throw was:

 polykey agent start --background
✔ Enter new password … *
✔ Confirm new password … *
ErrorPolykeyCLIAgentProcess: Agent process closed during fork

This was resolved after deleting the np and setting up a new node, this can be detrimental in prod as it renders all the vaults inaccessible

addievo commented 10 months ago

The "closed during fork" error was observed on NixOS specifically and I am unable to reproduce it now, the cloning error is occurring across platforms on NixOS, Ubuntu and MacOS. I am on commit c8607b9aff37bf2187ce638ac4480ada8999faf6.

addievo commented 10 months ago

@tegefaulkes

tegefaulkes commented 10 months ago

The problem here is that these commands are failing to make the connection to do their thing. So the connection attempt via findNode till usually take longer than the 15 second RPC timeout.

So the real problem here is that the RPC is timing out during long running commands and resulting in a very un-descriptive error. A failure to add a node should take as long as it wants and timeout with a fail to ping or more descriptive error. The same goes for any commands that make a connection.

I'm going to make an issue for this.