CMCDragonkai closed this 7 months ago.
Here's a basic script I used for the demonstration.
./polykey agent start
Enter the password if it's the first time starting.
./polykey agent status
./polykey vaults create testvault
./polykey vaults list
echo somesecret > secret
./polykey secrets create ./secret testvault:secret
./polykey secrets list testvault
./polykey secrets get testvault:secret
echo newsecret > secret
./polykey secrets get testvault:secret
./polykey nodes connections
./polykey nodes getall
./polykey nodes ping <NODEID>
This will find the node within the wider network as well as looking for it in the local network if multicast is enabled.
./polykey agent start -np ./nodeb
./polykey agent status -np ./nodeb
./polykey agent status
./polykey nodes ping <NODEID>
./polykey identities discover <NodeIdB>
./polykey identities discover -np ./nodeb <NodeIdA>
./polykey identities trust -np ./nodeb <NodeIdA>
./polykey identities trust <NodeIdB>
./polykey vaults share testvault <NodeIdB>
./polykey vaults clone -np ./nodeb testvault <NodeIdA>
./polykey vaults list -np ./nodeb
./polykey secrets list testvault -np ./nodeb
./polykey secrets get testvault:secret -np ./nodeb
We cannot abide by the default max listeners limit (see https://nodejs.org/api/events.html#nodeeventtargetsetmaxlistenersn). We may have legitimate reasons for having lots of event listeners on our event targets.
However we also need to keep track of this count internally. This can be part of our diagnostics domain to avoid internal memory and resource leaks by maintaining counters of all the resources being created.
We may be able to make use of some async monitors for this. It needs to be fast and may be runtime dependent. One thing I'd like to start with is tracking floating promises.
A quick solution right now is to raise the limit a little further. What would it need to be to avoid the warnings as of the current state? Are there situations where the deepest callstack may give us the largest possible number of event listeners?
Also, I'm interpreting this as being due to our js-contexts usage, right? Are there no other places where we are dynamically getting lots of event listeners? And could this be a problem during a large recursion into the same contextified function?
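To make the idea of internal counters a bit more concrete, here is a minimal sketch of a diagnostics-style counter that tracks live and peak counts per resource kind. The `ResourceCounter` class and the `abortListener` label are hypothetical names made up for illustration; nothing here is an existing Polykey or js-contexts API. The peak value is the sort of number that could answer "what would the limit need to be" for a given workload.

```typescript
// Hypothetical diagnostics helper; none of these names exist in Polykey.
class ResourceCounter {
  protected counts: Map<string, number> = new Map();
  protected peaks: Map<string, number> = new Map();

  /**
   * Record that one resource of `kind` was created.
   * Returns a release function to call when the resource is cleaned up.
   */
  public acquire(kind: string): () => void {
    const current = (this.counts.get(kind) ?? 0) + 1;
    this.counts.set(kind, current);
    this.peaks.set(kind, Math.max(this.peaks.get(kind) ?? 0, current));
    let released = false;
    return () => {
      // Releasing twice must not undercount
      if (released) return;
      released = true;
      this.counts.set(kind, (this.counts.get(kind) ?? 1) - 1);
    };
  }

  /** Number of resources of `kind` currently alive. */
  public live(kind: string): number {
    return this.counts.get(kind) ?? 0;
  }

  /** Highest number of simultaneously live resources of `kind` seen so far. */
  public peak(kind: string): number {
    return this.peaks.get(kind) ?? 0;
  }
}

// Usage: pretend three listeners are registered and one is later removed.
const counter = new ResourceCounter();
const releases = [1, 2, 3].map(() => counter.acquire('abortListener'));
releases[0]();
releases[0](); // second call is a no-op
```

After this runs, `counter.live('abortListener')` is 2 and `counter.peak('abortListener')` is 3. A live count that keeps growing would point to a leak, while a high but stable peak would point to a legitimate need for a larger limit.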
I have a new issue that ties into this problem at https://github.com/MatrixAI/js-rpc/issues.
Going through the open issues I've picked out the following that still seem relevant here.
`findByMDNS`? It would solve some timeout nuance. Something to discuss.
And this https://github.com/MatrixAI/Polykey-Docs/pull/33.
@amydevs to review this tomorrow.
Current priorities:
These are the main things to finish off for an online launch before focusing on bugs and features in PK's core.
These 2 should be doable by Christmas.
Things are mostly working with the docker integration tests now except for 1 problem.
One of the tests is failing due to a CLI call being made to an agent failing. It's failing with...
{"level":"ERROR","keys":"polykey.PolykeyClient.WebSocketClient","msg":"ErrorWebSocketConnectionLocal: WebSocket Connection local error - WebSocket could not open due to internal error"}
{"level":"ERROR","keys":"polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0","msg":"ErrorWebSocketConnectionLocal: WebSocket Connection local error - WebSocket could not open due to internal error"}
{"type":"ErrorWebSocketConnectionLocal","data":{"message":"WebSocket could not open due to internal error","timestamp":"2023-12-20T03:55:11.025Z","data":{"errorCode":1011,"reason":"WebSocket could not open due to internal error"},"cause":{"errno":-111,"code":"ECONNREFUSED","syscall":"connect","address":"127.0.0.1","port":36569},"stack":"ErrorWebSocketConnectionLocal: WebSocket could not open due to internal error\n at WebSocket.openErrorHandler (/builds/MatrixAI/open-source/Polykey-CLI/node_modules/@matrixai/ws/src/WebSocketConnection.ts:693:16)\n at Object.onceWrapper (node:events:629:26)\n at WebSocket.emit (node:events:514:28)\n at emitErrorAndClose (/builds/MatrixAI/open-source/Polykey-CLI/node_modules/ws/lib/websocket.js:1016:13)\n at ClientRequest.<anonymous> (/builds/MatrixAI/open-source/Polykey-CLI/node_modules/ws/lib/websocket.js:864:5)\n at ClientRequest.emit (node:events:514:28)\n at TLSSocket.socketErrorListener (node:_http_client:501:9)\n at TLSSocket.emit (node:events:514:28)\n at emitErrorNT (node:internal/streams/destroy:151:8)\n at emitErrorCloseNT (node:internal/streams/destroy:116:3)\n at processTicksAndRejections (node:internal/process/task_queues:82:21)"}}
Specifically the `ECONNREFUSED` when making the connection. The agent is running fine here, and it works locally, so I can only assume it's some weirdness with the DIND networking setup in the CI job.
I'll do a little more digging but this isn't the priority right now. So I may have to shortcut the tests for now. I can see that the connections are being made and the agent isn't crashing so that'll have to be enough for now.
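For digging through failures like this in CI output, a small helper that pulls the underlying system error out of the structured error can save some squinting. The sketch below assumes the JSON shape seen in the log above, where the system error sits under a `cause` key nested inside `data`. The `findSystemCause` function is a hypothetical helper, not part of Polykey, and the sample line is an abbreviated copy of the logged error with the stack trace omitted.

```typescript
// Hypothetical helper for inspecting structured error output; not a Polykey API.
type SystemCause = {
  code: string;
  syscall: string;
  address: string | undefined;
  port: number | undefined;
};

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

/** Walk nested `cause` and `data` keys until a system-level error is found. */
function findSystemCause(value: unknown): SystemCause | undefined {
  if (!isRecord(value)) return undefined;
  const code = value['code'];
  const syscall = value['syscall'];
  if (typeof code === 'string' && typeof syscall === 'string') {
    const address = value['address'];
    const port = value['port'];
    return {
      code,
      syscall,
      address: typeof address === 'string' ? address : undefined,
      port: typeof port === 'number' ? port : undefined,
    };
  }
  for (const key of ['cause', 'data']) {
    const found = findSystemCause(value[key]);
    if (found !== undefined) return found;
  }
  return undefined;
}

// Abbreviated copy of the error logged above (stack trace omitted).
const logLine =
  '{"type":"ErrorWebSocketConnectionLocal","data":{"message":"WebSocket could not open due to internal error","data":{"errorCode":1011},"cause":{"errno":-111,"code":"ECONNREFUSED","syscall":"connect","address":"127.0.0.1","port":36569}}}';
const systemCause = findSystemCause(JSON.parse(logLine));
```

Here `systemCause` ends up with `code` set to `ECONNREFUSED`, `address` set to `127.0.0.1` and `port` set to `36569`, which confirms that the client was refused on the loopback address rather than failing somewhere in the TLS or WebSocket handshake.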
I'm adding https://github.com/MatrixAI/Polykey-CLI/issues/90 to this to track all the changes needed for the CI integration testing
In no particular order, just picking out things to do.
I need to make an issue about tracking down the potential `EventTarget`/`AbortSignal` handler leak. Right now we get warnings about it but I don't think it's a major issue.
So I've been looking into why vaults and adding secrets are slow. From what I can tell the problem stems from two parts.
I think we need to add some more benchmarking in EFS to narrow down the performance bottlenecks and try to optimise it.
New issue for the `EventTarget` leak at https://github.com/MatrixAI/js-quic/issues/80.
I've found the problem, just needs some discussion.
I noticed that the `polykey` ECR image is still just `polykey` and its name hasn't changed to `polykey-cli`. Like in:
registry_image='015248367786.dkr.ecr.ap-southeast-2.amazonaws.com/polykey'
@amydevs it should be changed to `polykey-cli` for our ECR images since `polykey` is going to be reserved for the library stuff.
This would align with the release output names given that we produce `polykey-cli` in:
I've also removed the image publishing documentation from `README.md` because it's no longer relevant now; it's being moved to `Polykey-Docs` under the development guide.
@brynblack you might want to keep awareness about this.
Also I've regenerated the docs; it hadn't been done since this repo was first created. We would probably want to generate the docs automatically in our CI/CD instead of keeping them in our source, to avoid manual creation. It also reveals useful information like broken links.
Newly deployed docs look a lot better:
@brynblack feel free to give feedback as to whether auto-generated docs are useful at all. I don't really read them myself since I just go through the source code.
@tegefaulkes can you add the relevant issues into this epic via zenhub? And then update whether they are in-progress or otherwise. Keep this epic organised.
The conclusion of this issue should update the version of `polykey-cli` to `0.X`. So right now we are on `0.1.Y`. So it should be whatever the latest `0.1.Y` we end up with.
I reckon the next version after that will be `1.0.0`.
> @tegefaulkes can you add the relevant issues into this epic via zenhub? And then update whether they are in-progress or otherwise. Keep this epic organised.
I've been doing this already. I've been adding the new ones I've created and updated the status for the issues I've been working on. I haven't touched the status for others however.
I've noticed that the `integration:prerelease` job doesn't run normally. Therefore when we make a release tag, an image is pushed to the GitLab container registry, but it doesn't update the `testnet` tag. This is confusing because the `mainnet` tag ends up being "later" than the `testnet` tag.
https://gitlab.com/MatrixAI/open-source/Polykey-CLI/container_registry
In my infrastructure planning, I'm making a diagram of how our distribution infrastructure works.
Some feedback on the error reporting concept: https://github.com/MatrixAI/js-rpc/issues/57#issuecomment-1913884490.
I think we should swap the cause chain.
Going over the videos and discord comments I've compiled a shortlist of things to address.
- ~`Polykey Desktop`. It should be either just `Polykey` or `Polykey CLI` for the CLI specifically.~ https://github.com/MatrixAI/Polykey-CLI/issues/103
- ~`vercel/pkg` for https://github.com/yao-pkg/pkg~ https://github.com/MatrixAI/Polykey-CLI/issues/111
I'll be creating new issues or adding to existing issues to address these.
We get the following warning when running the CLI
(node:610526) [DEP0112] DeprecationWarning: Socket.prototype._handle is deprecated
(Use `node --trace-deprecation ...` to show where the warning was created)
@amydevs mentioned it's due to MDNS. I think we need to look into removing the warning. I don't see any issues about this however so maybe we should make a new one and look into it?
It seems that MDNS is getting the handle (fd?) of the socket to modify it. I can see us needing direct access to the socket in native code for `js-quic` as well, for a performance upgrade. So solving this can be applied there as well.
even shorter list:
low low priority
@amy pick from these and address them
The `Secrets env` command is completed now. I'll be moving on to working on the short short list OR fixing and re-enabling the Windows and Mac CI test jobs.
PR MatrixAI/Polykey-Docs#56 should be ready now.
I've noticed that the test `NodeManager › with peers in network › findNode by signalled connections › handles offline nodes` fails randomly and more often than I'd like. We need to look into it.
We're getting close to completing the main stuff. @amydevs is looking into the issues with discovery we found while testing. I'm thinking I should look into making `polykey` more tolerant of network failures and changes. This can tie into https://github.com/MatrixAI/Polykey/issues/461 as well.
I've already made some fixes to `js-quic` to handle network problems but it's far from comprehensive. `Polykey` itself needs to handle the `QUICServer` going down for network reasons. It should even support the network being unavailable for extended periods.
This ties in nicely with https://github.com/MatrixAI/Polykey/issues/461 since we'd need to support not having networking running for extended periods of time, but also support dynamically starting or stopping the `QUICServer` whenever we want, either in response to the user requesting it or having to restart it when the network changes or comes back up. Also, being able to toggle the `QUICServer` at will can be seen as a kind of stealth mode.
I just did some testing with `Polykey-CLI`. It seems things are tolerant of this already, so the changes I made before seem more than fine for now. I tested this by starting a node and just turning off wifi. It did the expected thing and timed out after the timeout period. Re-enabling the wifi and pinging a seed node caused it to re-connect to the Polykey network just fine, as expected. The process did not crash at any point while doing this. Previous crashing due to network issues may have been fixed when I fixed a race condition relating to this just recently.
That said, there may be corner cases where the socket could just fail for one reason or another. The question is how tolerant do we want to be about this? How does this handle switching between wifi networks or interfaces? If we have two active interfaces and one starts to fail then do we seamlessly switch, or are we just tolerant of that? I suppose all these questions will bear out with more rigorous user testing.
So based on that quick test I think as things are we're fine and tolerant of network failures, in that the process will not crash in these cases. I don't think there's anything to address about that right now but we should keep it in the back of our mind with further testing.
Priorities going forward..
Right now there are problems with discovery
Reviewing the repos I found the following old issues that relate to these problems.
Looks like all points are addressed across this. Some double up. Should we create a single PR addressing all of these? Seems like a decent re-work/update of discovery. All of these are on `Polykey` except for point 2 which is both front-end and back-end.
There are 2 performance issues.
We still need to finish up the following active issues
Moving forward we want to focus on any UI/UX issues and bugs we encounter during user testing.
I'm going to make a new epic for tracking discovery problems.
@pablo.padillo this is the main important engineering issue atm.
Status update:
- Remaining subissues here are not part of CLI beta 30, they can be moved out to be worked on in the backlog.
- The main bug discovered during your testing with externals and internals in relation to having double claims should be fixed. Pending that this is all part of a released PK CLI executable - if it is not, it requires a 0.3.1 release.
- If so, then you must re-attempt another demo run - but this time without bothering with removing your old claims. It's fine to have multiple claims to the same identity from multiple nodes.
  I1
 /  \
N1  N2
Even if N1 is destroyed, from the perspective of I1 and N2, the entire gestalt is all 3 vertexes.
Therefore if N3 and I2 exist:
I2
|
N3
And I1...I2 were connected in some way.
Then N3 should be able to discover I2 and I1, and also discover N1 and N2.
Thus N3 should be able to connect to N2 - either via manual discovery or automatic discovery.
The work that @brian.botha and @amydevs are on right now is all UX-oriented features that help automate or reduce the amount of work for the entire demo process.
This includes:
- N3 being able to share with ANY vertex of I1, so it can share to I1 or N2 or N1, and it achieves the same thing. So that N1 and N2 can both pull/clone the vault. This reduces the demo process steps. - ENG-132
- Background discovery should be automatic between the I2 gestalt and the I1 gestalt, so that N3 should not need to manually discover N2/I1. Background discovery should be interpretable and predictable to the first-order - the immediate neighbourhood. - ISSUE?
- Being able to have notifications be completely delay tolerant. - ENG-37 Backgrounding of Notifications Domain Polykey#695
These 3 things may come after version 0.3.1.
So upon release of 0.3.1, remove those other remaining subissues, as I consider the beta release to be done. @pablo.padillo should be redoing the demo with the understanding that more than 1 claim is fine to work with. Subsequent releases of PK should be addressing the 3 items posted above.
https://linear.app/matrix-ai/issue/ENG-298/follow-permission-and-social-links-during-discovery has been created to track point 2. above.
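The gestalt example above (I1 with N1 and N2, I2 with N3, and I1 linked to I2) boils down to graph reachability. The following is a minimal sketch of that reasoning using a breadth-first traversal. The `GestaltGraph` class and its methods are hypothetical and only model the example; this is not how Polykey's discovery or gestalt graph is actually implemented, and it ignores permissions, claims verification and networking entirely.

```typescript
// Hypothetical model of the example above; not Polykey's actual gestalt graph.
type VertexId = string;

class GestaltGraph {
  protected adjacency: Map<VertexId, Set<VertexId>> = new Map();

  /** Record an undirected link, such as a claim between a node and an identity. */
  public link(a: VertexId, b: VertexId): void {
    this.neighbours(a).add(b);
    this.neighbours(b).add(a);
  }

  /** Breadth-first traversal returning every vertex reachable from `start`. */
  public discover(start: VertexId): Set<VertexId> {
    const visited = new Set<VertexId>([start]);
    const queue: VertexId[] = [start];
    while (queue.length > 0) {
      const current = queue.shift() as VertexId;
      this.neighbours(current).forEach((next) => {
        if (!visited.has(next)) {
          visited.add(next);
          queue.push(next);
        }
      });
    }
    return visited;
  }

  /** Get (or lazily create) the neighbour set for a vertex. */
  protected neighbours(id: VertexId): Set<VertexId> {
    let set = this.adjacency.get(id);
    if (set === undefined) {
      set = new Set<VertexId>();
      this.adjacency.set(id, set);
    }
    return set;
  }
}

// Build the example: I1 claims N1 and N2, I2 claims N3, and I1...I2 are connected.
const graph = new GestaltGraph();
graph.link('I1', 'N1');
graph.link('I1', 'N2');
graph.link('I2', 'N3');
graph.link('I1', 'I2');

const discovered = Array.from(graph.discover('N3')).sort();
// discovered is ['I1', 'I2', 'N1', 'N2', 'N3']
```

Note that N1 stays in the result even if the N1 node itself is gone, because in this model the vertex is only known through the recorded links, which matches the point that the gestalt is still all 3 vertexes from the perspective of I1 and N2.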
@CryptoTotalWar I think upon closing this issue, we should be planning for a hard launch. You should create a new issue for that and associate subissues for a hard launch. I'm not sure if it should go into Polykey - but we could associate issues to the MatrixAI-Graph for any general work.
NOTES Rough Draft WIP. Will clean up.
Critical Issues
https://github.com/MatrixAI/Polykey/issues/592 - Pretty major problem, results in nodes randomly crashing. "General fixes for connection stability" - important from a network stability standpoint. We don't want the program or network to fail on you.
https://github.com/MatrixAI/Polykey/issues/597 - An oddity that doesn't really break things, but should be looked into. "Was producing multiple node certificates when it should have only been producing one" - minor bug, don't include.
https://github.com/MatrixAI/js-rpc/pull/47 https://github.com/MatrixAI/Polykey/issues/588 https://github.com/MatrixAI/Polykey/pull/589 - Critical from a usability standpoint. Otherwise not strictly broken. "Increasing the authentication timeout window for PK Identities Authenticate" - important from a usability standpoint.
https://github.com/MatrixAI/Polykey-CLI/issues/44 - Critical from a usability standpoint. "General PK CLI polish and fixes" - pretty major (I will say this is a feature enhancement from a usability standpoint).
https://github.com/MatrixAI/js-timer/issues/15 - Macro task resource leak. "Just a fix where the timers are not being cleaned up properly... a resource leak where they were lingering in memory when they should have been cleaned up" - major. Resource leaks are always a problem, and the memory would pile up and crash the node, so an important fix.
Needed but functional without
https://github.com/MatrixAI/Polykey/issues/605 - Not a super big issue. Right now it means stricter NATs can't be punched through. I have an idea for a fix. Not strictly a major error, just something we did not want to be broken. Important for reliability of usage in different network situations.
There was a fix where we weren't taking into account an OLD node ID when authenticating a connection. So when a node renewed its ID (renewed its key) you should still be able to authenticate a connection with an old node ID, but we weren't handling that properly. https://github.com/MatrixAI/Polykey/issues/593 - VERY MINOR
"Small fixes to the timeout middleware" https://github.com/MatrixAI/js-rpc/issues/19 https://github.com/MatrixAI/js-rpc/pull/42 - Not important
A lot of this was just fixing up bugs.
Major Feature Enhancement
"We overhauled how we discovered nodes in the network and made connections between each other with NAT hole punching in a more decentralized way, whereas before it was more centralized where all hole punching went through seed nodes, but in this case any node could be used to hole punch and discover other nodes." https://github.com/MatrixAI/Polykey/pull/618
"We did add the Polykey ENV command which allows injecting secrets into the environment when running a command." This simulates env files in a more secure, encrypted manner. https://github.com/MatrixAI/Polykey-CLI/issues/31
Amy did a lot of work related to standardizing the output formats for the CLI. https://github.com/MatrixAI/Polykey-CLI/issues/22
Features WIP
Audit domain - https://github.com/MatrixAI/Polykey-CLI/issues/177
Discovery feedback - to inspect the discovery queue but also discovery events as they happen when triggering discovery - https://github.com/MatrixAI/Polykey-CLI/issues/162
0.3.1 was released a few days ago. I'm going to resolve this issue now.
Connection Stability: Implemented general fixes to enhance network stability, addressing critical issues where nodes could crash unexpectedly. This fix is crucial for ensuring reliability. Issue MatrixAI/Polykey#592 - Important
Authentication Enhancements: Increased the Authentication Timeout window for PK Identities Authenticate to improve user experience during the login process. PR MatrixAI/js-rpc#47, Issue MatrixAI/Polykey#588, PR MatrixAI/Polykey#589 - Important
CLI Usability: Enhancements and polish applied to the Polykey CLI to improve usability. Issue MatrixAI/Polykey-CLI#44 - Important
Resource Leak Fix: Addressed a significant issue where timers were not being properly cleaned up, leading to potential memory buildup and crashes. Issue MatrixAI/js-timer#15 - Important
NAT Hole Punching: Refined our NAT hole punching mechanisms to support more symmetric network configurations, crucial for the reliability of node communications. Issue MatrixAI/Polykey#605 - Important
Node Discovery Overhaul: Decentralized the node discovery process to enhance network connectivity and reduce reliance on seed nodes. PR MatrixAI/Polykey#618 - Major Feature Enhancement
Environment Variable Command: Added the Polykey ENV command for secure injection of secrets into the environment, simulating env files in a more secure manner. Issue MatrixAI/Polykey-CLI#31 - Major Feature Enhancement
CLI Output Standardization: Standardized the output formats of the CLI, improving interface consistency across commands. Issue MatrixAI/Polykey-CLI#22 - Important
Audit Domain: Developing an audit domain for better monitoring and tracking of network activities. Issue MatrixAI/Polykey-CLI#177 - WIP
Discovery Feedback: Enhancing feedback mechanisms for node discovery to allow real-time insights into the discovery process. Issue MatrixAI/Polykey-CLI#162 - WIP
This summary will guide the blog post update regarding the CLI enhancements and fixes since the beta release. Please review and confirm the inclusion of these points. - @tegefaulkes
I think that should go to your blog issue? This issue is closed. @CryptoTotalWar
Specification
This epic focuses on the CLI beta launch targeting November 10 2023.
This follows from the 6th testnet deployment https://github.com/MatrixAI/Polykey/issues/551 and focuses on any UX issues that should be resolved before we launch. Plus any documentation related things and metrics for tracking how the launch went, as well as content we need to write to prepare for it.
We should also get our demo video fully polished as well.
One of the things we need to do is:
- `audit` or `logs` domain to PK at first to track information about the network, we will need this as part of our metrics.
- `testnet.polykey.com` and `mainnet.polykey.com` so we can see what's going on and show everybody how the network is building up.
We are currently working through this list of issues: https://github.com/MatrixAI/Polykey-CLI/issues/40#issuecomment-1853222990