MatrixAI / Polykey-CLI

Polykey CLI - Open Source Decentralized Secret Sharing System for Zero Trust Workflows
https://polykey.com
GNU General Public License v3.0

Failed Vault Clone/Pull on N1 causes N2 to crash with `TypeError: Invalid state: WritableStream is closed` #324

Open CMCDragonkai opened 1 week ago

CMCDragonkai commented 1 week ago

Describe the bug

INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handling stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2406:da1c:67c:c000:3ed6:cc4b:b6e5:3745:1314].QUICClient.QUICConnection e33f1eb57d2cdb5bdfa6b289aa1d01fd41ed1f47.QUICStream 36:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2406:da1c:67c:c000:3ed6:cc4b:b6e5:3745:1314].QUICClient.QUICConnection e33f1eb57d2cdb5bdfa6b289aa1d01fd41ed1f47.QUICStream 36:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handled stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 45:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 45:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handling stream with method (nodesClosestActiveConnectionsGet)
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handled stream with method (nodesClosestActiveConnectionsGet)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 53:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 53:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handling stream with method (nodesClosestLocalNodesGet)
TypeError: Invalid state: WritableStream is closed

When N1 tries to clone/pull the vault, it sometimes fails with an ErrorRPCTimeout, possibly due to an unknown bug or state corruption.

After a little bit of time, the agent on N2 reports: TypeError: Invalid state: WritableStream is closed.

This then causes the entire agent to shut down. I suspect this has common factors with #115, #185, #198.
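For reference, a minimal sketch of how this class of error can arise, independent of Polykey. Under the WHATWG Streams specification, writing to a `WritableStream` writer after the stream has been closed rejects with a `TypeError`, and as far as I know Node.js reports it with the message in the title. If such a rejection is never handled, Node.js terminates the process by default. Whether this is the actual path on N2 is an assumption, and `writeAfterClose` is a hypothetical name used only for illustration.

```typescript
// Hypothetical reproduction, not Polykey code: write to a web stream
// after it has been closed and report what the write rejects with.
async function writeAfterClose(): Promise<unknown> {
  // A sink that discards everything; only the stream state matters here.
  const stream = new WritableStream<Uint8Array>({
    write() {},
  });
  const writer = stream.getWriter();
  await writer.close();
  try {
    // The stream is now closed, so this write is expected to reject.
    await writer.write(new Uint8Array([1]));
    return undefined;
  } catch (e) {
    // Returned rather than rethrown so the caller can inspect it.
    return e;
  }
}
```

If the `try`/`catch` were removed and the promise left unawaited, the same rejection would surface as an unhandled rejection, which would be consistent with the whole agent going down.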

To Reproduce

  1. This was done with @CDeltakai, whose version was ["0.10.0","1.14.0","1","1"], but the version does not appear to be the problem.
  2. My agent was running from staging: ["0.13.0","1.15.1","1","1"].
  3. Run a pull/clone of a vault.

Expected behavior

Regardless of what is happening, I believe the network streams are not being properly garbage collected or handled. It doesn't matter if the client is broken. The agent that is serving the vault SHOULD NOT FAIL.
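To make the expectation concrete, here is a rough sketch of a serving-side handler that contains a per-stream failure instead of letting it escape. This is not how Polykey's RPCServer is implemented. `serveChunks` and its result shape are hypothetical, and the sketch only uses the standard web streams API.

```typescript
// Hypothetical handler, not Polykey's RPCServer: send chunks to one peer
// stream and report failure as a value instead of throwing.
async function serveChunks(
  writable: WritableStream<Uint8Array>,
  chunks: Uint8Array[],
): Promise<{ ok: boolean; error?: unknown }> {
  const writer = writable.getWriter();
  try {
    for (const chunk of chunks) {
      await writer.write(chunk);
    }
    await writer.close();
    return { ok: true };
  } catch (error) {
    // The peer may already have closed or errored the stream.
    // Only this stream is affected; the caller decides what to log.
    return { ok: false, error };
  } finally {
    // Always release the lock so this handler does not keep holding the stream.
    writer.releaseLock();
  }
}
```

With this shape, a client that times out and tears down its side results in `{ ok: false, error }` for that one stream, and the serving agent keeps running.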

I'm pretty sure this is similar to #198.

The point is that something is causing ErrorRPCTimeout, and it only seemed to be fixed by a full state reset, which implies that some state corruption is occurring too.

Notify maintainers

@tegefaulkes @aryanjassal

linear[bot] commented 1 week ago

ENG-457 Failed Vault Clone/Pull on N1 causes N2 to crash with `TypeError: Invalid state: WritableStream is closed`

CMCDragonkai commented 1 week ago

This, along with #198, is definitely due to some sort of resource leak in node-to-node connections/streams.

CMCDragonkai commented 1 week ago

The only way we were able to proceed was to delete the entire state of the Polykey client node and start a new node, which means a new NodeId too.

CMCDragonkai commented 1 week ago

This issue focuses on the inter-node behaviour, which is quite critical.

However, the state reset indicates that there is some corruption of the state. It is not clear where it is or what would cause the ErrorRPCTimeout, so we need more detail on this.