Failed `vault clone` or `vault pull` on N1 causes N2 to crash

CMCDragonkai commented 3 weeks ago

Describe the bug

INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handling stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2406:da1c:67c:c000:3ed6:cc4b:b6e5:3745:1314].QUICClient.QUICConnection e33f1eb57d2cdb5bdfa6b289aa1d01fd41ed1f47.QUICStream 36:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2406:da1c:67c:c000:3ed6:cc4b:b6e5:3745:1314].QUICClient.QUICConnection e33f1eb57d2cdb5bdfa6b289aa1d01fd41ed1f47.QUICStream 36:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handled stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 45:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 45:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handling stream with method (nodesClosestActiveConnectionsGet)
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handled stream with method (nodesClosestActiveConnectionsGet)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 49:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 53:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:3593:3a22:674f:f33b:1314].QUICClient.QUICConnection e1349462cbcf277d32c1e7ea1dd04cdfbe2684dd.QUICStream 53:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handling stream with method (nodesClosestLocalNodesGet)
TypeError: Invalid state: WritableStream is closed

When N1 tries to clone or pull the vault, it sometimes fails with an ErrorRPCTimeout, possibly due to an unknown bug or state corruption.

After a little bit of time, the agent on N2 reports: TypeError: Invalid state: WritableStream is closed.

This then causes the entire agent to shut down. I suspect this has common factors with #115, #185, #198.

To Reproduce

This was done with @CDeltakai. His version was ["0.10.0","1.14.0","1","1"], but the version does not appear to be the problem.
My agent was running from staging at ["0.13.0","1.15.1","1","1"].
Run a pull/clone of a vault.

Expected behavior

Regardless of what is happening, I believe the network streams are not being properly garbage collected or handled. It doesn't matter if the client is broken. The agent that is serving the vault SHOULD NOT FAIL.

I'm pretty sure this is similar to #198.

The point is that something is causing ErrorRPCTimeout, and it seemed to be fixed only by a full state reset. This implies that some amount of state corruption is occurring too.
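
To illustrate the expectation that the serving agent should not fail, here is a minimal sketch of treating a write to an already-closed stream as a per-stream failure instead of letting the rejection escape. It uses only the standard Web Streams API. `safeWrite` and `demo` are hypothetical names for illustration and are not part of Polykey's code, and this does not claim to be where the actual bug is.

```typescript
// Hypothetical helper for illustration; not Polykey's actual API.
// Writes a chunk and reports failure instead of letting the rejection escape.
async function safeWrite<T>(
  writer: WritableStreamDefaultWriter<T>,
  chunk: T,
): Promise<boolean> {
  try {
    await writer.write(chunk);
    return true;
  } catch {
    // Writing after the stream has closed rejects with a TypeError.
    // Treat this as a failure of this one stream only, so the caller can
    // clean it up while the rest of the agent keeps running.
    return false;
  }
}

// Small demonstration: one write before close, one write after close.
async function demo(): Promise<{ first: boolean; second: boolean; received: string[] }> {
  const received: string[] = [];
  const stream = new WritableStream<string>({
    write(chunk) {
      received.push(chunk);
    },
  });
  const writer = stream.getWriter();
  const first = await safeWrite(writer, 'pack data');
  await writer.close();
  // The peer is gone; this write must not bring the process down.
  const second = await safeWrite(writer, 'late data');
  return { first, second, received };
}
```

If a rejected `write()` promise is never awaited or caught, Node.js treats it as an unhandled rejection, which by default terminates the process. That would be consistent with the whole agent shutting down, but it is an assumption and not a confirmed diagnosis.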

Additional context

#198
#185
#115

Notify maintainers

@tegefaulkes @aryanjassal

linear[bot] commented 3 weeks ago

ENG-457 Failed Vault Clone/Pull on N1 causes N2 to crash with `TypeError: Invalid state: WritableStream is closed`

CMCDragonkai commented 3 weeks ago

This along with #198 is definitely due to some sort of resource leak coming out of node to node connections/streams.
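
As a rough sketch of what guarding against this kind of leak could look like, the registry below tracks open stream writers and drops each one as soon as it closes or errors, so nothing lingers after a failed connection. `StreamRegistry` is a hypothetical name used for illustration; it is not how the nodes domain is actually structured, and the real leak may be elsewhere.

```typescript
// Hypothetical registry for illustration; not Polykey's actual implementation.
class StreamRegistry {
  protected active = new Set<WritableStreamDefaultWriter<any>>();

  // Track a writer and release it automatically once its stream settles.
  public track(writer: WritableStreamDefaultWriter<any>): void {
    this.active.add(writer);
    const release = () => {
      this.active.delete(writer);
    };
    // `closed` resolves on a clean close and rejects on error or abort;
    // release the entry in both cases.
    writer.closed.then(release, release);
  }

  public get size(): number {
    return this.active.size;
  }

  // Abort every tracked stream, e.g. when a node connection is torn down.
  public async destroyAll(reason: Error): Promise<void> {
    const writers = Array.from(this.active);
    for (const writer of writers) {
      try {
        await writer.abort(reason);
      } catch {
        // Ignore failures to abort; the stream is unusable either way.
      }
      // Wait for the stream to settle so the release callback has run.
      await writer.closed.catch(() => {});
    }
  }
}
```

With something like this, a connection teardown can call `destroyAll` and then check that `size` is zero, which gives a simple signal when streams are being left behind.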

CMCDragonkai commented 3 weeks ago

The only way we were able to proceed was to delete the entire state of the Polykey client node and start a new node, which also means a new NodeId.

CMCDragonkai commented 3 weeks ago

This issue is really focused on the inter-node behaviour, which is quite critical.

However, the state reset indicates that there is some corruption of the state. I'm not sure where it is or what would cause the ErrorRPCTimeout; we need more detail on this.

aryanjassal commented 4 days ago

Without access to the corrupted Polykey state or a reliable way to reproduce this issue, it is really challenging to pinpoint the cause. This will need an in-depth investigation.

CMCDragonkai commented 4 days ago

Try testing it with the other team members' PK. Don't just do a self pull/clone. There are resource leaks in the nodes domain atm anyway.

CMCDragonkai commented 4 days ago

Also, you can always run different versions of PK: use the nixpkgs pin to get different versions and run them, or clone them separately.

MatrixAI / Polykey-CLI