CMCDragonkai opened this issue 3 weeks ago
This, along with #198, is definitely due to some sort of resource leak in node-to-node connections/streams.
The only way we were able to proceed was to delete the entire state of the Polykey client node and start a fresh node, which means a new NodeId too.
This issue focuses on the inter-node behaviour, which is quite critical.
However, the fact that a state reset fixed it indicates there is some corruption of the state. I'm not sure where it occurs or what would cause the ErrorRPCTimeout; we need more detail on this.
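For whoever picks this up, here is a minimal sketch of how this class of timeout typically arises, to show where instrumentation would help. The withDeadline wrapper below is hypothetical, not Polykey's actual RPC code; AbortSignal.timeout is a standard Node (>= 17.3) API.

```ts
// Hypothetical deadline wrapper; not Polykey's actual RPC code.
// Assumes the wrapped call rejects once its AbortSignal fires.
async function withDeadline<T>(
  call: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const signal = AbortSignal.timeout(timeoutMs);
  try {
    return await call(signal);
  } catch (e) {
    if (signal.aborted) {
      // An ErrorRPCTimeout-style error would surface here on the caller (N1),
      // while the remote side (N2) may still be mid-write on its stream.
      throw new Error(`RPC timed out after ${timeoutMs}ms`, { cause: e });
    }
    throw e;
  }
}
```

If the deadline fires mid-stream, the caller sees a timeout while the server still believes the connection is healthy, which would explain the pairing of errors reported below.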
Without access to the corrupted Polykey state, or another reliable way to replicate this, it is really challenging to pinpoint the cause. This will need an in-depth investigation.
Try testing it with the other team members' PK nodes; don't just do a self pull/clone. There are resource leaks in the nodes domain at the moment anyway.
You can also run different versions of PK: use the nixpkgs pin to select different versions and run them, or clone them separately.
Describe the bug
When N1 tries to clone/pull the vault, it sometimes fails, due to an unknown bug, state corruption, or something similar, with:
ErrorRPCTimeout
After a little while, the agent on N2 reports:
TypeError: Invalid state: WritableStream is closed
This then causes the entire agent to shut down. I suspect this has common factors with #115, #185, #198.
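As a point of reference for the TypeError above, here is a minimal sketch, using plain WHATWG streams in Node (>= 18) rather than Polykey's actual stream wiring, showing that any write racing with a close of the underlying WritableStream rejects with exactly this class of error:

```ts
import { WritableStream } from 'node:stream/web';

const stream = new WritableStream<Uint8Array>({
  write(_chunk) {
    // Pretend to push the chunk to the remote peer.
  },
});

const writer = stream.getWriter();
await writer.close();

// A late write (e.g. a response still streaming after the peer tore the
// connection down) rejects with a TypeError; Node reports these as
// "Invalid state: ..." errors like the one above. If nothing catches the
// rejection, it escalates to an unhandled rejection, which by default
// crashes the Node process, consistent with the whole agent shutting down.
await writer.write(new Uint8Array([1, 2, 3]));
```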
To Reproduce
["0.10.0","1.14.0","1","1"]
, but it doesn't appear that the version is the problem.["0.13.0","1.15.1","1","1"]
Expected behavior
Regardless of what is happening, I believe the network streams are not being properly garbage collected or handled. It doesn't matter if the client is broken: the agent that is serving the vault SHOULD NOT FAIL.
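To make that expectation concrete, here is a hedged sketch of the failure isolation I would expect; serveVaultStream and its parameters are hypothetical, not Polykey's real vault-serving code. Each per-client stream gets its own try/catch/finally, so a broken or timed-out client tears down only its own stream, and the error never escapes into the agent's event loop as an uncaught rejection.

```ts
// Hypothetical per-stream handler: a failing client must only tear down its
// own stream, never the agent that is serving the vault.
async function serveVaultStream(
  writable: WritableStream<Uint8Array>,
  chunks: AsyncIterable<Uint8Array>,
): Promise<void> {
  const writer = writable.getWriter();
  try {
    for await (const chunk of chunks) {
      await writer.write(chunk); // rejects if the client already closed
    }
    await writer.close();
  } catch (e) {
    // Abort only this stream and swallow secondary errors; never let the
    // rejection propagate to the process level.
    await writer.abort(e).catch(() => {});
  } finally {
    writer.releaseLock();
  }
}
```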
I'm pretty sure this is similar to #198.
The point is that something is causing ErrorRPCTimeout, and it seemed to be fixed only by a full state reset, which implies some amount of state corruption is occurring too.
Screenshots
Platform (please complete the following information)
Additional context
#198
#185
#115
Notify maintainers
@tegefaulkes @aryanjassal