CryptoTotalWar opened 6 months ago
I think I've seen this while testing recently as well. So it's a certain condition that can happen after running for a while that causes it. That makes it very hard to pin down.
I suspect that while `pkAgent.stop()` is called and the `NodeConnectionManager` is stopping with `force`, we may still be handling incoming connections and RPC requests despite the stopping state.
So I'd check this first. Write a test that fires a lot of connections and RPC requests at a `NodeConnectionManager`, triggers `NodeConnectionManager.stop()` with `force` set to both `true` and `false`, and just see what happens.
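As a sketch of the kind of stress test meant here. The `FakeConnectionManager` below is a stand-in, not Polykey's actual `NodeConnectionManager` API; it just demonstrates firing a burst of connection handlers concurrently with `stop()` and checking that nothing is admitted after the stopping state is entered:

```typescript
// Minimal stand-in for a connection manager. Handlers would be admitted
// while `stop()` is in flight unless we guard on the stopping state.
class FakeConnectionManager {
  protected stopping = false;
  public accepted = 0;
  public rejected = 0;

  public async handleConnection(): Promise<void> {
    if (this.stopping) {
      this.rejected += 1;
      throw new Error('manager is stopping');
    }
    this.accepted += 1;
    // Simulate some async stream-handling work
    await new Promise((resolve) => setTimeout(resolve, 10));
  }

  public async stop({ force = false }: { force?: boolean } = {}): Promise<void> {
    this.stopping = true;
    // A real stop would also drain (force = false) or destroy (force = true)
    // the live connections here.
  }
}

async function main(): Promise<void> {
  const manager = new FakeConnectionManager();
  // Fire a burst of connections, with a stop() injected partway through
  const attempts = Array.from({ length: 100 }, (_, i) =>
    (async () => {
      if (i === 50) {
        await manager.stop({ force: true });
        return;
      }
      try {
        await manager.handleConnection();
      } catch {
        // Rejected while stopping: this is the desired behaviour
      }
    })(),
  );
  await Promise.all(attempts);
  console.log(`accepted=${manager.accepted} rejected=${manager.rejected}`);
}

main();
```

If the real manager behaves like the buggy description above, the equivalent test would show connections still being accepted after `stop()` has been entered.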
Right now I'm basing this off the fact that I caught it in the act. I had an agent running with `--verbose` for a while and triggered it to stop. It entered the stopping state, however it was still serving incoming connections and streams. If we enter a stopping state, forced or not, we should reject all new connections and streams that are attempted.
I tried approaching this from the outside in: I added a bunch of logging and debugging for testing and just waited for the problem to happen so I could catch it in the act. This isn't working very well...
The running theory is that connection activity is causing a race condition that deadlocks the stopping procedure for networking. It's hard to say exactly what is happening and where, so I have to work from the bottom up, fixing any potential problems that could cause it.
The usual suspects are the following libraries. Each creates macrotasks, and if any of them leak then we will fail to close the process after stopping. They could also potentially deadlock when cleaning up.

- `js-quic`: holds a bunch of macrotasks and resources. If it fails to clean up properly then it will deadlock.
- `js-rpc`: similar, but is a layer between transport and application code. Its lifecycle is determined by the streams it consumes, so it seems unlikely to deadlock from that. However, it does maintain macrotasks in the form of timers.
- `js-mdns`: similar to `js-quic` in that it maintains a socket and timers.
- `js-ws`: also maintains a socket and timers.

On top of the libraries, we have the following domains that maintain connection lifecycles. These could potentially deadlock when cleaning up.

- `client`
- `nodes`
The main bit of evidence I have so far is...
`agent stop` to trigger stopping.

So knowing this, the most likely suspect is that endless RPC calls are preventing the connection from stopping fully and deadlocking stop. To fix this, I need to add draining-state handling to the `client` and `nodes` domains, along with adding it to each of the transport libraries we have. I can't guarantee that it will fully fix the issue, but at the very least it will cross that potential problem off as the cause. So moving forward:
- Go through the `NodeConnectionManager` domain and add logic that prevents accepting or creating new connections when stopping. Creating connections should already be prevented, but we need to check reverse connections.
- `NodeConnection`s need to reject new streams when in the stopping state. Creating new streams should already be prevented, but we need to prevent reverse streams.
- Go through the `client` domain for any potential connection or stream leaks. Also add the same draining-state logic that rejects new connections and streams.

From here we can work our way downwards as needed for the libraries.

- `js-quic` needs draining-state support. We need to reject new streams and connections, both forward and reverse, when we're stopping.
- `js-ws` needs the same treatment.
- `js-rpc` doesn't create any connections, but timeout timers might leak.
- `js-mdns`: I haven't worked on this directly, but its usage of macrotasks can hold the process open, so we need to investigate it.

This is a fair amount of polish work that may not even solve the issue. But it's all stuff I've been meaning to do eventually. May as well get it done now.
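The draining-state pattern described above could be factored into a small guard that every entry point checks before admitting new work. This is a hypothetical sketch; the names (`DrainingGuard`, `admitNewWork`) are illustrative, not part of Polykey or its libraries:

```typescript
// Hypothetical lifecycle guard for connection-managing domains.
// Entry points that accept new connections or streams (forward and
// reverse) call admitNewWork() first, so anything arriving after
// stop has begun is rejected instead of keeping the process alive.
type LifecycleState = 'running' | 'draining' | 'stopped';

class DrainingGuard {
  protected state: LifecycleState = 'running';

  // Called at the start of stop(): no new work, existing work may finish
  public startDraining(): void {
    if (this.state === 'running') this.state = 'draining';
  }

  // Called once all existing connections/streams have been cleaned up
  public finishStop(): void {
    this.state = 'stopped';
  }

  // Guard at every admission point: new connections and new streams
  public admitNewWork(): void {
    if (this.state !== 'running') {
      throw new Error(`cannot admit new work while ${this.state}`);
    }
  }
}
```

The point of the intermediate `draining` state is that a non-forced stop can still let in-flight streams complete, while a guard like this ensures nothing new is admitted that would extend the stopping procedure indefinitely.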
As per the comment https://github.com/MatrixAI/Polykey-CLI/issues/157#issuecomment-2294889589, I'm not a fan of the `Stopping Agent` STDERR message. It seems unnecessary, and I don't see how it helps in debugging anything. If the prompt is not returned to the terminal, then stopping is broken. And in every case where the agent is running automatically, it should be using `--verbose` anyway.
In the ideal case, the agent should stop perfectly, and there's no need for this third message. In the non-ideal case, you just have to add an extra concurrent timer function using `js-timer` that runs and reports a warning that the agent cannot stop; this should result in a trace of what exactly is holding it. In the case of Node.js, remember that as a JS runtime there are only two things that can hold the process open: any open IO fds, or infinite loops. And if we are able to run the concurrent function, then there can be no infinite loops. Then the only thing to trace is open fds.
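A minimal sketch of that watchdog idea, using a plain unref'd `setTimeout` in place of `js-timer` (the function name and shape here are assumptions, not an existing Polykey API):

```typescript
// Run stop() with a concurrent watchdog. If stop() hangs past the
// deadline, warn so we get a signal that something is holding the
// process open. The watchdog itself is unref'd so it can never be
// the thing keeping the event loop alive.
async function stopWithWatchdog(
  stop: () => Promise<void>,
  timeoutMs: number = 30_000,
): Promise<void> {
  const watchdog = setTimeout(() => {
    console.warn(`agent has not stopped after ${timeoutMs}ms`);
    // At this point one would dump whatever is still holding the
    // event loop open (open fds, sockets, timers) to get a trace.
  }, timeoutMs);
  watchdog.unref();
  try {
    await stop();
  } finally {
    clearTimeout(watchdog);
  }
}
```

Because the watchdog callback runs on the event loop, the fact that it fires at all proves there is no infinite loop blocking the process, which narrows the search to leaked fds and timers.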
Node as a runtime does not want to stay running. In fact we previously had an issue https://github.com/MatrixAI/Polykey/issues/307 where Node would sometimes just stop running, and we had no idea why. That turned out to be due to "promise deadlocks", and in fact we had an issue to try and trace promise deadlocks live using the new async hooks API https://nodejs.org/api/async_hooks.html. We ended up not using it, but it's an important API for any concurrent trace debugging https://github.com/MatrixAI/js-logger/issues/15, which we eventually want to collect together into a `diagnostics` domain: https://github.com/MatrixAI/Polykey/issues/635
Whereas debugging the opposite is supposed to be MUCH easier, as you can see in this SO issue: https://stackoverflow.com/questions/26057328/node-js-inspect-whats-left-in-the-event-loop-thats-preventing-the-script-from.
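For reference, the technique from that SO thread amounts to dumping the active handles and requests. `_getActiveHandles`/`_getActiveRequests` are undocumented Node internals, while `process.getActiveResourcesInfo()` (Node >= 17.3, experimental) is the supported variant:

```typescript
// Inspect what is keeping the Node process alive. Useful to run from
// the watchdog when stop() hangs: whatever prints here is the leak.
const proc = process as any;

// Undocumented internals, but widely used for this kind of debugging
const handles: unknown[] = proc._getActiveHandles?.() ?? [];
const requests: unknown[] = proc._getActiveRequests?.() ?? [];
console.log(
  'active handles:',
  handles.map((h: any) => h?.constructor?.name),
);
console.log('active requests:', requests.length);

// Supported (experimental) alternative on newer Node versions
if (typeof proc.getActiveResourcesInfo === 'function') {
  console.log('active resources:', proc.getActiveResourcesInfo());
}
```

Running this at the moment the agent fails to exit should name the leaked sockets or timers directly, rather than requiring us to wait and catch the race in the act.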
So going forward:
- Get rid of the `Stopping Agent` message that was introduced in #157. This can just be replaced by raising the info level to warning level for `PolykeyAgent: Stopping Agent` and `PolykeyAgent: Stopped Agent`.
- If a graceful exit is halted by leaked connections, then this is likely connected to #198, which appears to be related to a remotely leaked connection that ends up causing an ungraceful exit of the agent.
I've created an issue at #270 to track that.
Describe the bug
Looking through the logs, it looks like the hook for shutting down never triggered. The log cuts out mid-line, as if the process was terminated. That must've been when you had to manually kill it.
This might be a Mac-specific problem.
To Reproduce (To-Do)
Expected behavior
We handle most signals to trigger stopping the agent. The only thing that should really kill it like this is a SIGKILL signal.
Screenshots
Platform (please complete the following information)
Additional context
Logs being mentioned can be found here: #183
Notify maintainers
@tegefaulkes