MatrixAI / Polykey-CLI

Polykey CLI - Open Source Decentralized Secret Sharing System for Zero Trust Workflows
https://polykey.com
GNU General Public License v3.0
6 stars 3 forks source link

Polykey-CLI: agent stop command not terminating properly #185

Open CryptoTotalWar opened 6 months ago

CryptoTotalWar commented 6 months ago

Describe the bug

Looking through the logs, looks like the hook for shutting down never triggered. The log cuts out mid line like the process was terminated. That must've been when you had to manually kill it.

This might be a mac specific problem.

To Reproduce (To-Do)

  1. ...
  2. ...
  3. Test on Linux version that same way we got the error on Mac to see if it's an OS isolated incident or not.

Expected behavior

We handle most signals to trigger stopping the agent. The only thing that should really kill it like this is a sigkill signal.

Screenshots

Platform (please complete the following information)

Additional context

Logs being mentioned can be found here: #183

Notify maintainers

@tegefaulkes

linear[bot] commented 6 months ago

ENG-322 Polykey agent stop command not terminating properly

tegefaulkes commented 5 months ago

I think I've seen this while testing recently as well. So it's a certain condition that can happen after running for a while that causes it. That makes it very hard to pin down.

tegefaulkes commented 5 months ago

I suspect that while the pkAgent.stop() is called, and the NodeConnectionManager is stopping with force. It might be possible that we are still handling incoming connections and RPC requests despite the stopping state.

So I'd check this first. Write a test that fires a lot of connections and RPC requests at a NodeConnectionManager and trigger a NodeConnectionManager.stop() with force set to true and false and just see what happens.

Right now I'm basing this off the fact that I caught it doing it in the act. I had a agent running with verbose for a while and triggered it to stop. It entered the stopping state, however it was still serving incoming connections and streams. If we enter a stopping state, forced or not, we should reject all new connections and streams that are attempted.

tegefaulkes commented 4 months ago

I tried approaching this from the outside in. Where I added a bunch of logging and debugging for testing and just waiting for the problem to happen so I can catch it in the act. This isn't working very well...

The running theory is that connection activity is causing a race condition that deadlocks the stopping procedure for networking. It's hard to say exactly what and where this is happening. So I have to work from the bottom up fixing up any potential problems that would cause it.

The usual suspects are the following libraries. This will create macrotasks and if any of them leak then we will fail to close the process after stopping. Also they could potentially deadlock when cleaning up.

  1. js-quic, Holds a bunch of macro tasks and resources. If it fails to clean up properly then it will deadlock.
  2. js-rpc, Similar but is a layer between transport and application code. Lifecycle is determined by the streams it consumes so it seems unlikely to deadlock from that. However it does maintain macrotasks in the form of timers.
  3. js-mds, Similar to js-quic in that it maintains a socket and timers.
  4. js-ws, Also maintains a socket and timers.

On top of the libraries, we have the the following domains that maintain connection lifecycles. These could potentially deadlock when cleaning up.

  1. client
  2. nodes

The main bit of evidence I have so far is...

  1. When I caught the CLI failing to end when stopping I saw it still serving a bunch of agent-agent RPC calls.
  2. It seems to happen more often with more nodes in the network. So more network traffic.
  3. Seems to trigger more often when calling agent stop to trigger stopping.

So knowing this, the most likely suspect is that endless RPC calls are preventing the connection from stopping fully and deadlocking stop. To fix this, I need to add draining state handling to the client and nodes domain, along with adding it to each of the transport libraries we have. I can't guarantee that it will fully fix the issue but at the very least it will cross that potential problem off as the cause. So moving forward.

  1. Look at the NodeConnectionManager domain and add logic that prevents accepting or creating new connections when stopping. Creating connections should already be prevented but we need to check reverse connections.
  2. The NodeConnections need to reject new streams when in the stopping state. Creating new streams should already be prevented, but we need to prevent reverse streams.
  3. Review client domain for any potential connection or stream leaks. Also add the same draining state logic that rejects new connections and streams.

From here we can work our way downwards as needed for the libraries.

  1. js-quic needs draining state support. We need to reject new streams and connections, both forward and reverse when we're stopping.
  2. js-ws needs the same treatment.
  3. js-rpc doesn't create any connections but timeout timers might leak.
  4. js-mdns I haven't worked on this directly, but it's usage of macrotasks can hold the process open. so we need to investigate it.

This is a fair amount of polish work that may not even solve the issue. But it's all stuff I've been meaning to do eventually. May as well get it done now.

CMCDragonkai commented 3 months ago

As per the comment https://github.com/MatrixAI/Polykey-CLI/issues/157#issuecomment-2294889589, I'm not a fan of the Stopping Agent STDERR message, it seems unnecessary, and I don't see how it helps in debugging anything. If the prompt is not returned to the terminal, then stopping is broken. And in every case where the agent is running automatically it should be using --verbose anyway.

In the ideal case, the agent should stop perfectly, and also no need for this 3rd case message. In the unideal case, you just have to add an extra concurrent timer function using js-timer that runs, and reports a warning that the agent cannot stop - this should result in a trace of what exactly is holding it. In the case of nodejs, remember as a JS runtime, there's only 2 things that can hold the process open - any open IO fds, or infinite loops. And if we are able to run the concurrent function, then there can be no infinite loops. Then the only thing to trace is open FDs.

Node as a runtime, does not want to stay running. In fact we previously had an issue https://github.com/MatrixAI/Polykey/issues/307 where node would sometimes just stop running, and we had no idea why. That turned out to be due to "promise deadlocks", and in fact we had an issue to try and trace promise deadlocks live, using the new async hooks api https://nodejs.org/api/async_hooks.html, we ended up not using it, but it's an important API for any concurrent trace debugging https://github.com/MatrixAI/js-logger/issues/15, which we eventually want to collect together into a diagnostics domain: https://github.com/MatrixAI/Polykey/issues/635

Whereas debugging the opposite is supposed to MUCH easier. As you can see by this SO issue: https://stackoverflow.com/questions/26057328/node-js-inspect-whats-left-in-the-event-loop-thats-preventing-the-script-from.

CMCDragonkai commented 3 months ago

So going forward:

  1. Investigate open fds for causing the process to stay open. (REMEMBER THAT NODE DOES NOT ALLOW PROMISE DEADLOCKS TO KEEP THE PROCESS OPEN!) - @tegefaulkes so I don't think investigating any of the async operations makes sense in debugging this.
  2. Get rid of that Stopping Agent message that was introduced from #157.
  3. As part of the Polykey conceptual structure/planning we should be starting on https://github.com/MatrixAI/Polykey/issues/635 to deal with these situations - I think PK has reached a level of complexity requiring dedicated diagnostics.
  4. The exit handler should be producing exit code of 0 when gracefully handled the signals like SIGINT and SIGTERM. It should be not be 130 and 143 respectively!
CMCDragonkai commented 2 months ago

If a graceful exit is halted by leaked connections. Then this is likely connected to #198 which appears to potentially be related to a remotely leaked connection that ends up causing a ungraceful exit of the agent.

CMCDragonkai commented 2 months ago
  1. Get rid of that Stopping Agent message that was introduced from https://github.com/MatrixAI/Polykey-CLI/issues/157.

This can just be replaced by raising the info level to warning level for PolykeyAgent:Stopping Agent and PolykeyAgent: Stopped Agent.

tegefaulkes commented 2 months ago

I've created an issue at #270 to track that.