fedimint / fedimint

Federated E-Cash Mint
https://fedimint.org/
MIT License
568 stars 219 forks source link

Does not terminate (hangs) on panic #1418

Closed dpc closed 1 year ago

dpc commented 1 year ago

On error I've spotted in https://github.com/fedimint/fedimint/issues/1417#issue-1549749091 the fedimint killed all the peer connections, but then did not stop, preventing systemd from starting it again.

dpc commented 1 year ago

I'd say - we should probably add a a small task polling on the is_shutting down flag, and if set to true, print that shutdown was detect, wait 30 seconds, print that timeout was reached and std::env::exit(-3) or something to force process termination, unless certain other flag is set (from join_all after it finished).

(Obviously this task has to be outside of the task group itself, just a one off only for this purpose).

This should take care of all but most weird bugs that would cause a hang.

elsirion commented 1 year ago

Yes, and maybe print the running tasks/threads every second during that time. That way we can debug which task is misbehaving. We should also try to name our tasks/threads for this purpose.

elsirion commented 1 year ago

Is the TaskGroup supposed to shut down at all right now if the main task panics? I don't see it implemented anywhere. So I think we need to do two things:

  1. Have a panic handler in main or implement Drop for TaskGroup
  2. Limit the time shutdown can take

Some debugging improvements:

  1. Parallelize shutdown of tasks, so we see which ones hang
  2. Log which ones are shutting down and periodically which ones are still running

Does this sound good @dpc?

dpc commented 1 year ago
  1. Unclear. Maybe we should just move all the code in main to task group right away? I also wonder if panicking main task doesn't kill the whole process anyway. The problem is that the TaskGroup can be cloned and used for sub-tasks, and they usually already inside their own task that is already tracked. And then each use of TaskGroup will have to remember to call something that disables panic on drop, which I'm not sure if all the code already is doing. Seems kind of easier to rename main to main_inner that take TaskGroup already and have everything in the tasks from there on.
  2. Yes.
  3. We probably could avoid parallize shutdown (.join on the result), and just do wrap it in a timeout to a deadline. If the join had a timeout, print the name of the hanged one, extend deadline by 1 second, move to the next one (which had enough time to wrap up by then).
  4. Logging which ones are still on might be past the point of diminishing return.

I wouldn't spend time on 3 and 4 RN, unless you have them all figure out already.

@elsirion