Closed dpc closed 1 year ago
I'd say we should probably add a small task polling the is_shutting_down flag and, if it is set to true, print that shutdown was detected, wait 30 seconds, print that the timeout was reached, and call std::process::exit(-3) or something similar to force process termination, unless a certain other flag is set (from join_all after it finished).
(Obviously this task has to be outside of the task group itself, just a one off only for this purpose).
This should take care of all but the weirdest bugs that would cause a hang.
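To make the idea concrete, here is a rough sketch of such a watchdog. It uses plain std threads and atomics as a stand-in for the real task machinery, and everything in it (`watch_shutdown`, `WatchdogOutcome`, the two flags, the poll interval) is a hypothetical name chosen for illustration, not existing fedimint API.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// What the watchdog observed (hypothetical type, for illustration only).
#[derive(Debug, PartialEq, Eq)]
enum WatchdogOutcome {
    /// The "join finished" flag was set before the grace period ran out.
    CleanShutdown,
    /// Shutdown was requested but the tasks did not finish in time.
    TimedOut,
}

/// Polls `is_shutting_down`; once it is set, waits up to `grace` for
/// `join_finished` to be set as well. Meant to run outside the task group.
fn watch_shutdown(
    is_shutting_down: &AtomicBool,
    join_finished: &AtomicBool,
    grace: Duration,
    poll: Duration,
) -> WatchdogOutcome {
    // Phase 1: wait until a shutdown is requested.
    while !is_shutting_down.load(Ordering::SeqCst) {
        if join_finished.load(Ordering::SeqCst) {
            return WatchdogOutcome::CleanShutdown;
        }
        thread::sleep(poll);
    }
    eprintln!("shutdown detected, waiting up to {grace:?} for tasks to finish");

    // Phase 2: give the tasks a grace period to wrap up.
    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        if join_finished.load(Ordering::SeqCst) {
            return WatchdogOutcome::CleanShutdown;
        }
        thread::sleep(poll);
    }
    // One last check to avoid reporting a timeout for a task set that
    // finished during the final sleep.
    if join_finished.load(Ordering::SeqCst) {
        return WatchdogOutcome::CleanShutdown;
    }
    eprintln!("shutdown timeout reached");
    WatchdogOutcome::TimedOut
}
```

The caller would run this on a dedicated thread with a 30 second grace period and call `std::process::exit(-3)` when it returns `WatchdogOutcome::TimedOut`. Returning an outcome instead of exiting inside the function keeps it testable.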
Yes, and maybe print the running tasks/threads every second during that time. That way we can debug which task is misbehaving. We should also try to name our tasks/threads for this purpose.
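A minimal sketch of the naming part, assuming std threads rather than whatever the task group actually spawns: `spawn_named` and `running_task_names` are hypothetical helpers, while `thread::Builder::name`, `JoinHandle::is_finished` and `Thread::name` are regular std APIs.

```rust
use std::io;
use std::thread::{self, JoinHandle};

/// Spawns a thread that carries a human-readable name (hypothetical helper).
fn spawn_named<F>(name: &str, f: F) -> io::Result<JoinHandle<()>>
where
    F: FnOnce() + Send + 'static,
{
    thread::Builder::new().name(name.to_owned()).spawn(f)
}

/// Returns the names of all threads that have not finished yet.
/// Calling this once per second during shutdown shows which task hangs.
fn running_task_names(handles: &[JoinHandle<()>]) -> Vec<String> {
    handles
        .iter()
        .filter(|h| !h.is_finished())
        .map(|h| h.thread().name().unwrap_or("<unnamed>").to_owned())
        .collect()
}
```

During the shutdown grace period, a loop could log `running_task_names(&handles)` every second until the list is empty or the deadline passes.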
Is the TaskGroup supposed to shut down at all right now if the main task panics? I don't see it implemented anywhere. So I think we need to do two things:
Drop for TaskGroup
shutdown can take
Some debugging improvements:
Does this sound good @dpc?
TaskGroup can be cloned and used for sub-tasks, and they are usually already inside their own task that is already tracked. And then each use of TaskGroup will have to remember to call something that disables panic on drop, which I'm not sure all the code is already doing. It seems easier to rename main to main_inner, which already takes a TaskGroup, and have everything in the tasks from there on.
.join on the result), and just wrap it in a timeout with a deadline. If the join timed out, print the name of the hung task, extend the deadline by 1 second, and move on to the next one (which had enough time to wrap up by then).
I wouldn't spend time on 3 and 4 right now, unless you have them all figured out already.
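The join-with-deadline idea could look roughly like this. It is only a sketch under stated assumptions: it uses std threads and polls `JoinHandle::is_finished` instead of wrapping an async join in a timeout, and `join_with_deadline` together with its `extension` parameter are hypothetical names.

```rust
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Joins the handles one by one against a shared deadline.
/// A task that misses the deadline is reported by name, the deadline is
/// pushed back by `extension`, and the next task is checked.
/// Returns the names of the tasks that did not finish.
fn join_with_deadline(
    handles: Vec<JoinHandle<()>>,
    timeout: Duration,
    extension: Duration,
) -> Vec<String> {
    let mut deadline = Instant::now() + timeout;
    let mut hung = Vec::new();

    for handle in handles {
        // Poll until the task finishes or the shared deadline passes.
        while !handle.is_finished() && Instant::now() < deadline {
            thread::sleep(Duration::from_millis(5));
        }

        if handle.is_finished() {
            // Finished in time; a panic inside the task is ignored here.
            let _ = handle.join();
        } else {
            let name = handle.thread().name().unwrap_or("<unnamed>").to_owned();
            eprintln!("task `{name}` did not finish before the deadline");
            hung.push(name);
            // Later tasks already had the full timeout to wrap up, so a
            // small extension is enough to check them.
            deadline += extension;
            // Dropping the handle here detaches the hung thread.
        }
    }
    hung
}
```

With a 30 second `timeout` and a 1 second `extension` this matches the numbers discussed above. A non-empty return value is the signal to force process termination.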
@elsirion
On an error I spotted in https://github.com/fedimint/fedimint/issues/1417#issue-1549749091, fedimint killed all the peer connections but then did not stop, preventing systemd from starting it again.