Open PhilMiller opened 4 years ago
After some experimentation, I'm determined that this is typically due to a hang while in runInEpochCollective
. If a hang happens and the user is running in the top-level scheduler it exits properly. However, if that's not the case it continues turning it and causing more output. I'm not exactly sure why but the code that supposed to break out/abort when hangs happen needs to be examined.
We need a consistent reproducer for this to be fixed.
Describe the bug
As seen here: https://github.com/DARMA-tasking/vt/pull/1059/checks?check_run_id=1135514073
vt: [0] termination: Progress has stalled, but hang detection implies messages are in flight!
Will print indefinitely. This can blow up job logs. This should be throttled or limited in some fashion, or hang detection should behave more aggressively in tests, and fail the run immediately.To Reproduce Steps to reproduce the behavior:
Expected behavior A clear and concise description of what you expected to happen.
Screenshots If applicable, add screenshots to help explain your problem.
Platform (please complete the following information):
Additional context Add any other context about the problem here.