DARMA-tasking / vt

DARMA/vt => Virtual Transport

Hang detection sometimes produces infinite output #1062

Open PhilMiller opened 4 years ago

PhilMiller commented 4 years ago

Describe the bug

As seen here: https://github.com/DARMA-tasking/vt/pull/1059/checks?check_run_id=1135514073

The following message is printed indefinitely:

vt: [0] termination: Progress has stalled, but hang detection implies messages are in flight!

This can blow up job logs. The output should be throttled or limited in some fashion, or hang detection should behave more aggressively in tests and fail the run immediately.
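To make the throttling idea concrete, here is a minimal sketch of one possible approach: report the stall with exponential backoff and stop after a fixed number of reports. `StallReporter` and its members are hypothetical names made up for illustration. They are not part of vt, and this is not how vt's hang detection is implemented.

```cpp
#include <cstdint>

// Hypothetical helper, not a vt API: decides whether a given stall
// notification should actually be printed.
class StallReporter {
public:
  explicit StallReporter(std::uint64_t max_reports)
    : max_reports_(max_reports) {}

  // Call once per stalled progress check.
  // Returns true only if this particular call should emit output.
  bool shouldReport() {
    ++stalls_;
    if (reports_ >= max_reports_) {
      return false;  // hard cap reached: stay silent from now on
    }
    if (stalls_ < next_report_) {
      return false;  // still backing off
    }
    next_report_ *= 2;  // report on stall 1, 2, 4, 8, ...
    ++reports_;
    return true;
  }

  // Call when progress resumes, so that a later stall is reported promptly.
  void reset() {
    stalls_ = 0;
    reports_ = 0;
    next_report_ = 1;
  }

  std::uint64_t reports() const { return reports_; }

private:
  std::uint64_t max_reports_;
  std::uint64_t stalls_ = 0;
  std::uint64_t reports_ = 0;
  std::uint64_t next_report_ = 1;
};
```

The code that currently prints the message would guard the print with `shouldReport()`. In a test configuration, the cap could be set to 1 and the run aborted on the first report instead.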

lifflander commented 4 years ago

After some experimentation, I've determined that this is typically due to a hang while in runInEpochCollective. If a hang happens while the user is running in the top-level scheduler, it exits properly. Otherwise, it keeps turning the scheduler and producing more output. I'm not exactly sure why, but the code that is supposed to break out or abort when a hang happens needs to be examined.
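As a rough illustration of the behavior we want, here is a toy model of a nested scheduler loop that stops turning once a hang has been flagged. Everything in it (`ToyScheduler`, `hang_detected`, `runUntil`) is a hypothetical stand-in made up for this sketch. It is not vt's scheduler or the actual runInEpochCollective code, and the real cause still needs to be tracked down.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for a scheduler; none of these names are vt APIs.
struct ToyScheduler {
  bool hang_detected = false;                  // set by (simulated) hang detection
  int turns = 0;                               // number of scheduler turns so far
  std::function<void(ToyScheduler&)> on_turn;  // simulates the work done in one turn

  void turn() {
    ++turns;
    if (on_turn) {
      on_turn(*this);
    }
  }

  // Nested loop: spin until `done` returns true, but abort as soon as a hang
  // has been flagged instead of turning the scheduler forever.
  void runUntil(std::function<bool()> const& done) {
    while (!done()) {
      turn();
      if (hang_detected) {
        throw std::runtime_error("hang detected in nested scheduler loop");
      }
    }
  }
};
```

Without the `hang_detected` check inside the loop, a `done` condition that never becomes true would keep the loop turning, and the stall message would be printed on every detection.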

lifflander commented 3 years ago

We need a consistent reproducer for this to be fixed.