Open cassella opened 6 years ago
There are "task report" and "deadlock detection" features. I personally think these features can't really address the problem without stack tracing, they don't work on our default tasking layer (qthreads), and they're probably pretty buggy. See
along with their accompanying COMPOPTS/compopts and execopts for example. I'm pointing this out because you probably hadn't heard of them and they're at least prior work in this area, even if they need attention to be really useful.
Yeah, they're experimental/buggy and only work half the time under fifo. I'd actually like to remove the functionality since it's not practically useful today. https://chapel-lang.org/docs/1.17/usingchapel/debugging.html#flags-for-tracking-tasks is the documentation for them.
This sort of deadlock detection sounds great, but like we saw with the --taskreport and --blockreport features, it's hard to get right, particularly in multi-locale programs.
I think it would be useful if the runtime could detect conditions as in #9425, or https://stackoverflow.com/questions/50278292/forall-not-completing-for-some-domain-sizes (I suspect) where all the threads are executing tasks that never yield or exit.
I could imagine a thread that wakes up every n seconds and checks whether any ready tasks haven't executed in the last interval, or whether any tasks haven't yielded when there are ready tasks. I imagine the thread would be outside the pool used for tasks.
The checking thread wouldn't exist, and any extra state tracking wouldn't be done, when compiled with --fast, or when the individual option isn't enabled / is disabled.
This would make it more obvious what's going on in such cases as the above.
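To make the idea concrete, here is a minimal sketch of such a checking thread, written in Python rather than the runtime's C just to keep it short. Every name in it (`TaskTable`, `find_stalls`, `start_watchdog`, the `enabled` flag) is hypothetical and not an existing Chapel runtime API; it assumes the tasking layer can record when a task becomes ready, starts running, or yields.

```python
import threading
import time
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    READY = "ready"      # runnable, waiting for a thread
    RUNNING = "running"  # currently occupying a thread


@dataclass
class TaskInfo:
    name: str
    state: State
    since: float  # time of the last state change or yield


class TaskTable:
    """Hypothetical bookkeeping the tasking layer would update on spawn/yield/exit."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = {}

    def update(self, name, state, now=None):
        stamp = time.monotonic() if now is None else now
        with self._lock:
            self._tasks[name] = TaskInfo(name, state, stamp)

    def remove(self, name):
        with self._lock:
            self._tasks.pop(name, None)

    def snapshot(self):
        with self._lock:
            return list(self._tasks.values())


def find_stalls(tasks, now, interval):
    """Report ready tasks that have not run, and running tasks that have
    not yielded while some task is ready, for at least `interval` seconds."""
    warnings = [
        f"task {t.name} has been runnable for {now - t.since:.0f}s without executing"
        for t in tasks
        if t.state is State.READY and now - t.since >= interval
    ]
    if any(t.state is State.READY for t in tasks):
        warnings += [
            f"task {t.name} has not yielded for {now - t.since:.0f}s while tasks are ready"
            for t in tasks
            if t.state is State.RUNNING and now - t.since >= interval
        ]
    return warnings


def start_watchdog(table, interval, report=print, enabled=True):
    """Start the checker on its own thread, outside the task pool.
    With enabled=False (standing in for --fast) no thread is created."""
    if not enabled:
        return None
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout, so this wakes every `interval` seconds.
        while not stop.wait(interval):
            for warning in find_stalls(table.snapshot(), time.monotonic(), interval):
                report(warning)

    threading.Thread(target=loop, daemon=True, name="task-watchdog").start()
    return stop  # call stop.set() to shut the checker down
```

The check itself is a pure function of a snapshot, so it can be tested without threads; the cost in a real runtime would be the per-task timestamp updates, which is why the sketch skips everything when `enabled` is false.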
This would trigger falsely if every thread is correctly executing a computation task that's just not yielding for however long the interval is. Maybe that's ok to be on by default if the interval is long enough for any run without --fast? Or this checking could require opting in even in the absence of --fast, though then you'd have to think to look for it.
Or when the user gets impatient and hits ^C, can anything along these lines be reported then? ("A task has been runnable for n seconds"?)
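
A rough sketch of the ^C variant, again in Python and again with made-up names (`runnable_since`, `format_report`, `on_interrupt`, `install_interrupt_report`); it assumes the runtime already keeps a timestamp for when each task became runnable. In a C runtime the handler would also have to respect async-signal-safety, which this sketch does not try to model.

```python
import signal
import sys
import time

# Hypothetical registry: task name -> monotonic time at which it became runnable.
runnable_since = {}


def format_report(registry, now):
    """Build one line per task that is still waiting to run."""
    return [
        f"A task ({name}) has been runnable for {now - since:.0f} seconds"
        for name, since in sorted(registry.items())
    ]


def on_interrupt(signum, frame):
    """SIGINT handler: print the report, then interrupt as ^C normally would."""
    for line in format_report(runnable_since, time.monotonic()):
        print(line, file=sys.stderr)
    raise KeyboardInterrupt


def install_interrupt_report():
    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGINT, on_interrupt)
```

Since this only reports when the user asks for it, the false-positive concern above does not apply; the open question is whether the timestamps are cheap enough to keep outside of --fast.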