chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Feature request: non-yielding task lockup detection #9721

Open cassella opened 6 years ago

cassella commented 6 years ago

I think it would be useful if the runtime could detect conditions like the one in #9425 and (I suspect) the one in https://stackoverflow.com/questions/50278292/forall-not-completing-for-some-domain-sizes, where all the threads are executing tasks that never yield or exit.

I could imagine a thread that wakes up every n seconds and checks whether any ready tasks haven't executed in the last interval, or whether any tasks haven't yielded when there are ready tasks. I imagine the thread would be outside the pool used for tasks.

The checking thread wouldn't exist, and no extra state tracking would be done, when compiled with --fast or when the feature's own option is turned off.

This would make it more obvious what's going on in such cases as the above.

This would trigger falsely if every thread is correctly executing a compute task that simply doesn't yield within the interval. Maybe that's OK to have on by default if the interval is long enough for any run compiled without --fast? Or the checking could be opt-in even without --fast, though then you'd have to think to look for it.

Or when the user gets impatient and hits ^C, can anything along these lines be reported then? ("A task has been runnable for n seconds"?)
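
To make the proposed check concrete, here is a minimal sketch of the test a watchdog thread might run each time it wakes up. It is written in Python for illustration only and is not Chapel runtime code; `WorkerState`, `check_lockup`, and the bookkeeping fields are hypothetical names, and it assumes the runtime can record when each thread last switched tasks and when each ready task was queued.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkerState:
    """Hypothetical per-thread bookkeeping the runtime would have to maintain."""
    busy: bool          # True while the thread is running a task
    last_switch: float  # time the thread last started, yielded, or finished a task


def check_lockup(workers: List[WorkerState],
                 ready_since: List[float],
                 now: float,
                 interval: float) -> Optional[str]:
    """Return a warning if ready tasks are starved by non-yielding tasks.

    ready_since holds the time at which each ready (runnable but not
    running) task was queued. All times are in seconds.
    """
    if not ready_since:
        return None  # nothing is waiting, so nothing is starved
    oldest_wait = now - min(ready_since)
    if oldest_wait < interval:
        return None  # the oldest ready task has not waited long enough
    # Every thread must have been in the same task for the whole interval.
    all_stuck = bool(workers) and all(
        w.busy and now - w.last_switch >= interval for w in workers
    )
    if not all_stuck:
        return None
    return (f"warning: a task has been runnable for {oldest_wait:.0f} seconds "
            f"while all {len(workers)} threads ran tasks that did not yield")
```

A thread outside the task pool could call `check_lockup` once per interval, and the same function could be called from a ^C handler to print the report on demand. As noted above, it cannot tell a lockup from a long computation that legitimately never yields, so the interval would need to be generous.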

mppf commented 6 years ago

There is a "task report" and "deadlock detection" feature. I personally think these features can't really address the problem without stack tracing, and they don't work on our default tasking layer (qthreads), and they're probably pretty buggy. See

along with their accompanying COMPOPTS/compopts and execopts for example. I'm pointing this out because you probably hadn't heard of them and they're at least prior work in this area, even if they need attention to be really useful.

ronawho commented 6 years ago

Yeah, they're experimental/buggy and only work half the time under fifo. I'd actually like to remove the functionality since it's not practically useful today. https://chapel-lang.org/docs/1.17/usingchapel/debugging.html#flags-for-tracking-tasks is the documentation for them.

This sort of deadlock detection sounds great, but like we saw with the --taskreport and --blockreport features, it's hard to get right, particularly in multi-locale programs.