Open benharsh opened 7 years ago
The original test was added in 2008, so it certainly used tasks=fifo. Thus the deadlock here may have had the same root cause as for multilocale/deitz/needMultiLocales/test_big_recursive_on
, namely: a tasking layer that cannot overload tasks on threads may deadlock on back-and-forth on-stmts, depending on the details of how the comm layer is coded. I tried to confirm this hypothesis just now, but there's a compile-time gripe about the underfined Writer
type and I'm busy with something else. If the hypothesis is true, though, then this should not deadlock with tasks=qthreads no matter what the comm setting is, and all that's needed is a skipif tasks=fifo.
I updated the future recently to fix the compilation error, you may need to sync with master.
Aha, yes, I was working in an older branch that didn't have that. When I resync and try it out, it still deadlocks with tasks=qthreads, either with comm=gasnet or comm=ugni. So it's not the same root cause as for multilocale/deitz/needMultiLocales/test_big_recursive_on
.
The test works with this change inside C.writeThis(x)
, directing the writes to x
instead of fresh top-level write()
calls:
--- a/test/multilocale/bradc/needMultiLocales/writeThisUsingOn.chpl
+++ b/test/multilocale/bradc/needMultiLocales/writeThisUsingOn.chpl
@@ -25,8 +25,8 @@ class C {
for loc in LocaleSpace {
on Locales(loc) {
if loc != 0 then
- write(" ");
- write(locCs(loc));
+ x.write(" ");
+ x.write(locCs(loc));
}
}
}
and a clean exit before the "Timed out" halt triggers:
@@ -41,9 +41,10 @@ cobegin {
{
var myC = new unmanaged C();
writeln("C is: ", myC);
+ exit(0);
}
{
Though it also works if I leave the write()
s alone and remove the on
in C.writeThis()
, so the problem's not /just/ about calling write()
within a writeThis()
.
Also, it hangs with the on
in C.writeThis()
in place, but the on
in LocC.writeThis()
removed.
It also hangs like this (diff against an unmodified test), with all
the on
s removed, and the one in C.writeThis()
replaced with a
sync begin {
, and run with -nl 1
.
--- test/multilocale/bradc/needMultiLocales/writeThisUsingOn.chpl 2019-03-06 06:58:21.548287558 -0800
+++ test/multilocale/bradc/needMultiLocales/writeThisUsingBegin.chpl 2019-03-06 06:58:55.641707112 -0800
@@ -4,7 +4,7 @@
var id: int;
proc writeThis(x) {
- on this {
+ { // on this {
x.write(id);
}
}
@@ -15,7 +15,7 @@
proc postinit() {
for loc in LocaleSpace {
- on Locales(loc) {
+ { //on Locales(loc) {
locCs(loc) = new unmanaged LocC(loc);
}
}
@@ -23,7 +23,7 @@
proc writeThis(x) {
for loc in LocaleSpace {
- on Locales(loc) {
+ sync begin {
if loc != 0 then
write(" ");
write(locCs(loc));
Making the same x.write()
change to this variant also does not hang.
So I think it has to do with reentering the "top-level" write()
from
a writeThis()
on multiple tasks. Running on multiple locales
certainly involved multiple tasks.
Summary of Problem
A hanging/deadlocking program can be written by implementing a writeThis method on a class and writing said class across locales.
This appears to have been possible for many years. It was originally suspected that there was an issue with thread IDs, but our tasking runtime has changed drastically since the bug was created such that it seems unlikely that the original suspicion is still valid.
Associated Future Test(s):
test/multilocale/bradc/needMultiLocales/writeThisUsingOn.chpl