chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.79k stars 421 forks source link

Deadlock/hang in writeThis method used across locales #7710

Open benharsh opened 7 years ago

benharsh commented 7 years ago

Summary of Problem

A hanging/deadlocking program can be written by implementing a writeThis method on a class and writing said class across locales.

This appears to have been possible for many years. It was originally suspected that there was an issue with thread IDs, but our tasking runtime has changed drastically since the bug was created such that it seems unlikely that the original suspicion is still valid.

Associated Future Test(s): test/multilocale/bradc/needMultiLocales/writeThisUsingOn.chpl

gbtitus commented 7 years ago

The original test was added in 2008, so it certainly used tasks=fifo. Thus the deadlock here may have had the same root cause as for multilocale/deitz/needMultiLocales/test_big_recursive_on, namely: a tasking layer that cannot overload tasks on threads may deadlock on back-and-forth on-stmts, depending on the details of how the comm layer is coded. I tried to confirm this hypothesis just now, but there's a compile-time gripe about the underfined Writer type and I'm busy with something else. If the hypothesis is true, though, then this should not deadlock with tasks=qthreads no matter what the comm setting is, and all that's needed is a skipif tasks=fifo.

benharsh commented 7 years ago

I updated the future recently to fix the compilation error, you may need to sync with master.

gbtitus commented 7 years ago

Aha, yes, I was working in an older branch that didn't have that. When I resync and try it out, it still deadlocks with tasks=qthreads, either with comm=gasnet or comm=ugni. So it's not the same root cause as for multilocale/deitz/needMultiLocales/test_big_recursive_on.

cassella commented 5 years ago

The test works with this change inside C.writeThis(x), directing the writes to x instead of fresh top-level write() calls:

--- a/test/multilocale/bradc/needMultiLocales/writeThisUsingOn.chpl
+++ b/test/multilocale/bradc/needMultiLocales/writeThisUsingOn.chpl
@@ -25,8 +25,8 @@ class C {
     for loc in LocaleSpace {
       on Locales(loc) {
         if loc != 0 then
-          write(" ");
-        write(locCs(loc));
+          x.write(" ");
+        x.write(locCs(loc));
       }
     }
   }

and a clean exit before the "Timed out" halt triggers:

@@ -41,9 +41,10 @@ cobegin {
   {
     var myC = new unmanaged C();
     writeln("C is: ", myC);
+    exit(0);
   }
   {
cassella commented 5 years ago

Though it also works if I leave the write()s alone and remove the on in C.writeThis(), so the problem's not /just/ about calling write() within a writeThis().

Also, it hangs with the on in C.writeThis() in place, but the on in LocC.writeThis() removed.

cassella commented 5 years ago

It also hangs like this (diff against an unmodified test), with all the ons removed, and the one in C.writeThis() replaced with a sync begin {, and run with -nl 1.

--- test/multilocale/bradc/needMultiLocales/writeThisUsingOn.chpl       2019-03-06 06:58:21.548287558 -0800
+++ test/multilocale/bradc/needMultiLocales/writeThisUsingBegin.chpl    2019-03-06 06:58:55.641707112 -0800
@@ -4,7 +4,7 @@
   var id: int;

   proc writeThis(x) {
-    on this {
+    { // on this {
       x.write(id);
     }
   }
@@ -15,7 +15,7 @@

   proc postinit() {
     for loc in LocaleSpace {
-      on Locales(loc) {
+      { //on Locales(loc) {
         locCs(loc) = new unmanaged LocC(loc);
       }
     }
@@ -23,7 +23,7 @@

   proc writeThis(x) {
     for loc in LocaleSpace {
-      on Locales(loc) {
+      sync begin {
         if loc != 0 then
           write(" ");
         write(locCs(loc));

Making the same x.write() change to this variant also does not hang. So I think it has to do with reentering the "top-level" write() from a writeThis() on multiple tasks. Running on multiple locales certainly involved multiple tasks.