charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.

ckexit with interop hangs sometimes #903

Closed nikhil-jain closed 5 years ago

nikhil-jain commented 8 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/903


In some cases, calling ckexit hangs in non-smp and smp mode.

PhilMiller commented 5 years ago

Original date: 2015-12-04 20:28:45


Proposed patch: https://charm.cs.illinois.edu/gerrit/913 (https://github.com/UIUC-PPL/charm/commit/01ace8856c08833bca4629d673cd6a2bc42cb720), with some planned revisions for thread safety. That work should happen Monday 12/7.

PhilMiller commented 5 years ago

Original date: 2015-12-07 19:46:37


Discussion w/ Nikhil:

Generate local immediate messages to run on the comm thread, taking the exitCount variable check off the comm thread's normal execution path entirely.
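
To make the idea concrete, here is a rough, single-threaded toy model of it. Everything in it (comm_state, comm_post_immediate, comm_request_exit, comm_progress, the fixed-size queue) is a hypothetical name made up for illustration, not the Charm++ or Converse API, and a real version would need a thread-safe queue between the worker threads and the comm thread. The point is only that the exit count is touched inside an immediate handler, so the regular progress loop never has to poll it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IMMEDIATE_QUEUE_CAP 16  /* arbitrary capacity for this sketch */

/* Hypothetical handler type for an "immediate" message. */
typedef void (*immediate_fn)(void *arg);

typedef struct {
    immediate_fn fn;
    void *arg;
} immediate_msg;

/* Hypothetical comm-thread state; NOT thread safe as written. */
typedef struct {
    immediate_msg slots[IMMEDIATE_QUEUE_CAP];
    size_t head;      /* next slot to pop */
    size_t count;     /* number of queued messages */
    int num_pes;      /* worker PEs that must request exit */
    int exit_count;   /* exit requests handled so far */
    bool should_exit; /* set once every PE has requested exit */
} comm_state;

void comm_init(comm_state *cs, int num_pes) {
    assert(num_pes > 0);
    cs->head = 0;
    cs->count = 0;
    cs->num_pes = num_pes;
    cs->exit_count = 0;
    cs->should_exit = false;
}

/* Queue a handler for the comm thread; returns false if the queue is full. */
bool comm_post_immediate(comm_state *cs, immediate_fn fn, void *arg) {
    if (cs->count == IMMEDIATE_QUEUE_CAP) return false;
    size_t tail = (cs->head + cs->count) % IMMEDIATE_QUEUE_CAP;
    cs->slots[tail].fn = fn;
    cs->slots[tail].arg = arg;
    cs->count++;
    return true;
}

/* Runs on the comm thread only, so exit_count is updated in one place. */
static void exit_request_handler(void *arg) {
    comm_state *cs = (comm_state *)arg;
    cs->exit_count++;
    if (cs->exit_count == cs->num_pes) cs->should_exit = true;
}

/* What a worker PE would call instead of bumping a shared counter itself. */
bool comm_request_exit(comm_state *cs) {
    return comm_post_immediate(cs, exit_request_handler, cs);
}

/* One iteration of the comm loop: drain immediates, then do normal work. */
void comm_progress(comm_state *cs) {
    while (cs->count > 0) {
        immediate_msg m = cs->slots[cs->head];
        cs->head = (cs->head + 1) % IMMEDIATE_QUEUE_CAP;
        cs->count--;
        m.fn(m.arg);
    }
    /* Regular network progress would go here; no exit_count check needed. */
}
```

In this model the comm loop would keep calling comm_progress until should_exit becomes true, and each worker would call comm_request_exit exactly once on its way out.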

PhilMiller commented 5 years ago

Original date: 2015-12-07 19:47:43


The bug description mentions non-smp builds too. How does that come up?

nikhil-jain commented 5 years ago

Original date: 2017-09-20 20:31:36


This issue was reproducible using the multirun_time code in examples/charm++/mpi-coexist.

epmikida commented 5 years ago

Original date: 2017-11-06 21:23:30


This issue, or a related one, is now coming up in Charades as well. I still need to explore more, but for me it's a hang much earlier on; simple programs also hang on exit. For example, the following program (with a trivial main chare) hangs for netlrts-linux-x86_64 in SMP mode:

<code class="c">
int main(int argc, char** argv) {
  CharmInit(argc, argv);

  CharmLibExit();
  return 0;
}
</code>

epmikida commented 5 years ago

Original date: 2017-11-06 21:24:08


The example from the interop documentation (examples/charm++/user-driven-interop) also hangs in smp mode.

epmikida commented 5 years ago

Original date: 2019-03-28 18:46:53


user-driven-interop does not reproduce this bug; what I saw was actually just an error in the test that prevented it from running in SMP mode at all. The small example posted above also no longer reproduces the issue. The other interop examples in examples are currently broken in general and crash for other reasons.

epmikida commented 5 years ago

Original date: 2019-04-11 18:32:22


After fixing up the MPI interop examples in mpi-coexist (https://charm.cs.illinois.edu/gerrit/c/charm/+/5051 https://github.com/UIUC-PPL/charm/commit/44d96eb8e463e64321c7c20677b3fee1b9864644), this bug is reproducible again, at least on mpi-darwin-x86_64 smp builds.

epmikida commented 5 years ago

Original date: 2019-04-19 03:11:19


I've cleaned up Nikhil's original patch (https://charm.cs.illinois.edu/gerrit/913 https://github.com/UIUC-PPL/charm/commit/01ace8856c08833bca4629d673cd6a2bc42cb720) to fix this issue. It does not include the immediate messaging improvement that he and Phil alluded to above, but since that was intended as an improvement on the bug fix and not the bug fix itself, I think patch 913 should still be merged rather than leaving interop exit broken.