SystemsGenetics / ACE

Accelerated Computational Engine (ACE) is a GPU-enabled framework to simplify creation of GPU-capable applications
http://SystemsGenetics.github.io/ACE
GNU General Public License v2.0

Error code when running ACE applications with "mpirun -np 1" #93

Closed · bentsherman closed this 5 years ago

bentsherman commented 5 years ago

If I run an ACE application with mpirun -np 1, the program returns exit code 1 even though it finishes successfully:

$ mpirun -np 1 aceex run import --in Yeast.txt --out Yeast.tab
1%      
[...]
100%
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 10441 on
node eceftl3 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.

--------------------------------------------------------------------------
$ echo $?
1

I know the import analytic doesn't actually use MPI; it's just the easiest one to showcase. The same thing happens no matter which analytic you run. This is also a bit of a degenerate case, since you could just run aceex without mpirun, so in practice the user can simply do that.

The problem arises with scripts that run ACE applications automatically. For example, take my benchmarking pipeline:

https://github.com/bentsherman/benchmark-nf/blob/master/kinc/main.nf#L345

This script essentially runs KINC many times while varying the number of processes. So in the case of np = 1, KINC will run successfully but return exit code 1, which Nextflow interprets as a failure.

This is admittedly a minor issue. I could probably write some bash logic to get around it in my Nextflow pipeline, but I wanted to go ahead and document the problem. @4ctrl-alt-del do you think it would be easy to resolve this issue? I'm pretty sure the cause is that ACE uses the Single runner instead of the MPI runner when there is only one process, so MPI_Init() and MPI_Finalize() are never called.
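(For reference, the "exiting improperly" message above is what Open MPI's mpirun prints when a rank's init/finalize bookkeeping does not pair up at exit. A minimal standalone program like the one below, run under mpirun -np 1, reproduces the same complaint via reason 2 above; it is only an illustration of the failure mode, not ACE code.)

#include <mpi.h>

int main(int argc, char** argv)
{
   // Initialize MPI but deliberately skip MPI_Finalize(); the program itself
   // returns 0, yet mpirun reports the rank as exiting improperly and
   // returns a non-zero exit code.
   MPI_Init(&argc, &argv);
   return 0;
}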

bentsherman commented 5 years ago

@4ctrl-alt-del Here's a fix I came up with:

void AbstractManager::finish()
{
   // Add the debug header.
   EDEBUG_FUNC(this);

   // Call the finish interface for this manager's analytic.
   _analytic->finish();

   // Call all this object's output abstract data finish interfaces and then their
   // finalize methods.
   for (auto data: qAsConst(_outputData))
   {
      data->data()->finish();
      data->finalize();
   }

   // Shutdown MPI instance to ensure that MPI_Finalize() is called.
   QMPI::shutdown();

   // Signal this manager is finished with execution and ready to be deleted.
   emit finished();
}

That way, no matter which runner is used, MPI_Finalize() will always be called if MPI was initialized. This fixed the problem for the test case above, and it doesn't cause any problems if you remove mpirun.
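(As a sketch of the underlying idea: QMPI's internals aren't shown in this thread, so the helper below is only a guess at what the guarded shutdown might look like. Finalize only if MPI was actually started and hasn't already been torn down, so the call is a harmless no-op for a Single, non-MPI run.)

#include <mpi.h>

// Hypothetical free-function version of the guard; in ACE the equivalent
// logic would live inside QMPI::shutdown().
static void shutdownMpiIfNeeded()
{
   int initialized {0};
   int finalized {0};
   MPI_Initialized(&initialized);
   MPI_Finalized(&finalized);

   // Only finalize if MPI was started and has not been finalized yet.
   if ( initialized && !finalized )
   {
      MPI_Finalize();
   }
}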

The way you designed the QMPI class made this really easy :)

bentsherman commented 5 years ago

Actually, it may not be that simple. I was testing KINC with this fix and got this:

$ mpirun -np 5 kinc run similarity --input Yeast.emx --ccm Yeast.emx --cmx Yeast.cmx --clusmethod gmm --corrmethod spearman --preout --postout
1%  24m34s  1d16h32m6s
[...]
99% 15h13m37s   9m13s
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
*** The MPI_Send() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[(null):1474] Local abort after MPI_FINALIZE completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Iprobe() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[(null):1476] Local abort after MPI_FINALIZE completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Iprobe() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[(null):1478] Local abort after MPI_FINALIZE completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Iprobe() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[(null):1475] Local abort after MPI_FINALIZE completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Iprobe() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[(null):1477] Local abort after MPI_FINALIZE completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!

So the QMPI::shutdown() call that I inserted seems to be premature when there are multiple processes; I bet it's because of the termination signal that the master sends to all of the workers at the end.

bentsherman commented 5 years ago

Also, this error happened only in CPU mode (5 out of 5 runs failed) but not in GPU mode (3 out of 3 succeeded), so I don't know whether that has anything to do with it.

4ctrl-alt-del commented 5 years ago

I must admit this is a very strange bug. Who would run mpirun -np 1? Anyway, I fixed it in commit 21bf42de7e29eb9916356d9f4c6d2d1719ebf730. The problem was that ACE has no way to differentiate between a user running mpirun -np 1 and running the application without MPI at all. So mpirun -np 1 makes ACE do a Single analytic run, which has no concept of safely shutting down MPI. I solved it by adding a safety "catch-all" shutdown in EApplication that runs when everything is finished.
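(The commit itself isn't reproduced in this thread; the snippet below is only a sketch of the general "catch-all" pattern, with hypothetical stand-ins for ACE's own startup and runner code: whatever run path was taken, make sure MPI is finalized exactly once at the very end of the application, using the same kind of guard sketched earlier.)

#include <mpi.h>

// Hypothetical stand-in for running an analytic through whichever runner
// applies (Single, MPI, ...); in ACE this happens inside EApplication.
static void runAnalytic() { /* ... */ }

int main(int argc, char** argv)
{
   MPI_Init(&argc, &argv);

   runAnalytic();

   // Catch-all shutdown: whether or not the runner already shut MPI down,
   // finalize exactly once before exiting so mpirun sees a clean termination.
   int finalized {0};
   MPI_Finalized(&finalized);
   if ( !finalized )
   {
      MPI_Finalize();
   }

   return 0;
}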

bentsherman commented 5 years ago

I still have a similar issue to the one above; this time the error is slightly different:

*** The MPI_Comm_free() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[(null):5795] Local abort after MPI_FINALIZE completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!

But it has the same general cause: ACE is still calling MPI functions somewhere after MPI is shut down. We need some way to verify that no MPI calls are pending before calling MPI_Finalize(). I should note that this error happens for any number of processes, so the develop branch is somewhat broken right now. :/

In the first case the culprit was MPI_Iprobe(), which is called in QMPI::probe(), which is in turn called repeatedly by a timer. Maybe my initial solution would work if we just cancel that timer event in QMPI::shutdown().
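(A rough sketch of that ordering, with a purely illustrative class since the real QMPI probe loop isn't shown here: stop the probe timer first, so MPI_Iprobe() can no longer fire, and only then call MPI_Finalize().)

#include <QObject>
#include <QTimer>
#include <mpi.h>

class ProbeLoop : public QObject
{
public:
   explicit ProbeLoop(QObject* parent = nullptr) : QObject(parent)
   {
      // Poll for incoming messages on a timer, the way QMPI::probe() is
      // driven in ACE.
      connect(&_timer, &QTimer::timeout, this, &ProbeLoop::probe);
      _timer.start(10);
   }

   void shutdown()
   {
      _timer.stop();    // 1. stop the timer so no further MPI calls can fire
      MPI_Finalize();   // 2. only then tear MPI down
   }

private:
   void probe()
   {
      int flag {0};
      MPI_Status status;
      MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
      // ... handle an incoming message if flag is set ...
   }

   QTimer _timer;
};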

In the second case the culprit was MPI_Comm_free(), which is only called once, to release the local comm object. I guess you could fix that by adding a barrier?

QMPI::~QMPI()
{
   if ( !_failed )
   {
      // Release the node-local communicator, wait at a barrier so every rank
      // reaches this point, and only then finalize, so no rank tears MPI down
      // while another still has calls in flight.
      MPI_Comm_free(&_local);
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Finalize();
   }
}

4ctrl-alt-del commented 5 years ago

I love it when a bug fix exposes another bug. Fixed the QMPI shutdown bug in commit b84e71cf18c3ef7f64f6642394d4ee985081961c. Let me know if this works for you, @bentsherman.

4ctrl-alt-del commented 5 years ago

@bentsherman bump

4ctrl-alt-del commented 5 years ago

I am closing this myself because the bug is fixed on my end and @bentsherman has not responded in over a week.

bentsherman commented 4 years ago

@4ctrl-alt-del Sorry I never got back to you on this. I think I was waiting to confirm that the fix worked (which it did), and then I got caught up in other things.