BenoitMorel / AleRax

Gene tree - species tree reconciliation from gene tree distributions

[BUG] Stalling while using MPI #10

Open StefanFlaumberg opened 3 months ago

StefanFlaumberg commented 3 months ago

Dear Benoit,

Since recent commits, AleRax sometimes stalls when using MPI with a number of cores close to the number of families; this seems to be the same issue as reported previously here: https://groups.google.com/g/generaxusers/c/fr-b4ITbdHM. The issue is also readily reproducible with the tests from the run_alerax_tests.py script (the 4th test stalls).

Since the exact behaviour is quite variable, here is a detailed description:

For the commit 350996c and before:

  • If run with --per-species-rates or the default (global rates), a number of MPI ranks exceeding the number of valid families leads to a segfault at the first occurrence of the Optimizing model rates (light) step; any other number of MPI ranks works fine.
  • If run with --per-family-rates, any number of MPI ranks works fine.

Since the commit f0f94e0 (the first compilable version after 350996c):

  • Running with --per-species-rates or the default (global rates) behaves the same as before.
  • Running with --per-family-rates becomes more unpredictable. For example, if I run a species tree search on 3 families under the DTL model with 2 MPI ranks, it goes well, whereas the same run under the DL model stalls at the Optimizing model rates (thorough) step (the last optimization step). More typically, however, using more than 1 MPI rank causes a stall at the first occurrence of the Optimizing model rates (light) step.

To pinpoint the cause of the stalling I added some logging notes to the code. They showed that:

  • if more MPI ranks are set than the number of valid families, the stall happens in the AleState::serialize function, just before the last call of ParallelContext::barrier():

    [00:00:00] [Species search] Optimizing model rates (light)
    in AleEvaluator::optimizeModelRates: start per-family-rates mode
    in AleEvaluator::optimizeModelRates: in per-family-rates mode gonna optParam
    in optimizeParametersGradient: doing settings.onBetterParametersFoundCallback
    in AleOptimizer::onBetterParametersFoundCallback: saving checkpoint
    in AleState::serialize: at the end, before the last barrier
    ^C

  • otherwise (an adequate number of MPI ranks), the stall happens in the AleEvaluator::computeLikelihood function during the ParallelContext::sumDouble(sumLL) operation:

    [00:00:00] [Species search] Optimizing model rates (light)
    in AleEvaluator::optimizeModelRates: start per-family-rates mode
    in AleEvaluator::optimizeModelRates: in per-family-rates mode gonna optParam
    in optimizeParametersGradient: doing settings.onBetterParametersFoundCallback
    in AleOptimizer::onBetterParametersFoundCallback: saving checkpoint
    in AleState::serialize: at the end, before the last barrier
    in optimizeParametersGradient: done settings.onBetterParametersFoundCallback
    in AleEvaluator::optimizeModelRates: in per-family-rates mode Params optimized!
    in AleEvaluator::optimizeModelRates: in per-family-rates mode gonna computeLL
    in AleEvaluator::computeLikelihood: before mpi-summation sumLL=-0.083389
    ^C

Possible solution: I could entirely fix the problem by removing both ParallelContext::barrier(); statements in the AleState::serialize function in AleState.cpp. After this amendment all the tests from the run_alerax_tests.py script run well, and the behaviour reverts to that of the commit 350996c.

However, I am not familiar with the MPI implementation in C++ and do not fully understand why this solution actually works, so could you please have a look?
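For what it is worth, my rough mental model of the stall (which may well be wrong, hence the question) is a collective-call mismatch: MPI_Barrier, like the reduction behind ParallelContext::sumDouble, only returns once every rank in the communicator has made the matching call. If the ranks that received no valid families skip the code path that serializes the checkpoint, the ranks that did enter it wait at the barrier forever. A minimal sketch in plain MPI, outside of AleRax, that reproduces the pattern (the hasLocalFamilies flag and familyCount value are made up for illustration):

```cpp
// deadlock_sketch.cpp -- minimal illustration of the suspected hang, not AleRax code.
// Build and run: mpicxx deadlock_sketch.cpp -o deadlock_sketch && mpirun -np 4 ./deadlock_sketch
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Pretend only the first two ranks were assigned gene families,
  // analogous to running with more MPI ranks than valid families.
  const int familyCount = 2;
  const bool hasLocalFamilies = rank < familyCount;

  if (hasLocalFamilies) {
    // Only this subset of ranks takes the "save checkpoint" path and reaches
    // the barrier; the remaining ranks never call it, so the ranks waiting
    // here block forever.
    std::printf("rank %d: serializing checkpoint, entering barrier\n", rank);
    MPI_Barrier(MPI_COMM_WORLD); // hangs whenever size > familyCount
  } else {
    std::printf("rank %d: no local families, skipping serialization\n", rank);
  }

  MPI_Finalize();
  return 0;
}
```

If that is indeed the mechanism, removing the two barriers works simply because it makes the sequence of collective calls identical on every rank again, and the cleaner alternative would be to guarantee that every rank, with or without local families, reaches the same barriers and reductions.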

Thank you!

Best, Stefan

BenoitMorel commented 3 months ago

Hi Stefan, I just accepted your changes, and I hope I have fixed the MPI issues. Thanks a lot for your help, and sorry for the inconvenience. Most of the recent issues were due to me rushing during the last month to add one last feature requested by some collaborators. It was not a good idea to do this just before leaving... I hope everything works better now. Benoit


StefanFlaumberg commented 3 months ago

Hi Benoit,

Thank you for the fixes! The temporary checkpoint disabling from the commit ab7bc8a did its job -- now everything runs fine in per-family-rates mode.

The new features are great, thank you for implementing them back then! I think they are worth the time spent on debugging :)

StefanFlaumberg commented 3 months ago

Unfortunately, I have to point out that the commit c013727 doesn't solve the freezing issue when the number of MPI ranks is greater than the number of valid families. However, changing the number of fake parameters from 3 to _optimizationClasses.getFreeParameters() at line 383 of AleEvaluator.cpp does solve the problem.
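My reading of why the exact number of fake parameters matters (I have not traced the optimizer in detail, so please treat this as a guess): the gradient optimization seems to evaluate the likelihood once per free parameter, and each evaluation ends in a global reduction, so the ranks without families have to join exactly as many reductions as the working ranks, i.e. _optimizationClasses.getFreeParameters() of them rather than a hardcoded 3. A minimal sketch of that counting constraint in plain MPI (all names here are illustrative, not the AleRax API):

```cpp
// reduction_count_sketch.cpp -- why idle ranks must mirror one reduction per
// free parameter; the names below are illustrative, not the AleRax API.
#include <mpi.h>
#include <cstdio>

// One likelihood evaluation ends in one global reduction over all ranks
// (the role played by ParallelContext::sumDouble in AleRax).
static double evaluateLikelihood(double localLL) {
  double sumLL = 0.0;
  MPI_Allreduce(&localLL, &sumLL, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return sumLL;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int familyCount = 2;    // fewer valid families than MPI ranks
  const int freeParameters = 5; // stand-in for getFreeParameters()
  const bool hasLocalFamilies = rank < familyCount;

  if (hasLocalFamilies) {
    // The optimizer evaluates the likelihood once per free parameter.
    for (int p = 0; p < freeParameters; ++p) {
      evaluateLikelihood(-1.0);
    }
  } else {
    // Idle ranks must join exactly the same number of reductions. Looping a
    // hardcoded 3 times here would desynchronize the ranks as soon as
    // freeParameters != 3, and the next collective call would hang.
    for (int p = 0; p < freeParameters; ++p) {
      evaluateLikelihood(0.0);
    }
  }

  if (rank == 0) std::printf("all ranks stayed in sync\n");
  MPI_Finalize();
  return 0;
}
```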

That said, why would one use more MPI ranks than valid families in the first place? It shouldn't speed up anything, while supporting it only complicates the code. Imho, I would rather revert most of the changes from the commit c013727 (the fake parameters and all those if-conditions) and simply forbid using more ranks than valid families.
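Concretely, the restriction I have in mind is just a hard check right after the families are parsed, roughly along these lines (a sketch only; checkRankCount and validFamilyCount are placeholder names, not the actual AleRax interfaces):

```cpp
// Sketch of the proposed restriction; placeholder names, not the real AleRax API.
#include <mpi.h>
#include <cstdio>

void checkRankCount(unsigned int validFamilyCount) {
  int ranks = 0, rank = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &ranks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (static_cast<unsigned int>(ranks) > validFamilyCount) {
    if (rank == 0) {
      std::fprintf(stderr,
                   "Error: %d MPI ranks but only %u valid families; "
                   "please rerun with at most %u ranks.\n",
                   ranks, validFamilyCount, validFamilyCount);
    }
    MPI_Abort(MPI_COMM_WORLD, 1); // terminate all ranks consistently
  }
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  checkRankCount(/*validFamilyCount=*/3); // e.g. 3 valid families parsed from the input
  // ... the normal run would continue here ...
  MPI_Finalize();
  return 0;
}
```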

If you agree with this restrictive solution, I could implement it in a pull request along with some other minor fixes (such as updating the help output for the new options, restricting the concomitant use of some parameters, e.g. global rate parametrization together with origination optimization, etc.), just so as not to waste your time.

Update: Even with a properly set number of fake parameters, superfluous MPI ranks produce a segfault when run with --highways. One more reason to restrict such usage.

BenoitMorel commented 3 months ago

Hi Stefan, that sounds good to me.
