NBISweden / MrBayes

MrBayes is a program for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models. For documentation and downloading the program, please see the home page:
http://NBISweden.github.io/MrBayes/
GNU General Public License v3.0

Checkpointed analyses hang on restart #290

Open DMaddison opened 8 months ago

DMaddison commented 8 months ago

What is the current observed behaviour?

Kip Will and I are independently trying to restart MrBayes (3.2.7) runs from checkpoints, and for both of us the restart is failing. It seems to go fine, but then just hangs (at the same point) with no further output. Although our data matrices are independent, and rather different (Kip’s has over 600 taxa and about 5000 nucleotides; mine has 46 taxa and 1 million nucleotides), we are doing similar fossilized-birth-death analyses.

In my case, I ran a bit over 21 million generations on a 20-core Apple M1 Ultra Mac Studio computer with 128GB RAM, asking for the 16 high-performance cores to be used under MPI, using “mpirun -np 16 mb”. I used the ARM version of MrBayes (3.2.7a). Kip is doing his analysis on a Linux box with Intel Xeon chips with a total of 16 hyperthreaded cores (thus 32 apparent cores), and started his run using "nohup mpirun -np 32 mb [filename.nex] &".

After having done an initial run, we then wanted to continue the MCMC analysis from that point but using different swapfreq and temp settings. I added “append=yes” and the new swapfreq and temp settings to the mcmc command in the MrBayes block of the NEXUS file (Kip added them to the mcmcp command), and asked for the file to be executed again.

All seemed good after invoking the mcmc command. MrBayes chugs through the NEXUS file, gets into the MrBayes block, eventually copies the .p and .t files, making .p~ and .t~ copies, hums along, and then it just stops doing anything apparent. Here is the last bit of the log:

   Exiting mrbayes block
   Reached end of file
   Returning execution to calling file ...
      Using samples up to generation 21982000 from previous analysis.

      Initial log likelihoods and log prior probs for run 1:
         Chain 1 -- -7027326.220655 -- nan

      There are 15 more chains on other processor(s)
      Using a relative burnin of 25.0 % for diagnostics
      Chain results (continued from previous run; 1000000000 generations requested):

Outputting the "Chain results" line is thus the last apparent thing it does; after that point nothing more happens. I've left my machine going for 48 hours, and no files are written, nothing more comes to the log, etc. In my case, the memory usage per core goes up to about 4.5GB each, but the total memory used by all 16 mb processes is far less than the available 128GB, and there is a lot of unused memory. All 16 cores are showing as active, but with mpirun that is how it appears even if the mb executable isn’t actually doing an analysis at all.

There is one hint in my runs of something amiss. As the MrBayes block is being read in, it of course sends to the log information about what is going on. At one point, it says this:

      Setting number of generations to 1000000000
      Using relative burnin (a fraction of samples discarded).
      Setting burnin fraction to 0.25
      Setting print frequency to 1000
      Setting sample frequency to 1000
      WARNING: Reallocation of zero size attempted. This is probably a bug. Problems may follow.
      WARNING: Reallocation of zero size attempted. This is probably a bug. Problems may follow.
      WARNING: Reallocation of zero size attempted. This is probably a bug. Problems may follow.
      Setting number of runs to 4
      Setting number of chains to 4

Not sure if that warning is important, but that is the only hint of problems.
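For anyone triaging a saved log from a restart like this, here is a minimal Python sketch that lists the lines worth a second look. It assumes the screen output was captured to a text file; the function name `find_log_hints` is hypothetical and not part of MrBayes. It flags `WARNING` lines such as the ones above, and also any line containing `nan`, since the restart log quoted earlier reports `nan` as the log prior for Chain 1. Whether either is related to the hang is not established here.

```python
import re


def find_log_hints(log_text):
    """Return (line_number, stripped_line) pairs for suspicious log lines.

    A line is flagged if it contains "WARNING" or the standalone token "nan".
    This is a plain text scan; it knows nothing about MrBayes internals.
    """
    hints = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if "WARNING" in line or re.search(r"\bnan\b", line, re.IGNORECASE):
            hints.append((number, line.strip()))
    return hints


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python find_hints.py restart.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, line in find_log_hints(handle.read()):
            print(f"{number}: {line}")
```

The scan is deliberately simple so it can be run on any captured output without modification.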

In case it is relevant, here are some of the commands used to set up the restarted run (Kip's are likely fairly similar):

    prset brlenspr=clock:fossilization; 
    prset samplestrat=diversity;

    prset clockvarpr=mixed;  
    prset clockratepr=normal(0.001,0.02);  

    prset topologypr=constraints([constraints listed here]);
    prset nodeagepr=calibrated;

    mcmcp ngen= 1000000000 relburnin=yes burninfrac=0.25 printfreq=1000  samplefreq=1000 nruns=4 nchains=4 savebrlens=yes;
    mcmc Swapfreq=7 Temp=0.03 append=yes ;
    sumt;

How may we reproduce this bug?

I can supply my files as needed, and I suspect Kip can too.

Would you be able to compile and run MrBayes to test fixes to this bug?

Yes

What is the environment that you run MrBayes in?

Two different environments.

My environment:

Kip's environment:

DMaddison commented 8 months ago

One thing I forgot to mention. We had to modify the .ckp files in order for MrBayes to accept them. In particular, in the trees block the start of each tree command looks like this:

    tree mcmc.tree_1 [&B MixedBrlens 1] = [&R] 
    tree mcmc.tree_2 [&B MixedBrlens 1] = [&R] 
    tree mcmc.tree_3 [&B MixedBrlens 1] = [&R] 
    tree mcmc.tree_4 [&B MixedBrlens 1] = [&R] 
    tree mcmc.tree_5 [&B MixedBrlens 0] = [&R] 
    tree mcmc.tree_6 [&B MixedBrlens 1] = [&R] 
    tree mcmc.tree_7 [&B MixedBrlens 0] = [&R] 
    tree mcmc.tree_8 [&B MixedBrlens 1] = [&R] 
    tree mcmc.tree_9 [&B MixedBrlens 0] = [&R] 
    tree mcmc.tree_10 [&B MixedBrlens 1] = [&R]
    tree mcmc.tree_11 [&B MixedBrlens 1] = [&R]
    tree mcmc.tree_12 [&B MixedBrlens 1] = [&R]
    tree mcmc.tree_13 [&B MixedBrlens 1] = [&R]
    tree mcmc.tree_14 [&B MixedBrlens 1] = [&R]
    tree mcmc.tree_15 [&B MixedBrlens 1] = [&R]
    tree mcmc.tree_16 [&B MixedBrlens 1] = [&R]

MrBayes chokes on the number, 0 or 1, after "MixedBrlens". If the numbers are left in, you get an error message and MrBayes stops processing the file. If you remove the 0 or 1, then it appears to accept those lines.
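The manual edit described above can be sketched in Python as follows. This is only an illustration of the workaround, not a confirmed fix: the function names are hypothetical, the pattern assumes the comment always has the form `[&B MixedBrlens 0]` or `[&B MixedBrlens 1]` as in the excerpt, and it is an assumption that rewriting it as `[&B MixedBrlens]` matches what was done by hand. It is also unknown whether dropping the number changes how the restarted run behaves.

```python
import re

# Matches "[&B MixedBrlens 0]" or "[&B MixedBrlens 1]" and captures
# everything up to (but not including) the trailing number.
_MIXED_FLAG = re.compile(r"(\[&B\s+MixedBrlens)\s+[01]\s*\]")


def strip_mixedbrlens_flag(line):
    """Remove the 0/1 after MixedBrlens on a single line, if present."""
    return _MIXED_FLAG.sub(r"\1]", line)


def strip_checkpoint_text(text):
    """Apply the same edit to every line, preserving line endings."""
    return "".join(
        strip_mixedbrlens_flag(line) for line in text.splitlines(keepends=True)
    )


if __name__ == "__main__":
    import shutil
    import sys

    # Usage (hypothetical file name): python strip_flag.py mcmc.ckp
    path = sys.argv[1]
    shutil.copyfile(path, path + ".orig")  # keep an untouched copy first
    with open(path, encoding="utf-8") as handle:
        cleaned = strip_checkpoint_text(handle.read())
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(cleaned)
```

Keeping the `.orig` copy makes it easy to compare the edited checkpoint with what MrBayes originally wrote.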