amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0

OOM Errors on large bootstrapping job #116

Closed: brantfaircloth closed this issue 3 years ago

brantfaircloth commented 3 years ago

Good afternoon,

I've run into an odd problem that I don't quite understand. I have used v1.0.1 (self-compiled) to infer ML trees from a large alignment (2760 taxa, 1 partition and 3,137,386 patterns) with MPI (and AVX), distributing the job across 800 cores (on 40 nodes). Each node has 64 GB RAM (2,560 GB total), which exceeds the estimated requirement of ~2,113 GB. This job ran reasonably well.

However, I've moved on to bootstrapping and keep hitting an OOM error at the start of the job. I can't see what is causing the problem: since the ML tree search ran reasonably well, I would expect bootstrapping to do the same.

I ran raxml-ng with:

mpiexec -np 1000 -machinefile $PBS_NODEFILE /project/brant/shared/bin/raxml-ng-mpi-1.0.1 \
    --msa 01_2021_2760_uces.phylip.raxml.rba \
    --seed $SEED \
    --bootstrap

And the job eventually dies. The last thing the log reports is that raxml-ng is in the process of:

[00:29:36] Data distribution: max. partitions/sites/weight per thread: 1 / 3138 / 50208
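The figures in this log line are consistent with the alignment patterns being split evenly across the 1000 MPI ranks requested with `-np 1000`. The sketch below checks that arithmetic. The even-split rule is an assumption made for illustration, not a description of raxml-ng's actual load balancer, and the function name is hypothetical.

```python
def max_sites_per_rank(n_patterns: int, n_ranks: int) -> int:
    """Largest share of patterns any rank receives under an even split
    (ceiling division). ASSUMPTION: an even split; raxml-ng's real
    load-balancing rule is not described in this thread."""
    return (n_patterns + n_ranks - 1) // n_ranks


patterns = 3_137_386  # patterns in the alignment, from the report
ranks = 1000          # from `mpiexec -np 1000`

sites = max_sites_per_rank(patterns, ranks)
print(sites)       # 3138, the "sites" value in the log line
print(sites * 16)  # 50208, the "weight" value in the log line
```

The reported weight equals 16 times the per-rank site count. A factor of 16 would fit 4 DNA states times 4 rate categories, but that reading is an assumption and is not confirmed anywhere in this thread.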

The HPC system reports that nodes are hitting an OOM manager error, which kills the job, although no further diagnostic information is provided. To try to counter that by giving the job more RAM, I increased the number of nodes to 50 (3,200 GB RAM). That led to the same error.
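The RAM totals quoted in this report can be reproduced with a little arithmetic. The sketch below also computes how much room would be left on each node, assuming the ~2,113 GB estimate is spread evenly across nodes. That even spread is an assumption, and the function names are hypothetical. Because the OOM killer acts per node, it is the per-node headroom, not the cluster-wide total, that any memory outside the estimate has to fit into.

```python
NODE_RAM_GB = 64     # RAM per node, from the report
ESTIMATE_GB = 2113   # estimated memory requirement, from the report


def total_ram_gb(nodes: int, ram_per_node_gb: int = NODE_RAM_GB) -> int:
    """Aggregate RAM across all nodes."""
    return nodes * ram_per_node_gb


def headroom_per_node_gb(nodes: int) -> float:
    """RAM left on each node after its share of the estimate.
    ASSUMPTION: the estimate is distributed evenly across nodes."""
    return NODE_RAM_GB - ESTIMATE_GB / nodes


for nodes in (40, 50):
    print(nodes, total_ram_gb(nodes), round(headroom_per_node_gb(nodes), 1))
# 40 nodes: 2560 GB total, about 11.2 GB spare per node
# 50 nodes: 3200 GB total, about 21.7 GB spare per node
```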

So, what I'm wondering is whether (1) what I am doing is silly and I should really just generate the BS trees and then bootstrap over those, as two separate steps, and/or (2) there is something odd going on in the bootstrapping code that is causing the error (as above, I would expect bootstrapping "just to work" like it usually does... but this is a pretty big alignment and I can see where things might get weird).

Any thoughts/suggestions appreciated. In the meantime, I'll go ahead and test option (1).

amkozlov commented 3 years ago

Hello Brant,

I think the problem here is the site weight vectors for bootstrap replicates. Usually their size is negligible compared to the CLV vectors, but not in your case, with a very large alignment parallelized across hundreds of MPI ranks.
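A rough back-of-the-envelope model shows how weight vectors could become significant at this scale. It rests on several assumptions that this thread does not confirm: that every MPI rank holds a full-length weight vector (one entry per pattern) for every replicate, that each weight takes 4 bytes, and that a run uses 1000 replicates. The function name and all of its parameters are hypothetical. The 20 ranks per node follows from the 800 cores on 40 nodes mentioned in the report.

```python
def weight_vectors_gb_per_node(
    n_patterns: int,
    n_replicates: int,
    ranks_per_node: int,
    bytes_per_weight: int = 4,  # ASSUMPTION: 32-bit weights
) -> float:
    """Memory (GB, 1e9 bytes) taken by bootstrap weight vectors on one node.
    ASSUMPTION: each rank stores a full-length vector per replicate;
    raxml-ng's actual memory layout may differ."""
    total_bytes = n_patterns * n_replicates * bytes_per_weight * ranks_per_node
    return total_bytes / 1e9


patterns = 3_137_386  # from the report

# Hypothetical: 1000 replicates, 20 MPI ranks on each node
print(weight_vectors_gb_per_node(patterns, 1000, 20))

# Hypothetical: 10 replicates, a single MPI rank on each node
print(weight_vectors_gb_per_node(patterns, 10, 1))
```

Under these assumptions the first case comes to roughly 251 GB per node, far above the 64 GB available, while the second is about 0.13 GB. The model scales linearly with both the replicate count and the number of ranks per node, so lowering either one reduces the footprint proportionally.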

There are two things that should help (you might have to use both):

brantfaircloth commented 3 years ago

Hi Alexey,

I hope you are doing well! Thanks for the pointers - I'll give those a shot and report back 👍.

-b

brantfaircloth commented 3 years ago

Hi Alexey,

That seems to have done the trick - I'm testing with a batch of 10 bootstrap trees and one MPI rank per node (across 50 nodes).

Thank you! -b

amkozlov commented 3 years ago

perfect, thanks for the confirmation!