mpirun Segmentation fault occurred on phylogenomic datasets

bayesiancook / pbmpi

phylobayes mpi

GNU General Public License v2.0

mpirun Segmentation fault occurred on phylogenomic datasets #26

Closed xuliuouc closed 9 months ago

xuliuouc commented 2 years ago

Hi, I am trying to run pb_mpi on a phylogenomic dataset, with 4 chains running at the same time, using the following commands:

mpirun -np 10 pb_mpi -d matrix2.phy -cat -gtr chain1
mpirun -np 10 pb_mpi -d matrix2.phy -cat -gtr chain2
mpirun -np 10 pb_mpi -d matrix2.phy -cat -gtr chain3
mpirun -np 10 pb_mpi -d matrix2.phy -cat -gtr chain4

But then I got an error like this:

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node R730 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

If I run one chain at a time, the program works fine without any errors. pb_mpi also worked well on mitogenome datasets. Can someone please tell me what might be going wrong and how I can fix this? Thanks in advance~

bayesiancook commented 2 years ago

Hi,

which version are you using ? 1.9alpha or earlier ?
what is the size of the dataset (number of positions + number of taxa)?
are you running the 4 chains in direct mode or sending them to a queue (e.g. sbatch) ? if on a queue, then with 4 different scripts ?

best,

Nicolas

xuliuouc commented 2 years ago

Hi,

Thank you for your quick reply. I appreciate it. Here are the answers to the questions.

version: pb_mpi 1.8c
dataset size: 30 taxa, 396984 sites
I am running the 4 chains in direct mode.

And if you need more information, please let me know. Thanks again. ^.^

Best, Xu

bayesiancook commented 2 years ago

I guess it is a memory allocation problem: 400 000 aligned positions is a bit large, given the memory requirements of phylobayes. I don't think you will get convergence on such a big dataset. I usually don't go much beyond 100 000 positions (in the manual, I say up to 50 000).

You could try jackknifing your data (see manual and references in it).
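As a rough illustration of what jackknifing the matrix could look like, here is a minimal Python sketch that draws a random subset of alignment columns without replacement. It is not taken from the phylobayes manual. It assumes the alignment is in sequential PHYLIP format (a header line, then one `name sequence` line per taxon), and the function names `read_phylip`, `jackknife_sites`, and `write_phylip` are hypothetical. It samples individual sites; a gene-wise jackknife would need the gene boundaries, which are not given in this thread.

```python
import random


def read_phylip(text):
    """Parse sequential PHYLIP text (assumed layout): 'ntaxa nsites' header,
    then one 'name sequence' line per taxon."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntaxa, nsites = (int(x) for x in lines[0].split()[:2])
    seqs = {}
    for ln in lines[1:1 + ntaxa]:
        name, seq = ln.split(None, 1)
        seqs[name] = "".join(seq.split())  # drop any whitespace inside the sequence
    if len(seqs) != ntaxa or any(len(s) != nsites for s in seqs.values()):
        raise ValueError("alignment does not match its header")
    return seqs


def jackknife_sites(seqs, n_keep, seed):
    """Return a replicate with n_keep columns sampled without replacement.
    The same columns are kept for every taxon, in their original order."""
    nsites = len(next(iter(seqs.values())))
    rng = random.Random(seed)  # a fixed seed makes the replicate reproducible
    cols = sorted(rng.sample(range(nsites), n_keep))
    return {name: "".join(seq[i] for i in cols) for name, seq in seqs.items()}


def write_phylip(seqs):
    """Serialize the alignment back to sequential PHYLIP text."""
    nsites = len(next(iter(seqs.values())))
    rows = [f"{len(seqs)} {nsites}"]
    rows += [f"{name}  {seq}" for name, seq in seqs.items()]
    return "\n".join(rows) + "\n"
```

To use it, read `matrix2.phy` as text, pass it through `read_phylip`, call `jackknife_sites` with, for example, `n_keep=50000` and a different seed per replicate, and write each result to its own file with `write_phylip`. Each replicate can then be run with the same `pb_mpi` command as above.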

best

Nicolas

xuliuouc commented 2 years ago

Hi,

I had actually read the manual before, and I did notice the suggested data size, but I still wanted to give it a try. I thought it wouldn't be a problem since we have 1.5 TB of memory on our machine. Thanks for pointing that out; I will try to jackknife my matrix first, as you suggested. Thanks again for your kind suggestions. ^.^

Best, Xu

bayesiancook commented 9 months ago

closed