Dear Brant,
thanks a lot for your detailed report and for using ParGenes. I don't see anything wrong in your command line, and I couldn't find any explanation in the log files.
Here are some things you can try so that we get more information:
- Go to pargenes_main_repository/tests and run python run_tests.py.
- Run with pargenes-hpc-debug.py. It will run in a safer but slightly slower mode.
- Write --model GTR in a file, and specify this file with -r when running pargenes.
Don't hesitate to allocate fewer cores for these experiments if you want to avoid waiting too long in your cluster queue: the error seems to happen very quickly.
Best, Benoit
Hi Benoit,
Thanks for the very quick response! Just a quick update - the tests passed fine, and I tried pargenes-hpc-debug.py with a smaller set of loci: that ran just fine, too. I'm checking with our HPC folks to see if hard limits on file creation may have caused the problem and will report back once I hear from them.
Hi Benoit,
Ok - I think that I tracked down the error. I've attached to this email the alignment causing the issues if you are interested in testing further (uce-7550.fasta.txt).
Basically, it looks like the alignment kicks off a floating point error during the modeltest step, and I think that this is somehow due to the number of identical sequences within the alignment. To test this, I first ran the problematic alignment through the version of modeltest that was compiled as part of pargenes with the following command:
/project/brant/home/src/pargenes-mpi/modeltest/bin/modeltest-ng -i uce-7550.fasta -t mp -o test-out-uce-7550 --verbose
and received the floating point error - reported as:
[dbg] Building parameters and computing initial lk score
[dbg] Initial score: -2346.89
[dbg] Initial log likelihood: -2346.89
[dbg] final parameter optimization: -2346.89
[dbg] fix branches -1623.1
[dbg] optimize BranchLengths: -1623.1
[dbg] optimize P-inv: -1621.76
[dbg] optimize FixedFrequencies: -1621.76
[dbg] optimize Alpha: -1621.74
[dbg] optimize SubstRates: -1621.74
[dbg] fix branches -1613.49
[dbg] optimize BranchLengths: -1613.49
[dbg] optimize P-inv: -1613.3
[dbg] optimize FixedFrequencies: -1613.3
[dbg] optimize Alpha: -1613.3
[dbg] optimize SubstRates: -1613.3
[dbg] fix branches -1613.29
[dbg] optimize BranchLengths: -1613.29
[dbg] optimize P-inv: -1613.29
[dbg] optimize FixedFrequencies: -1613.29
[dbg] optimize Alpha: -1613.29
[dbg] optimize SubstRates: -1613.29
[dbg] model done: [0.01/0.01]: -1613.29
[dbg] Model optimization done: -1613.29
Floating point exception
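For anyone trying to find which locus triggers a crash like the one above, here is a minimal Python sketch that runs one command per alignment and reports the ones killed by a signal. The function names are hypothetical, it assumes a POSIX system and that modeltest-ng is on the PATH, and the flags are simply copied from the command quoted above.

```python
import signal
import subprocess
from pathlib import Path


def find_crashing_alignments(alignments, build_command):
    """Run one command per alignment; return (path, signal name) for each
    run that was terminated by a signal."""
    crashed = []
    for path in alignments:
        result = subprocess.run(build_command(path), capture_output=True, text=True)
        # On POSIX, a return code of -N means the child was killed by signal N
        # (a "Floating point exception" corresponds to SIGFPE).
        if result.returncode < 0:
            crashed.append((path, signal.Signals(-result.returncode).name))
    return crashed


def modeltest_command(path):
    """Build the modeltest-ng call used in this thread (hypothetical helper;
    assumes the binary is on PATH)."""
    out_prefix = f"test-out-{Path(path).stem}"
    return ["modeltest-ng", "-i", str(path), "-t", "mp", "-o", out_prefix, "--verbose"]
```

Calling find_crashing_alignments(list_of_fasta_paths, modeltest_command) would then return only the loci whose run died from a signal, which is quicker than reading through the logs of a large MPI job.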
Then, to see if the "identical" sequences are problematic, I decided to remove those by running the alignments through RAxML to generate a *.reduced alignment. I then input that alignment (uce-7550.fasta.reduced.txt) to modeltest, which ran as expected and did not have the floating point error.
So, it seems like some alignments with a large proportion of "identical" sequences (i.e., sequences w/ no variation with respect to each other) cause modeltest to die, and removing those sequences allows modeltest to proceed. It seems also like this error (at this locus and others) is the one that's killing some of these larger jobs.
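To get a rough idea of which alignments are affected before running anything, a small Python sketch like the one below can count exactly identical sequences in a FASTA alignment. The function names are hypothetical, and plain string equality is an assumption: RAxML's own check for identical sequences may treat gaps and ambiguity codes differently.

```python
from collections import defaultdict


def read_fasta(text):
    """Parse FASTA text into a dict of name -> sequence (multi-line records are joined)."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Upper-case so that "acgt" and "ACGT" count as the same sequence.
    return {n: "".join(parts).upper() for n, parts in records.items()}


def identical_groups(records):
    """Return lists of names whose aligned sequences are exactly equal."""
    by_sequence = defaultdict(list)
    for name, sequence in records.items():
        by_sequence[sequence].append(name)
    return [names for names in by_sequence.values() if len(names) > 1]


def duplicate_fraction(records):
    """Fraction of sequences that are redundant copies of another sequence."""
    if not records:
        return 0.0
    return 1 - len(set(records.values())) / len(records)
```

Alignments with a high duplicate_fraction would be the first candidates to run through the reduction step.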
I'm about to run all of the loci input to pargenes through RAxML in order to produce *.reduced files where needed and input the total set of alignments to pargenes. I will let you know how that run goes.
Quick update: that seems to have fixed the error - I just ran >4300 alignments through pargenes in ~12 minutes. I think I may have run into another small error that I will start a different issue for.
So, long story short, I run my alignments through raxml to generate reduced versions where needed, merge reduced versions with regular versions, and all is well. If anyone runs into this error in the meantime, the (application specific) code that I wrote to run the raxml reduction is here: https://github.com/faircloth-lab/phyluce/blob/master/bin/align/phyluce_align_reduce_alignments_with_raxml
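The "merge reduced versions with regular versions" step can be sketched roughly as follows in Python. This is not the phyluce code linked above, only an illustration. The function name and directory layout are hypothetical, and it assumes each reduced file is named after its input with a .reduced suffix, as in the file names mentioned in this thread. It only copies files and does not convert between alignment formats.

```python
import shutil
from pathlib import Path


def merge_reduced(original_dir, reduced_dir, merged_dir):
    """Copy every alignment into merged_dir, preferring its '.reduced'
    version when one exists. Returns the names that used a reduced file."""
    original_dir, reduced_dir, merged_dir = (
        Path(original_dir), Path(reduced_dir), Path(merged_dir)
    )
    merged_dir.mkdir(parents=True, exist_ok=True)
    used_reduced = []
    for alignment in sorted(p for p in original_dir.iterdir() if p.is_file()):
        # Assumed naming: <input name>.reduced, written only when sequences were removed.
        reduced = reduced_dir / (alignment.name + ".reduced")
        has_reduced = reduced.exists()
        source = reduced if has_reduced else alignment
        if has_reduced:
            used_reduced.append(alignment.name)
        # Keep the original file name so downstream locus names stay the same.
        shutil.copyfile(source, merged_dir / alignment.name)
    return used_reduced
```

The merged directory can then be given to pargenes as the full set of alignments, with unchanged loci passed through as they were.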
Hi Brant,
thanks a lot for the investigation!
I debugged modeltest with your alignment. You ran into the following issue: https://github.com/ddarriba/modeltest/issues/23. It was fixed in modeltest, but I forgot to update it in the ParGenes fork. Now it should be fine.
If you want to get the fix, you can either clone the project again or run ./gitpull.sh (git pull is not enough to update the modeltest submodule).
Benoit
Hi Benoit,
Thanks for your work on ParGenes, modeltest-ng, and raxml-ng. I've run into a weird issue that I can't quite figure out when I've been using Pargenes in MPI mode across a large number of cores - basically, the ModelTest step is failing with an unspecified error that is not made clear to me by digging around in the log files or the output files.
I've attached the pargenes_logs.txt, report.txt, and logs.txt files below. I compiled pargenes with a few modules loaded (listed immediately below) and received no errors. checker.sh reports all is well and the non-mpi version of the code seems to run as expected (in both --dry-run and regular modes).
I ran the job with 512 cores using the following submission script:
Also, pargenes seems to be parallelizing fine w/ IMPI - the initial steps of the analysis appear to have run well across 512 cores.
When digging around in the modeltest_run output, I can't find anything diagnostic in either the output files in running_jobs or in the results files, other than the fact that some of the *.out files in results are truncated - possibly due to the MPI run dying before they were fully written.
In addition to the log files I attached, I've also packaged up all of the output from the run, which is available below, as well. Just FYI, within modeltest_run there is an extra directory I made (some-results) that contains only those directories from modeltest_run/results/ for the loci that were being processed when the job died - so that it's easier to look at them.
Please let me know if I can include anything else that might help diagnose the issue (which could be operator error on my part).
Thank you very much, -brant
Attached
logs.txt pargenes_logs.txt report.txt
entire output directory