BenoitMorel / ParGenes

A massively parallel tool for model selection and tree inference on thousands of genes
GNU General Public License v3.0

ModelTest step failing with many cores (unspecified error) #56

Closed: brantfaircloth closed this issue 5 years ago

brantfaircloth commented 5 years ago

Hi Benoit,

Thanks for your work on ParGenes, modeltest-ng, and raxml-ng. I've run into a weird issue that I can't quite figure out when running ParGenes in MPI mode across a large number of cores: the ModelTest step fails with an unspecified error, and digging around in the log files and output files doesn't make the cause clear.

I've attached the pargenes_logs.txt, report.txt, and logs.txt files below. I compiled ParGenes with a few modules loaded (listed immediately below) and received no errors. checker.sh reports all is well, and the non-MPI version of the code seems to run as expected (in both --dry-run and regular modes); a rough sketch of those checks follows the module list below.

gcc (GCC) 6.4.0
impi/2018.0.128
cmake/3.7.2/INTEL-18.0.0
INTEL/18.0.0
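
For reference, the sanity checks mentioned above looked roughly like this (a sketch rather than the exact commands; the pargenes.py path simply mirrors the pargenes-hpc.py path used in the submission script below, and the checker.sh location may differ in your checkout):

# run the bundled environment checker from the ParGenes source directory
cd /project/brant/home/src/pargenes-mpi && ./checker.sh

# dry run of the non-MPI entry point with the same inputs
python /project/brant/home/src/pargenes-mpi/pargenes/pargenes.py \
    -a all-loci-fasta \
    -o all-loci-fasta-pargenes-dryrun \
    -d nt \
    -m \
    --dry-run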

I ran the job with 512 cores using the following submission script:

#PBS -A <allocation name>
#PBS -l nodes=32:ppn=16
#PBS -l walltime=12:00:00
#PBS -q checkpt
#PBS -N pargenes

module load impi/2018.0.128

cd $PBS_O_WORKDIR
CORES=512

python /project/brant/home/src/pargenes-mpi/pargenes/pargenes-hpc.py \
    -a all-loci-fasta \
    -o all-loci-fasta-pargenes-run \
    --core-assignment low \
    -d nt \
    -m \
    -c $CORES

Also, ParGenes seems to be parallelizing fine with Intel MPI: the initial steps of the analysis appear to have run well across all 512 cores.

When digging around in the modeltest_run output, I can't find anything diagnostic in either the output files in running_jobs or in the results files, other than the fact that some of the *.out files in results are truncated, possibly because the MPI run died before they were fully written.

In addition to the log files I attached, I've also packaged up the entire output from the run, which is linked below. Just FYI: within modeltest_run there is an extra directory I made (some-results) that contains only the directories from modeltest_run/results/ for the loci that were being processed when the job died, so that they're easier to look at.

Please let me know if I can include anything else that might help diagnose the issue (which could be operator error on my part).

Thank you very much, -brant

Attached: logs.txt, pargenes_logs.txt, report.txt, and the entire output directory.

BenoitMorel commented 5 years ago

Dear Brant,

thanks a lot for your detailed report and for using ParGenes. I don't see anything wrong in your command line, and I couldn't find any explanation in the log files.

Here are a few things you can try so that we can get more information:

Don't hesitate to allocate fewer cores for these experiments if you want to avoid waiting too long in your cluster's queue: the error seems to happen very quickly.

Best, Benoit

brantfaircloth commented 5 years ago

Hi Benoit,

Thanks for the very quick response! Just a quick update: the tests passed fine, and I tried pargenes-hpc-debug.py with a smaller set of loci, which also ran just fine. I'm checking with our HPC folks to see whether file-creation hard limits may have caused the problem and will report back once I hear from them.
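
For anyone hitting something similar, one quick first check (a sketch only; actual file-count quotas on a cluster filesystem are usually enforced by the filesystem itself, so the admins are still the authority) is the per-process open-file limits on the compute nodes:

# soft and hard limits on the number of open file descriptors per process
ulimit -Sn
ulimit -Hn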

brantfaircloth commented 5 years ago

Hi Benoit,

Ok, I think I tracked down the error. I've attached the alignment causing the issues (uce-7550.fasta.txt) in case you want to test further.

Basically, it looks like the alignment triggers a floating point error during the modeltest step, and I think this is somehow due to the number of identical sequences within the alignment. To test this, I first ran the problematic alignment through the version of modeltest-ng that was compiled as part of ParGenes, using the following command:

/project/brant/home/src/pargenes-mpi/modeltest/bin/modeltest-ng -i uce-7550.fasta -t mp -o test-out-uce-7550 --verbose

and received the floating point error - reported as:

[dbg] Building parameters and computing initial lk score
[dbg] Initial score: -2346.89
[dbg] Initial log likelihood: -2346.89
[dbg] final parameter optimization: -2346.89
[dbg] fix branches -1623.1
[dbg] optimize BranchLengths: -1623.1
[dbg] optimize P-inv: -1621.76
[dbg] optimize FixedFrequencies: -1621.76
[dbg] optimize Alpha: -1621.74
[dbg] optimize SubstRates: -1621.74
[dbg] fix branches -1613.49
[dbg] optimize BranchLengths: -1613.49
[dbg] optimize P-inv: -1613.3
[dbg] optimize FixedFrequencies: -1613.3
[dbg] optimize Alpha: -1613.3
[dbg] optimize SubstRates: -1613.3
[dbg] fix branches -1613.29
[dbg] optimize BranchLengths: -1613.29
[dbg] optimize P-inv: -1613.29
[dbg] optimize FixedFrequencies: -1613.29
[dbg] optimize Alpha: -1613.29
[dbg] optimize SubstRates: -1613.29
[dbg] model done: [0.01/0.01]: -1613.29
[dbg] Model optimization done: -1613.29
Floating point exception

Then, to see whether the "identical" sequences were the problem, I removed them by running the alignment through RAxML to generate a *.reduced alignment. I then fed that alignment (uce-7550.fasta.reduced.txt) to modeltest, which ran as expected and did not hit the floating point error.

So, it seems like alignments with a large proportion of "identical" sequences (i.e., sequences with no variation with respect to each other) cause modeltest to die, and removing those sequences allows modeltest to proceed. It also seems like this error (at this locus and others) is what is killing some of these larger jobs.

I'm about to run all of the loci input to pargenes through RAxML in order to produce *.reduced files where needed and input the total set of alignments to pargenes. I will let you know how that run goes.
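
For anyone wanting to do the same, the per-alignment RAxML check looks roughly like this (a sketch assuming a standard RAxML 8 binary named raxmlHPC; RAxML only writes a *.reduced file when it actually finds duplicate sequences or undetermined columns):

# run RAxML's alignment check (-f c) on every fasta; a <name>.reduced file is
# written alongside the input whenever identical sequences are detected
for aln in all-loci-fasta/*.fasta; do
    raxmlHPC -f c -m GTRGAMMA -p 12345 -s "$aln" -n "check_$(basename "$aln")"
done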

brantfaircloth commented 5 years ago

Quick update: that seems to have fixed the error - I just ran >4300 alignments through pargenes in ~12 minutes. I think I may have run into another small error that I will start a different issue for.

So, long story short: I ran my alignments through RAxML to generate reduced versions where needed, merged the reduced versions with the regular versions, and all is well. If anyone runs into this error in the meantime, the (application-specific) code that I wrote to run the RAxML reduction is here: https://github.com/faircloth-lab/phyluce/blob/master/bin/align/phyluce_align_reduce_alignments_with_raxml

BenoitMorel commented 5 years ago

Hi Brant,

thanks a lot for the investigation!

I debugged modeltest with your alignment. You ran into the following issue: https://github.com/ddarriba/modeltest/issues/23. It was fixed in modeltest, but I forgot to update it in the ParGenes fork. It should be fine now.

If you want to get the fix, you can either clone the project again or run ./gitpull.sh (a plain git pull is not enough to update the modeltest submodule).
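
For reference, the manual equivalent is the usual two-step submodule update (a sketch of what gitpull.sh presumably wraps; check the script for the exact behaviour):

# pull the ParGenes repository itself, then bring every submodule
# (including the bundled modeltest) up to the recorded commits
git pull
git submodule update --init --recursive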

Benoit