One thread per physical core? Poor performance compared to RAXML8 on cluster

pierrj commented 2 years ago

I am trying to run raxml-ng on my university's cluster. I compiled raxml-ng as suggested in order to use coarse-grained parallelization. However, I am getting quite poor performance compared to RAXML8 (which was already pre-installed on the cluster). RAXML8 finishes a full run, including bootstrapping, in less than 48 hours, while raxml-ng is still running past 48 hours, even with double the nodes assigned to the job and with only the tree search, no bootstrapping.

I am wondering if this might have something to do with your warning that raxml-ng does not work well with multiple threads per physical core. The nodes I am working with have dual-threaded cores. I reduced the number of threads to 10 (1 per physical core), but I am not sure if this is working properly. In htop I see that the first ten numbered cores are active and the second ten aren't, but it's hard to tell whether that is 5 physical cores doing all of the work or 10.

Do you have any advice for dealing with this or anything to take a look at? Also, please let me know if this would be a question for my cluster support instead!

Here are my submitted jobs for your reference. Please let me know if you need any other info.

raxml-ng

#!/bin/bash
#SBATCH --job-name=raxml_ng_scos_savio1
#SBATCH --partition=savio
#SBATCH --qos=savio_normal
#SBATCH --nodes=20
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --time=72:00:00
#SBATCH --mail-user=pierrj@berkeley.edu
#SBATCH --mail-type=ALL
#SBATCH --output=/global/home/users/pierrj/slurm_stdout/slurm-%j.out
#SBATCH --error=/global/home/users/pierrj/slurm_stderr/slurm-%j.out

/global/scratch/users/pierrj/raxml_ng_savio1/bin/raxml-ng-mpi --parse --msa msa.fasta --model PROTGTR+G --prefix savio1_T1

mpirun /global/scratch/users/pierrj/raxml_ng_savio1/bin/raxml-ng-mpi --msa savio1_T1.raxml.rba --prefix savio1_H3 --threads 10 --extra thread-pin

RAXML8

#!/bin/bash
#SBATCH --job-name=raxml_multinode
#SBATCH --partition=savio
#SBATCH --qos=savio_normal
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH --time=72:00:00
#SBATCH --mail-user=pierrj@berkeley.edu
#SBATCH --mail-type=ALL
#SBATCH --output=/global/home/users/pierrj/slurm_stdout/slurm-%j.out
#SBATCH --error=/global/home/users/pierrj/slurm_stderr/slurm-%j.out

mpirun raxmlHPC-HYBRID-SSE3 -s msa.fasta -n raxmlv8 -m PROTGAMMAGTR -T 10 -f a -x 12345 -p 12345 -# 100

amkozlov commented 2 years ago

Hi Pierre,

could you please post full log files for both raxml-ng and RAxML8 runs?

Based on your description, the thread/core mapping is fine (the first 10 cores in htop usually correspond to physical cores 1-10, and the next 10 are their respective hyperthreading "twins"). The poor performance could instead be due to the PROTGTR model: I vaguely recall fixing a major inefficiency with it a few months ago.
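
The htop numbering described above can also be checked directly on a compute node rather than inferred. The following is a minimal sketch, not part of raxml-ng: it assumes a Linux node where `lscpu` is available, and `count_physical_cores` is a hypothetical helper name introduced here for illustration. It counts how many distinct physical cores a given set of logical CPU numbers maps to.

```shell
#!/bin/bash
# Count the distinct physical cores behind a set of logical CPUs.
# stdin: lines of "CPU,Core,Socket" as printed by `lscpu -p=CPU,CORE,SOCKET`
#        (header lines starting with '#' are skipped).
# args:  the logical CPU numbers to check, e.g. 0 1 2 ...
# count_physical_cores is a hypothetical helper, not a raxml-ng or SLURM tool.
count_physical_cores() {
    local wanted=" $* "
    awk -F, -v wanted="$wanted" '
        /^#/ { next }                                        # skip lscpu header comments
        index(wanted, " " $1 " ") { seen[$3 ":" $2] = 1 }    # key = socket:core
        END { n = 0; for (k in seen) n++; print n }
    '
}

# Usage on a node: do logical CPUs 0-9 sit on ten different physical cores?
if command -v lscpu >/dev/null 2>&1; then
    lscpu -p=CPU,CORE,SOCKET | count_physical_cores 0 1 2 3 4 5 6 7 8 9
fi
```

If the printed number equals the number of logical CPUs passed in, each of those CPUs is on its own physical core; a smaller number means some of them are hyperthreading siblings sharing a core.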

pierrj commented 2 years ago

Ah I see, thanks for the response!

I will wait until both runs finish and post full run times and log files.

amkozlov commented 2 years ago

OK, you can also try changing the model to PROTGTR+G+F (= empirical AA frequencies), since IIRC this would be the exact equivalent of PROTGAMMAGTR in RAxML8.

pierrj commented 2 years ago

It looks like I misunderstood some of the outputs. Now it doesn't look like there is much of a difference in performance actually. Closing the issue for now. Thank you for your help!

amkozlov / raxml-ng

One thread per physical core? Poor performance compared to RAXML8 on cluster #139