Strange running time - Githubissues

yzongzjnu commented 5 years ago

Hi All, I encountered a running issue when I used raxml-ng. I have several fasta alignment files which include almost identical base pairs (~100 Kb). I ran raxml-ng using simple command --search1 with 40 threads and 40 processors. I also set an extra option --extra thread-pin. Interestingly, tree estimation for near half of the data was done in 30 minutes, but the others were still running after 10 hours. The comand for those slow running and fast running are the same. I am very curious what happened. May I have your suggestions? Than you!

slow running log: spr subtree cutoff: 1.000000 branch lengths: proportional (ML estimate, algorithm: NR-FAST) SIMD kernels: AVX2 parallelization: PTHREADS (40 threads), thread pinning: ON [00:00:00] Reading alignment from file: Fvb1-1.fa [00:00:00] Loaded alignment with 131 taxa and 100000 sites Alignment comprises 1 partitions and 99810 patterns Partition 0: noname Model: GTR+FO+G4m Alignment sites / patterns: 100000 / 99810 Gaps: 12.70 % Invariant sites: 0.00 % NOTE: Binary MSA file created: Fvb1-1.raxml.rba [00:00:01] Generating 1 random starting tree(s) with 131 taxa [00:00:01] Data distribution: max. partitions/sites/weight per thread: 1 / 2496 / 39936 Starting ML tree search with 1 distinct starting trees [00:00:02 -2446706.034157] Initial branch length optimization [00:03:45 -2150643.568414] Model parameter optimization (eps = 10.000000) [00:10:39 -2026472.931685] AUTODETECT spr round 1 (radius: 5) [00:22:14 -1600179.625765] AUTODETECT spr round 2 (radius: 10) [00:36:20 -1423146.781045] AUTODETECT spr round 3 (radius: 15) [00:48:36 -1423146.634464] AUTODETECT spr round 4 (radius: 20) [00:59:15 -1423146.627833] SPR radius for FAST iterations: 15 (autodetect) [00:59:15 -1423146.627833] Model parameter optimization (eps = 3.000000) [01:04:18 -1416462.991150] FAST spr round 1 (radius: 15) [02:20:14 -1396861.542236] FAST spr round 2 (radius: 15) [03:06:56 -1395900.321824] FAST spr round 3 (radius: 15) [03:43:06 -1395793.704248] FAST spr round 4 (radius: 15) [04:18:01 -1395725.500462] FAST spr round 5 (radius: 15) [04:55:04 -1395586.531471] FAST spr round 6 (radius: 15) [05:31:17 -1395560.367322] FAST spr round 7 (radius: 15) [06:06:59 -1395534.600533] FAST spr round 8 (radius: 15) [06:42:55 -1395509.990334] FAST spr round 9 (radius: 15) [07:19:11 -1395490.329548] FAST spr round 10 (radius: 15) [07:56:17 -1395455.625708] FAST spr round 11 (radius: 15) [08:33:14 -1395440.477801] FAST spr round 12 (radius: 15) [09:09:38 -1395434.647867] FAST spr round 13 (radius: 15) [09:42:14 -1395412.748470] FAST spr round 14 (radius: 15) [10:20:02 -1395409.838896] FAST spr round 15 (radius: 15)

fast running log Analysis options: run mode: ML tree search start tree(s): random (1) random seed: 2 tip-inner: OFF pattern compression: ON per-rate scalers: OFF site repeats: ON fast spr radius: AUTO spr subtree cutoff: 1.000000 branch lengths: proportional (ML estimate, algorithm: NR-FAST) SIMD kernels: AVX parallelization: PTHREADS (40 threads), thread pinning: ON [00:00:00] Reading alignment from file: Fvb7-3.fa [00:00:00] Loaded alignment with 131 taxa and 100000 sites Alignment comprises 1 partitions and 99816 patterns Partition 0: noname Model: GTR+FO+G4m Alignment sites / patterns: 100000 / 99816 Gaps: 10.62 % Invariant sites: 0.00 % NOTE: Binary MSA file created: Fvb7-3.raxml.rba [00:00:01] Generating 1 random starting tree(s) with 131 taxa [00:00:01] Data distribution: max. partitions/sites/weight per thread: 1 / 2496 / 39936 Starting ML tree search with 1 distinct starting trees [00:00:02 -2470541.457341] Initial branch length optimization [00:00:04 -2169333.334185] Model parameter optimization (eps = 10.000000) [00:00:28 -2047481.503946] AUTODETECT spr round 1 (radius: 5) [00:00:53 -1642079.396673] AUTODETECT spr round 2 (radius: 10) [00:01:32 -1485871.849326] AUTODETECT spr round 3 (radius: 15) [00:02:13 -1482075.759808] AUTODETECT spr round 4 (radius: 20) [00:02:49 -1482075.758238] SPR radius for FAST iterations: 15 (autodetect) [00:02:49 -1482075.758238] Model parameter optimization (eps = 3.000000) [00:03:18 -1477386.632791] FAST spr round 1 (radius: 15) [00:04:22 -1441746.677912] FAST spr round 2 (radius: 15) [00:05:09 -1440681.554901] FAST spr round 3 (radius: 15) [00:05:50 -1440573.628348] FAST spr round 4 (radius: 15) [00:06:29 -1440486.524627] FAST spr round 5 (radius: 15) [00:07:08 -1440424.465874] FAST spr round 6 (radius: 15) [00:07:51 -1440294.294433] FAST spr round 7 (radius: 15) [00:08:30 -1440264.243906] FAST spr round 8 (radius: 15) [00:09:15 -1440223.862937] FAST spr round 9 (radius: 15) [00:09:58 -1440208.526986] FAST spr round 10 (radius: 15) [00:10:41 -1440194.092240] FAST spr round 11 (radius: 15) [00:11:23 -1440184.721934] FAST spr round 12 (radius: 15) [00:12:02 -1440175.560527] FAST spr round 13 (radius: 15) [00:12:42 -1440150.777243] FAST spr round 14 (radius: 15) [00:13:21 -1440145.101417] FAST spr round 15 (radius: 15) [00:14:00 -1440145.101399] Model parameter optimization (eps = 1.000000) [00:14:07 -1440129.675730] SLOW spr round 1 (radius: 5) [00:16:01 -1439697.362681] SLOW spr round 2 (radius: 5) [00:17:42 -1439605.842012] SLOW spr round 3 (radius: 5) [00:19:18 -1439605.842010] SLOW spr round 4 (radius: 10) [00:20:41 -1439605.842010] SLOW spr round 5 (radius: 15) [00:23:00 -1439605.842010] SLOW spr round 6 (radius: 20) [00:25:03 -1439605.842010] SLOW spr round 7 (radius: 25) [00:26:11 -1439605.842010] Model parameter optimization (eps = 0.100000) [00:26:18] ML tree search #1, logLikelihood: -1439605.395135 Optimized model parameters: Partition 0: noname Rate heterogeneity: GAMMA (4 cats, mean), alpha: 0.579930 (ML), weights&rates: (0.250000,0.048997) (0.250000,0.300930) (0.250000,0.870724) (0.250000,2.779349) Base frequencies (ML): 0.224439 0.272822 0.275963 0.226775 Substitution rates (ML): 0.999981 3.213585 1.620508 0.518613 3.185612 1.000000 Final LogLikelihood: -1439605.395135 AIC score: 2879746.790269 / AICc score: 2879748.235998 / BIC score: 2882296.254294 Free parameters (model + branch lengths): 268 Best ML tree saved to: ancestral_SNP/arbitrary_fragment/Fvb7-3.raxml.bestTree Optimized model saved to: ancestral_SNP/arbitrary_fragment/Fvb7-3.raxml.bestModel

Execution log saved to: /ancestral_SNP/arbitrary_fragment/Fvb7-3.raxml.log Analysis started: 30-Apr-2019 15:04:53 / finished: 30-Apr-2019 15:31:11 Elapsed time: 1578.159 seconds

yzongzjnu commented 5 years ago

I reran the slow running process using identical command. It finished in an hour. The original one was still running. Why?

amkozlov commented 5 years ago

this looks like a problem with your cluster...

are you sure that both jobs were running on the nodes with identical CPUs? based on your output, I'd say no (SIMD kernels: AVX vs AVX2)
are you sure the node for the "slow" job has 40 physical CPU cores and no other job was running on this node simultaneously?
why are you using thread-pin? what happens if you run without thread pinning?

yzongzjnu commented 5 years ago

Thank you for the suggestions. Though jobs were ran on different nodes, I could not imagine a 20× speed difference. The structure and speed of nodes in cluster are similar. I have no background on cluster structure and job assignment principle, as I know, once the nodes were assigned to my job. They can not be used by other users simultaneously. The nodes will disappear from available list of cluster after allocating jobs. Maybe they are virtual CPUs other than physical processors. I used thread-pin following the instruction here https://github.com/amkozlov/raxml-ng/wiki/Parallelization#thread-pinning. It seemed the running will speed up a little without thread-pin.

amkozlov commented 5 years ago

@yzongzjnu the problem is that even if your nodes have similar performance but different number of CPU cores, there might be multiple threads/core and RAxML-NG performance will suffer. If you are unsure about your node configuration, please run RAxML-NG with default parameters (i.e., without --threads and --extra thread-pin options). This will use CPU autodetection and let OS place the threads, which works just fine In most cases.

amkozlov commented 5 years ago

So this seems to be non-optimal configuration and not a bug -> closing. Please re-open if you still experience performance issues.

amkozlov / raxml-ng

Strange running time #65