amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0
374 stars 62 forks source link

terminate called recursively -- Aborted (core dumped) #152

Open 0xaf1f opened 1 year ago

0xaf1f commented 1 year ago

I'm trying to infer a tree using concatenated SNPs and consistently get this core dump when multithreading. I do not have this problem with the original raxmlHPC, but it does not complete within the 2 day maximum walltime of the cluster and doesn't do checkpointing either (as far as i know), so I can't restart it from where it left off.

So I'm trying raxml-ng (latest version 1.1.0):

RAxML-NG v. 1.1.0 released on 29.11.2021 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: Intel(R) Xeon(R) Platinum 8260M CPU @ 2.40GHz, 96 cores, 4031 GB RAM

RAxML-NG was called at 09-Dec-2022 20:16:42 as follows:

raxml-ng --all --msa SNP.parse.raxml.rba --model GTR+G+ASC_FELS{4087123} --prefix SNP --threads 24 --bs-trees 100

Analysis options:
  run mode: ML tree search + bootstrapping (Felsenstein Bootstrap)
  start tree(s): random (10) + parsimony (10)
  bootstrap replicates: 100
  random seed: 1670635002
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX2
  parallelization: coarse-grained (auto), PTHREADS (24 threads), thread pinning: OFF

WARNING: The model you specified on the command line (GTR+G+ASC_FELS{4087123}) will be ignored
         since the binary MSA file already contains a model definition.
         If you want to change the model, please re-run RAxML-NG
         with the original PHYLIP/FASTA alignment and --redo option.

[00:00:00] Loading binary alignment from file: SNP.parse.raxml.rba
[00:00:00] Alignment comprises 4832 taxa, 1 partitions and 82354 patterns

Partition 0: noname
Model: GTR+FO+G4m+ASC_FELS{4087123}
Alignment sites / patterns: 324409 / 82354
Gaps: 0.00 %
Invariant sites: 0.00 %

Parallelization scheme autoconfig: 3 worker(s) x 8 thread(s)

NOTE: Per-rate scalers were automatically enabled to prevent numerical issues on taxa-rich alignments.
NOTE: You can use --force switch to skip this check and fall back to per-site scalers.

Parallel reduction/worker buffer size: 1 KB  / 0 KB

[00:00:00] Generating 10 random starting tree(s) with 4832 taxa
[00:00:01] Generating 10 parsimony starting tree(s) with 4832 taxa
[02:02:53] Data distribution: max. partitions/sites/weight per thread: 1 / 10295 / 164720
[02:02:55] Data distribution: max. searches per worker: 41

Starting ML tree search with 20 distinct starting trees

terminate called recursively
terminate called recursively
/var/spool/slurm/d/job13600690/slurm_script: line 19: 75715 Aborted                 (core dumped) raxml-ng --all --msa SNP.parse.raxml.rba --model GTR+G+ASC_FELS{$invar_sites} --prefix SNP --threads 24 --bs-trees 100

I have also tried with --threads 1 --workers 1 and it timed out at the 2 day maximum, but the log doesn't indicate that it got anywhere past

Starting ML tree search with 20 distinct starting trees

so even if it worked with 1 thread, it looks like it'd be painstakingly slow.

amkozlov commented 1 year ago

Could you please try to re-run with --workers 1 --log debug and also check the memory usage?

0xaf1f commented 1 year ago

Ok, it ran for a while, then there was an unrelated node failure (a few of my other unrelated jobs on other nodes also failed at the same time, so I think there was a cluster issue at the time). I restarted the run and about a day later raxml-ng fails again with the same error. Here is the second log:


RAxML-NG v. 1.1.0 released on 29.11.2021 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: AMD EPYC 7742 64-Core Processor, 128 cores, 251 GB RAM

RAxML-NG was called at 16-Dec-2022 13:42:14 as follows:

raxml-ng --all --msa SNP.parse.raxml.rba --model GTR+G+ASC_FELS{4087123} --prefix SNP --threads 24 --workers 1 --bs-trees 100 --log debug

Analysis options:
  run mode: ML tree search + bootstrapping (Felsenstein Bootstrap)
  start tree(s): random (10) + parsimony (10)
  bootstrap replicates: 100
  random seed: 1671216134
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX2
  parallelization: PTHREADS (24 threads), thread pinning: OFF

RBA partial loading: OFF
WARNING: The model you specified on the command line (GTR+G+ASC_FELS{4087123}) will be ignored 
         since the binary MSA file already contains a model definition.
         If you want to change the model, please re-run RAxML-NG 
         with the original PHYLIP/FASTA alignment and --redo option.

[00:00:00] Loading binary alignment from file: SNP.parse.raxml.rba
[00:00:02] Alignment comprises 4832 taxa, 1 partitions and 82354 patterns

Partition 0: noname
Model: GTR+FO+G4m+ASC_FELS{4087123}
Alignment sites / patterns: 324409 / 82354
Gaps: 0.00 %
Invariant sites: 0.00 %

|noname|   |GTR+FO+G4m+ASC_FELS{4087123}|   |1-324409|

NOTE: Per-rate scalers were automatically enabled to prevent numerical issues on taxa-rich alignments.
NOTE: You can use --force switch to skip this check and fall back to per-site scalers.

Generating a random starting tree with 4832 taxa
Parallel reduction/worker buffer size: 1 KB  / 0 KB

[00:00:02] NOTE: Resuming execution from checkpoint (logLH: -10351404.46, ML trees: 0, bootstraps: 0)

Initial model parameters:
   Partition: noname
   Rate heterogeneity: GAMMA (4 cats, mean),  alpha: 1.000000 (ML),  weights&rates: (0.250000,0.136954) (0.250000,0.476752) (0.250000,1.000000) (0.250000,2.386294) 
   Base frequencies (ML): 0.250000 0.250000 0.250000 0.250000 
   Substitution rates (ML): 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 

[00:00:02] Data distribution: max. partitions/sites/weight per thread: 1 / 3432 / 54912

thread# part#   start   length  weight
0       0       78936   3418    54688

1       0       75504   3432    54912

2       0       72072   3432    54912

3       0       68640   3432    54912

4       0       65208   3432    54912

5       0       61776   3432    54912

6       0       58344   3432    54912

7       0       54912   3432    54912

8       0       51480   3432    54912

9       0       48048   3432    54912

10      0       44616   3432    54912

11      0       41184   3432    54912

12      0       37752   3432    54912

13      0       34320   3432    54912

14      0       30888   3432    54912

15      0       27456   3432    54912

16      0       24024   3432    54912

17      0       20592   3432    54912

18      0       17160   3432    54912

19      0       13728   3432    54912

20      0       10296   3432    54912

21      0       6864    3432    54912

22      0       3432    3432    54912

23      0       0       3432    54912

Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
[00:00:03] Data distribution: max. searches per worker: 120

BARRIER time: 0.003971 ms

Starting ML tree search with 20 distinct starting trees

[00:00:06 -10351404.457014] SPR radius for FAST iterations: 25 (autodetect)
[00:00:06 -10351404.457014] FAST spr round 6 (radius: 25)
         - after brlen: logLH = -10351387.624913
[00:20:24 -10351387.624913] FAST spr round 7 (radius: 25)
         - after brlen: logLH = -10351373.124242
[00:38:26 -10351373.124242] FAST spr round 8 (radius: 25)
         - after brlen: logLH = -10351362.911341
[00:55:22 -10351362.911341] FAST spr round 9 (radius: 25)
         - after brlen: logLH = -10351356.834441
[01:11:42 -10351356.834441] FAST spr round 10 (radius: 25)
         - after brlen: logLH = -10351351.867060
[01:27:42 -10351351.867060] FAST spr round 11 (radius: 25)
         - after brlen: logLH = -10351347.864267
[01:43:43 -10351347.864267] FAST spr round 12 (radius: 25)
         - after brlen: logLH = -10351344.716538
[01:59:22 -10351344.716538] FAST spr round 13 (radius: 25)
         - after brlen: logLH = -10351341.645818
[02:14:59 -10351341.645818] FAST spr round 14 (radius: 25)
         - after brlen: logLH = -10351338.892258
[02:30:15 -10351338.892258] FAST spr round 15 (radius: 25)
         - after brlen: logLH = -10351335.542908
[02:45:41 -10351335.542908] FAST spr round 16 (radius: 25)
         - after brlen: logLH = -10351333.301802
[03:01:20 -10351333.301802] FAST spr round 17 (radius: 25)
         - after brlen: logLH = -10351331.151827
[03:16:44 -10351331.151827] FAST spr round 18 (radius: 25)
         - after brlen: logLH = -10351329.341439
[03:32:10 -10351329.341439] FAST spr round 19 (radius: 25)
         - after brlen: logLH = -10351327.535476
[03:47:30 -10351327.535476] FAST spr round 20 (radius: 25)
         - after brlen: logLH = -10351322.997378
[04:02:53 -10351322.997378] FAST spr round 21 (radius: 25)
         - after brlen: logLH = -10351313.092002
[04:18:16 -10351313.092002] FAST spr round 22 (radius: 25)
         - after brlen: logLH = -10351311.292615
[04:33:31 -10351311.292615] FAST spr round 23 (radius: 25)
         - after brlen: logLH = -10351309.493888
[04:48:49 -10351309.493888] FAST spr round 24 (radius: 25)
         - after brlen: logLH = -10351307.726005
[05:04:22 -10351307.726005] FAST spr round 25 (radius: 25)
         - after brlen: logLH = -10351306.007903
[05:19:51 -10351306.007903] FAST spr round 26 (radius: 25)
         - after brlen: logLH = -10351304.635382
[05:35:13 -10351304.635382] FAST spr round 27 (radius: 25)
         - after brlen: logLH = -10351299.854038
[05:50:37 -10351299.854038] FAST spr round 28 (radius: 25)
         - after brlen: logLH = -10351298.535978
[06:05:53 -10351298.535978] FAST spr round 29 (radius: 25)
         - after brlen: logLH = -10351297.304712
[06:21:07 -10351297.304712] FAST spr round 30 (radius: 25)
         - after brlen: logLH = -10351296.251681
[06:36:27 -10351296.251681] FAST spr round 31 (radius: 25)
         - after brlen: logLH = -10351292.036071
[06:51:44 -10351292.036071] FAST spr round 32 (radius: 25)
         - after brlen: logLH = -10351291.153234
[07:07:18 -10351291.153234] FAST spr round 33 (radius: 25)
         - after brlen: logLH = -10351290.077628
[07:22:43 -10351290.077628] FAST spr round 34 (radius: 25)
         - after brlen: logLH = -10351289.614201
[07:37:57 -10351289.614201] FAST spr round 35 (radius: 25)
         - after brlen: logLH = -10351279.200180
[07:53:15 -10351279.200180] FAST spr round 36 (radius: 25)
         - after brlen: logLH = -10351276.008304
[08:08:31 -10351276.008304] FAST spr round 37 (radius: 25)
         - after brlen: logLH = -10351275.696520
[08:24:02 -10351275.696520] FAST spr round 38 (radius: 25)
         - after brlen: logLH = -10351275.476474
[08:39:26 -10351275.476474] FAST spr round 39 (radius: 25)
         - after brlen: logLH = -10351260.626399
[08:55:03 -10351260.626399] FAST spr round 40 (radius: 25)
         - after brlen: logLH = -10351258.278455
[09:10:30 -10351258.278455] FAST spr round 41 (radius: 25)
         - after brlen: logLH = -10351257.413502
[09:25:46 -10351257.413502] FAST spr round 42 (radius: 25)
         - after brlen: logLH = -10351257.122438
[09:41:02 -10351257.122438] FAST spr round 43 (radius: 25)
         - after brlen: logLH = -10351257.017756
[09:56:22 -10351257.017756] FAST spr round 44 (radius: 25)
         - after brlen: logLH = -10351257.017756
[10:11:41 -10351257.017756] Model parameter optimization (eps = 1.000000)
         - after rates: logLH = -10349776.244735
         - after freqs: logLH = -10349682.139791
         - after alpha: logLH = -10349574.462679
         - after p-inv: logLH = -10349574.462679
         - after freeR: logLH = -10349574.462679
         - after brlen: logLH = -10349534.314637
Iteration 1: logLH = -10349534.314637
         - after rates: logLH = -10349380.046801
         - after freqs: logLH = -10349299.910029
         - after alpha: logLH = -10349299.639095
         - after p-inv: logLH = -10349299.639095
         - after freeR: logLH = -10349299.639095
         - after brlen: logLH = -10349233.591349
Iteration 2: logLH = -10349233.591349
         - after rates: logLH = -10349220.929805
         - after freqs: logLH = -10349197.990022
         - after alpha: logLH = -10349188.612289
         - after p-inv: logLH = -10349188.612289
         - after freeR: logLH = -10349188.612289
         - after brlen: logLH = -10349157.719741
Iteration 3: logLH = -10349157.719741
         - after rates: logLH = -10349154.688967
         - after freqs: logLH = -10349152.227538
         - after alpha: logLH = -10349148.468388
         - after p-inv: logLH = -10349148.468388
         - after freeR: logLH = -10349148.468388
         - after brlen: logLH = -10349140.074094
Iteration 4: logLH = -10349140.074094
         - after rates: logLH = -10349079.391843
         - after freqs: logLH = -10349062.172494
         - after alpha: logLH = -10349062.172494
         - after p-inv: logLH = -10349062.172494
         - after freeR: logLH = -10349062.172494
         - after brlen: logLH = -10349055.367669
Iteration 5: logLH = -10349055.367669
         - after rates: logLH = -10349051.747483
         - after freqs: logLH = -10349049.803986
         - after alpha: logLH = -10349049.803986
         - after p-inv: logLH = -10349049.803986
         - after freeR: logLH = -10349049.803986
         - after brlen: logLH = -10349047.915509
Iteration 6: logLH = -10349047.915509
         - after rates: logLH = -10349001.372825
         - after freqs: logLH = -10348996.709057
         - after alpha: logLH = -10348995.707157
         - after p-inv: logLH = -10348995.707157
         - after freeR: logLH = -10348995.707157
         - after brlen: logLH = -10348993.074776
Iteration 7: logLH = -10348993.074776
         - after rates: logLH = -10348989.464290
         - after freqs: logLH = -10348988.581751
         - after alpha: logLH = -10348987.916689
         - after p-inv: logLH = -10348987.916689
         - after freeR: logLH = -10348987.916689
         - after brlen: logLH = -10348986.053077
Iteration 8: logLH = -10348986.053077
         - after rates: logLH = -10348985.145344
         - after freqs: logLH = -10348984.964685
         - after alpha: logLH = -10348984.952975
         - after p-inv: logLH = -10348984.952975
         - after freeR: logLH = -10348984.952975
         - after brlen: logLH = -10348983.522732
Iteration 9: logLH = -10348983.522732
         - after rates: logLH = -10348983.200599
         - after freqs: logLH = -10348983.157370
         - after alpha: logLH = -10348983.157370
         - after p-inv: logLH = -10348983.157370
         - after freeR: logLH = -10348983.157370
         - after brlen: logLH = -10348983.006569
Iteration 10: logLH = -10348983.006569
[10:14:26 -10348983.006569] SLOW spr round 1 (radius: 5)
         - after brlen: logLH = -10347710.411143
[10:39:31 -10347710.411143] SLOW spr round 2 (radius: 5)
         - after brlen: logLH = -10347562.616680
[11:04:07 -10347562.616680] SLOW spr round 3 (radius: 5)
         - after brlen: logLH = -10347438.191367
[11:28:24 -10347438.191367] SLOW spr round 4 (radius: 5)
         - after brlen: logLH = -10347363.209677
[11:52:23 -10347363.209677] SLOW spr round 5 (radius: 5)
         - after brlen: logLH = -10347352.031868
[12:15:43 -10347352.031868] SLOW spr round 6 (radius: 5)
         - after brlen: logLH = -10347350.366705
[12:38:54 -10347350.366705] SLOW spr round 7 (radius: 5)
         - after brlen: logLH = -10347345.307373
[13:02:06 -10347345.307373] SLOW spr round 8 (radius: 5)
         - after brlen: logLH = -10347343.897947
[13:24:47 -10347343.897947] SLOW spr round 9 (radius: 5)
         - after brlen: logLH = -10347342.808666
[13:47:44 -10347342.808666] SLOW spr round 10 (radius: 5)
         - after brlen: logLH = -10347338.980500
[14:11:03 -10347338.980500] SLOW spr round 11 (radius: 5)
         - after brlen: logLH = -10347338.980500
[14:34:15 -10347338.980500] SLOW spr round 12 (radius: 10)
         - after brlen: logLH = -10347050.078766
[14:59:52 -10347050.078766] SLOW spr round 13 (radius: 5)
         - after brlen: logLH = -10346990.991364
[15:29:43 -10346990.991364] SLOW spr round 14 (radius: 5)
         - after brlen: logLH = -10346904.424879
[15:56:10 -10346904.424879] SLOW spr round 15 (radius: 5)
         - after brlen: logLH = -10346883.152208
[16:20:15 -10346883.152208] SLOW spr round 16 (radius: 5)
         - after brlen: logLH = -10346875.243318
[16:43:17 -10346875.243318] SLOW spr round 17 (radius: 5)
         - after brlen: logLH = -10346874.840574
[17:06:05 -10346874.840574] SLOW spr round 18 (radius: 5)
         - after brlen: logLH = -10346872.440384
[17:29:03 -10346872.440384] SLOW spr round 19 (radius: 5)
         - after brlen: logLH = -10346872.440384
[17:51:37 -10346872.440384] SLOW spr round 20 (radius: 10)
         - after brlen: logLH = -10346858.054793
[18:15:44 -10346858.054793] SLOW spr round 21 (radius: 5)
         - after brlen: logLH = -10346804.247108
[18:44:54 -10346804.247108] SLOW spr round 22 (radius: 5)
         - after brlen: logLH = -10346804.247108
[19:10:39 -10346804.247108] SLOW spr round 23 (radius: 10)
         - after brlen: logLH = -10346761.657173
[19:36:35 -10346761.657173] SLOW spr round 24 (radius: 5)
         - after brlen: logLH = -10346701.226583
[20:05:56 -10346701.226583] SLOW spr round 25 (radius: 5)
         - after brlen: logLH = -10346700.950547
[20:31:23 -10346700.950547] SLOW spr round 26 (radius: 5)
         - after brlen: logLH = -10346700.950547
[20:54:48 -10346700.950547] SLOW spr round 27 (radius: 10)
         - after brlen: logLH = -10346663.314801
[21:20:00 -10346663.314801] SLOW spr round 28 (radius: 5)
         - after brlen: logLH = -10346617.223015
[21:49:52 -10346617.223015] SLOW spr round 29 (radius: 5)
         - after brlen: logLH = -10346617.223015
[22:15:01 -10346617.223015] SLOW spr round 30 (radius: 10)
         - after brlen: logLH = -10346614.904506
[22:40:49 -10346614.904506] SLOW spr round 31 (radius: 5)
         - after brlen: logLH = -10346524.197975
[23:10:32 -10346524.197975] SLOW spr round 32 (radius: 5)
         - after brlen: logLH = -10346522.361354
[23:35:44 -10346522.361354] SLOW spr round 33 (radius: 5)
         - after brlen: logLH = -10346522.361354
[23:59:23 -10346522.361354] SLOW spr round 34 (radius: 10)
         - after brlen: logLH = -10346522.361354
[24:24:20 -10346522.361354] SLOW spr round 35 (radius: 15)
         - after brlen: logLH = -10346440.916789
[25:10:31 -10346440.916789] SLOW spr round 36 (radius: 5)
         - after brlen: logLH = -10346417.940650
[25:41:11 -10346417.940650] SLOW spr round 37 (radius: 5)
         - after brlen: logLH = -10346417.940650
[26:07:39 -10346417.940650] SLOW spr round 38 (radius: 10)
         - after brlen: logLH = -10346417.940650
[26:34:23 -10346417.940650] SLOW spr round 39 (radius: 15)
         - after brlen: logLH = -10346383.511388
[27:18:57 -10346383.511388] SLOW spr round 40 (radius: 5)
         - after brlen: logLH = -10346363.066663
[27:49:52 -10346363.066663] SLOW spr round 41 (radius: 5)
         - after brlen: logLH = -10346363.066663
[28:16:19 -10346363.066663] SLOW spr round 42 (radius: 10)
         - after brlen: logLH = -10346363.066663
[28:43:06 -10346363.066663] SLOW spr round 43 (radius: 15)
         - after brlen: logLH = -10346353.271026
[29:27:54 -10346353.271026] SLOW spr round 44 (radius: 5)
         - after brlen: logLH = -10346350.970242
[29:58:42 -10346350.970242] SLOW spr round 45 (radius: 5)
         - after brlen: logLH = -10346324.073651
[30:25:39 -10346324.073651] SLOW spr round 46 (radius: 5)
         - after brlen: logLH = -10346300.343150
[30:50:35 -10346300.343150] SLOW spr round 47 (radius: 5)
         - after brlen: logLH = -10346299.867960
[31:13:45 -10346299.867960] SLOW spr round 48 (radius: 5)
         - after brlen: logLH = -10346299.867960
[31:36:29 -10346299.867960] SLOW spr round 49 (radius: 10)
         - after brlen: logLH = -10346299.867960
[32:00:48 -10346299.867960] SLOW spr round 50 (radius: 15)
         - after brlen: logLH = -10346299.867960
[32:48:09 -10346299.867960] SLOW spr round 51 (radius: 20)
         - after brlen: logLH = -10346299.867960
[34:04:05 -10346299.867960] SLOW spr round 52 (radius: 25)
         - after brlen: logLH = -10346299.867960
[35:42:37 -10346299.867960] Model parameter optimization (eps = 0.100000)
         - after rates: logLH = -10346299.698036
         - after freqs: logLH = -10346299.396250
         - after alpha: logLH = -10346299.396250
         - after p-inv: logLH = -10346299.396250
         - after freeR: logLH = -10346299.396250
         - after brlen: logLH = -10346299.331205
Iteration 1: logLH = -10346299.331205
         - after rates: logLH = -10346299.304296
         - after freqs: logLH = -10346299.290322
         - after alpha: logLH = -10346299.290322
         - after p-inv: logLH = -10346299.290322
         - after freeR: logLH = -10346299.290322
         - after brlen: logLH = -10346299.268211
Iteration 2: logLH = -10346299.268211

[35:42:51] ML tree search #1, logLikelihood: -10346299.268211

[35:42:52 -358718771.056143] Initial branch length optimization
         - after brlen: logLH = -0.000000
terminate called recursively
terminate called recursively
/var/spool/slurm/d/job13733182/slurm_script: line 29: 63733 Aborted                 (core dumped) raxml-ng --all --msa SNP.parse.raxml.rba --model GTR+G+ASC_FELS{$invar_sites} --prefix SNP --threads 24 --workers 1 --bs-trees 100 --log debug

I didn't set up a way of tracking the memory usage throughout the job, but at least in the beginning, it was using around a quarter of the available memory (~200G) on the system.

(The very first run I did had access to 1 TB of memory, but I ran out my cpu hour allocation on that system)

0xaf1f commented 1 year ago

Hi, I know this overlapped with the holiday season, but I wanted to make sure it isn't forgotten. My allocation on the cluster expired and I'll see if we can get a new one, but I did manage to get that debug output for you before that happened.

many thanks for all your help

amkozlov commented 1 year ago

Thank you, this debug output was very helpful.

Here are a couple of things you can try:

  1. Add --redo to ignore checkpoint and start a fresh run,
  2. Add --lh-epsilon 10 to accelerate the tree search (convergence is very slow here, which is not uncommon on datasets with thousands of taxa)
  3. Add --blopt nr_safe option (this enables additional checks which makes inference slower but more robust)

As a side note. according to the log file, your system has 128 cores and 250GB of memory. This means that: a) 200GB would be pretty close to the limit, b) you can probably use more than 24 threads (if despite the above, it turns out memory is not the constraint).

0xaf1f commented 1 year ago

sorry for the delay, but I've gotten renewed access to the cluster and just run it with these three settings. Now I've gotten a new error right before the core dump:


RAxML-NG v. 1.1.0 released on 29.11.2021 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

System: AMD EPYC 7742 64-Core Processor, 128 cores, 251 GB RAM

RAxML-NG was called at 01-Feb-2023 13:05:50 as follows:

raxml-ng --redo --lh-epsilon 10 --blopt nr_safe --all --msa SNP.parse.raxml.rba --model GTR+G+ASC_FELS{4087123} --prefix SNP --threads 24 --workers 1 --bs-trees 100 --log debug

Analysis options:
  run mode: ML tree search + bootstrapping (Felsenstein Bootstrap)
  start tree(s): random (10) + parsimony (10)
  bootstrap replicates: 100
  random seed: 1675274750
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  branch lengths: proportional (ML estimate, algorithm: NR-SAFE)
  SIMD kernels: AVX2
  parallelization: PTHREADS (24 threads), thread pinning: OFF

RBA partial loading: OFF
WARNING: Running in REDO mode: existing checkpoints are ignored, and all result files will be overwritten!

WARNING: The model you specified on the command line (GTR+G+ASC_FELS{4087123}) will be ignored 
         since the binary MSA file already contains a model definition.
         If you want to change the model, please re-run RAxML-NG 
         with the original PHYLIP/FASTA alignment and --redo option.

[00:00:00] Loading binary alignment from file: SNP.parse.raxml.rba
[00:00:04] Alignment comprises 4832 taxa, 1 partitions and 82354 patterns

Partition 0: noname
Model: GTR+FO+G4m+ASC_FELS{4087123}
Alignment sites / patterns: 324409 / 82354
Gaps: 0.00 %
Invariant sites: 0.00 %

|noname|   |GTR+FO+G4m+ASC_FELS{4087123}|   |1-324409|

NOTE: Per-rate scalers were automatically enabled to prevent numerical issues on taxa-rich alignments.
NOTE: You can use --force switch to skip this check and fall back to per-site scalers.

Generating a random starting tree with 4832 taxa
Parallel reduction/worker buffer size: 1 KB  / 0 KB

[00:00:04] Generating 10 random starting tree(s) with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
[00:00:04] Generating 10 parsimony starting tree(s) with 4832 taxa
Generating a parsimony starting tree with 4832 taxa
Parsimony seed: 1558950843
Parsimony score of the starting tree: 940193
Generating a parsimony starting tree with 4832 taxa
Parsimony seed: 848384151
Parsimony score of the starting tree: 939019
Generating a parsimony starting tree with 4832 taxa
Parsimony seed: 1052997793
Parsimony score of the starting tree: 939156
Generating a parsimony starting tree with 4832 taxa
Parsimony seed: 400527839
Parsimony score of the starting tree: 938570
Generating a parsimony starting tree with 4832 taxa
Parsimony seed: 1112046720
Parsimony score of the starting tree: 938814
Generating a parsimony starting tree with 4832 taxa
Parsimony seed: 222282738
Parsimony score of the starting tree: 939227
Generating a parsimony starting tree with 4832 taxa
Parsimony seed: 297541634
Parsimony score of the starting tree: 940208
Generating a parsimony starting tree with 4832 taxa
Parsimony seed: 1044300830
Parsimony score of the starting tree: 938393
Generating a parsimony starting tree with 4832 taxa
Parsimony seed: 1611091263
Parsimony score of the starting tree: 939272
Generating a parsimony starting tree with 4832 taxa
Parsimony seed: 570470538
Parsimony score of the starting tree: 940064

Initial model parameters:
   Partition: noname
   Rate heterogeneity: GAMMA (4 cats, mean),  alpha: 1.000000 (ML),  weights&rates: (0.250000,0.136954) (0.250000,0.476752) (0.250000,1.000000) (0.250000,2.386294) 
   Base frequencies (ML): 0.250000 0.250000 0.250000 0.250000 
   Substitution rates (ML): 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 

[00:47:05] Data distribution: max. partitions/sites/weight per thread: 1 / 3432 / 54912

thread# part#   start   length  weight
0       0       78936   3418    54688

1       0       75504   3432    54912

2       0       72072   3432    54912

3       0       68640   3432    54912

4       0       65208   3432    54912

5       0       61776   3432    54912

6       0       58344   3432    54912

7       0       54912   3432    54912

8       0       51480   3432    54912

9       0       48048   3432    54912

10      0       44616   3432    54912

11      0       41184   3432    54912

12      0       37752   3432    54912

13      0       34320   3432    54912

14      0       30888   3432    54912

15      0       27456   3432    54912

16      0       24024   3432    54912

17      0       20592   3432    54912

18      0       17160   3432    54912

19      0       13728   3432    54912

20      0       10296   3432    54912

21      0       6864    3432    54912

22      0       3432    3432    54912

23      0       0       3432    54912

Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
Generating a random starting tree with 4832 taxa
[00:47:05] Data distribution: max. searches per worker: 120

BARRIER time: 0.007015 ms

Starting ML tree search with 20 distinct starting trees

[00:47:09 -614172163.674568] Initial branch length optimization
         - after brlen: logLH = -0.000000

ERROR: ERROR in branch length optimization (LIBPLL-2210): Wrong likelihood derivatives

terminate called recursively
terminate called recursively
/var/spool/slurm/d/job14250675/slurm_script: line 32: 93657 Aborted                 (core dumped) raxml-ng --redo --lh-epsilon 10 --blopt nr_safe --all --msa SNP.parse.raxml.rba --model GTR+G+ASC_FELS{$invar_sites} --prefix SNP --threads 24 --workers 1 --bs-trees 100 --log debug
amkozlov commented 1 year ago

Sorry for the delay! Since I can't see any obvious reasons for this error, could you please send me you input file?

(Even better if you could reproduce this error on a smaller alignment)

Dived-Jin commented 1 year ago

Hi, I also see the same problem, Did you solve this problem?

amkozlov commented 1 year ago

Please send me your input files and full raxml log, and I will have a look.