amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0

Get a tree from checkpoint #132

Closed gaetansnl closed 2 years ago

gaetansnl commented 2 years ago

Hello, my computation crashed after 2 months of work. Is it possible to get a tree (even if it's not perfect) from the checkpoint file? Thank you.

amkozlov commented 2 years ago

Hi @gaetansnl,

With a recent version of raxml-ng (1.0+, I guess), you can find the most recently found tree in the $PREFIX.raxml.lastTree.TMP file.

With an older version, you can try a workaround described here:

https://groups.google.com/g/raxml/c/MkV1tYtz1tM/m/KhVx2maLBAAJ
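
Whichever route produces a tree file, it may be worth checking that the file is complete before relying on it. The following is a minimal Python sketch, not part of raxml-ng: it counts the tips of a Newick string by counting commas outside quoted labels and bracketed comments (a tree with n tips contains n - 1 such commas). The function name count_leaves and the file name in the usage comment are illustrative assumptions.

```python
def count_leaves(newick: str) -> int:
    """Count tips in a Newick string as (top-level commas) + 1.

    Commas inside single-quoted labels or [bracketed comments] are ignored.
    Assumes the input holds exactly one non-empty tree.
    """
    commas = 0
    in_quote = False
    comment_depth = 0
    for ch in newick:
        if in_quote:
            # A doubled '' inside a label simply leaves and re-enters quote mode.
            if ch == "'":
                in_quote = False
            continue
        if comment_depth:
            if ch == "[":
                comment_depth += 1
            elif ch == "]":
                comment_depth -= 1
            continue
        if ch == "'":
            in_quote = True
        elif ch == "[":
            comment_depth = 1
        elif ch == ",":
            commas += 1
    return commas + 1


# Hypothetical usage (the path is an assumption, adjust to your own prefix):
# with open("PREFIX.raxml.lastTree.TMP") as fh:
#     print(count_leaves(fh.read()))
```

If the count differs from the number of taxa that raxml-ng reports for the alignment, the file may be incomplete and should be treated with caution.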

gaetansnl commented 2 years ago

@amkozlov Thank you for your answer. It looks like the job has restarted. From the workaround you linked, I'm not sure which argument I should pass to stop the SLOW SPR round phase. I'm at SLOW SPR round 102.

amkozlov commented 2 years ago

SPR round 102 does not sound healthy. Could you please post your log file?

gaetansnl commented 2 years ago

@amkozlov It's probably because I'm trying to do something that is not very efficient (15903 taxa).

RAXML-NG v. 0.9.0 released on 20.05.2019 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

RAxML-NG was called at 09-Nov-2021 07:59:28 as follows:

raxml-ng --msa T1.raxml.rba --model BIN+G --threads 12 --tree pars{1} --seed 12345 --outgroup ERR2512533 --prefix T5

Analysis options:
  run mode: ML tree search
  start tree(s): parsimony (1)
  random seed: 12345
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  fast spr radius: AUTO
  spr subtree cutoff: 1.000000
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX2
  parallelization: PTHREADS (12 threads), thread pinning: OFF

WARNING: The model you specified on the command line (BIN+G) will be ignored
         since the binary MSA file already contains a model definition.
         If you want to change the model, please re-run RAxML-NG
         with the original PHYLIP/FASTA alignment and --redo option.

[00:00:00] Loading binary alignment from file: T1.raxml.rba
[00:00:00] Alignment comprises 15903 taxa, 1 partitions and 80043 patterns

Partition 0: noname
Model: BIN+FO+G4m
Alignment sites / patterns: 662598 / 80043
Gaps: 0.00 %
Invariant sites: 0.00 %

NOTE: Per-rate scalers were automatically enabled to prevent numerical issues on taxa-rich alignments.
NOTE: You can use --force switch to skip this check and fall back to per-site scalers.

[00:00:00] Generating 1 parsimony starting tree(s) with 15903 taxa
[21:59:48] Data distribution: max. partitions/sites/weight per thread: 1 / 6671 / 53368

Starting ML tree search with 1 distinct starting trees

[21:59:59 -293414059.857668] Initial branch length optimization
[22:11:54 -12870904.568663] Model parameter optimization (eps = 10.000000)
[22:48:10 -11159847.070915] AUTODETECT spr round 1 (radius: 5)
[30:53:41 -10960549.341324] AUTODETECT spr round 2 (radius: 10)
[40:22:03 -10901423.348361] AUTODETECT spr round 3 (radius: 15)
[53:04:58 -10872535.743633] AUTODETECT spr round 4 (radius: 20)
[70:53:13 -10863029.446000] AUTODETECT spr round 5 (radius: 25)
[95:40:29 -10856818.179611] SPR radius for FAST iterations: 25 (autodetect)
[95:40:29 -10856818.179611] Model parameter optimization (eps = 3.000000)
[95:46:28 -10856590.499556] FAST spr round 1 (radius: 25)
[133:43:44 -10792307.197406] FAST spr round 2 (radius: 25)
[163:27:18 -10786562.711287] FAST spr round 3 (radius: 25)
[182:56:49 -10785582.785149] FAST spr round 4 (radius: 25)
[196:40:56 -10785428.784612] FAST spr round 5 (radius: 25)
[207:14:34 -10785405.816207] FAST spr round 6 (radius: 25)
[216:20:29 -10785393.740411] FAST spr round 7 (radius: 25)
[224:46:30 -10785389.165860] FAST spr round 8 (radius: 25)
[232:51:05 -10785388.061056] FAST spr round 9 (radius: 25)
[240:50:38 -10785350.752960] FAST spr round 10 (radius: 25)
[248:43:41 -10785343.003572] FAST spr round 11 (radius: 25)
[256:34:45 -10785335.336172] FAST spr round 12 (radius: 25)
[264:23:10 -10785334.672494] FAST spr round 13 (radius: 25)
[272:09:23 -10785334.442445] FAST spr round 14 (radius: 25)
[279:55:25 -10785334.402835] Model parameter optimization (eps = 1.000000)
[279:58:49 -10785316.027916] SLOW spr round 1 (radius: 5)
[289:12:36 -10783345.518830] SLOW spr round 2 (radius: 5)
[298:27:23 -10783008.268931] SLOW spr round 3 (radius: 5)
[307:49:31 -10782906.637499] SLOW spr round 4 (radius: 5)
[316:47:08 -10782883.723240] SLOW spr round 5 (radius: 5)
[330:59:42 -10782875.460545] SLOW spr round 6 (radius: 5)
[344:04:37 -10782875.460525] SLOW spr round 7 (radius: 10)
[354:20:54 -10782558.813202] SLOW spr round 8 (radius: 5)
[365:25:33 -10782506.848746] SLOW spr round 9 (radius: 5)
[375:41:18 -10782495.488195] SLOW spr round 10 (radius: 5)
[385:26:13 -10782495.477390] SLOW spr round 11 (radius: 10)
[395:18:23 -10782481.074738] SLOW spr round 12 (radius: 5)
[406:00:42 -10782470.907979] SLOW spr round 13 (radius: 5)
[415:58:35 -10782469.273414] SLOW spr round 14 (radius: 5)
[425:43:04 -10782469.273414] SLOW spr round 15 (radius: 10)
[435:43:24 -10782458.594191] SLOW spr round 16 (radius: 5)
[446:36:51 -10782450.948038] SLOW spr round 17 (radius: 5)
[456:43:42 -10782449.989584] SLOW spr round 18 (radius: 5)
[466:28:04 -10782449.989584] SLOW spr round 19 (radius: 10)
[476:26:51 -10782446.537403] SLOW spr round 20 (radius: 5)
[488:34:36 -10782443.161471] SLOW spr round 21 (radius: 5)
[500:27:02 -10782443.161471] SLOW spr round 22 (radius: 10)
[512:19:39 -10782440.813544] SLOW spr round 23 (radius: 5)
[523:42:59 -10782436.723342] SLOW spr round 24 (radius: 5)
[533:39:18 -10782436.723342] SLOW spr round 25 (radius: 10)
[543:49:27 -10782434.596504] SLOW spr round 26 (radius: 5)
[554:25:24 -10782434.596492] SLOW spr round 27 (radius: 10)
[565:06:39 -10782424.166806] SLOW spr round 28 (radius: 5)
[575:43:01 -10782392.405001] SLOW spr round 29 (radius: 5)
[585:39:00 -10782390.716591] SLOW spr round 30 (radius: 5)
[595:12:25 -10782390.604728] SLOW spr round 31 (radius: 5)
[604:37:37 -10782390.376568] SLOW spr round 32 (radius: 5)
[613:59:51 -10782390.376568] SLOW spr round 33 (radius: 10)
[623:52:03 -10782389.031396] SLOW spr round 34 (radius: 5)
[634:36:41 -10782379.328726] SLOW spr round 35 (radius: 5)
[644:32:39 -10782379.328726] SLOW spr round 36 (radius: 10)
[654:40:56 -10782378.520667] SLOW spr round 37 (radius: 5)
[665:25:05 -10782377.394202] SLOW spr round 38 (radius: 5)
[675:18:40 -10782377.394202] SLOW spr round 39 (radius: 10)
[685:26:27 -10782377.199840] SLOW spr round 40 (radius: 5)
[696:03:13 -10782375.077363] SLOW spr round 41 (radius: 5)
[706:06:08 -10782375.077346] SLOW spr round 42 (radius: 10)
[716:18:43 -10782374.954318] SLOW spr round 43 (radius: 5)
[726:59:50 -10782374.941518] SLOW spr round 44 (radius: 10)
[737:48:17 -10782373.376799] SLOW spr round 45 (radius: 5)
[748:32:38 -10782369.325228] SLOW spr round 46 (radius: 5)
[759:22:38 -10782369.125301] SLOW spr round 47 (radius: 5)
[769:02:20 -10782369.057493] SLOW spr round 48 (radius: 10)
[778:58:29 -10782368.987702] SLOW spr round 49 (radius: 15)
[794:25:23 -10782293.701680] SLOW spr round 50 (radius: 5)
[805:48:31 -10782211.355814] SLOW spr round 51 (radius: 5)
[815:55:46 -10782184.055114] SLOW spr round 52 (radius: 5)
[825:33:20 -10782184.055114] SLOW spr round 53 (radius: 10)
[836:31:23 -10782179.389076] SLOW spr round 54 (radius: 5)
[850:24:58 -10782157.801604] SLOW spr round 55 (radius: 5)
[860:22:36 -10782152.052412] SLOW spr round 56 (radius: 5)
[870:04:10 -10782152.052412] SLOW spr round 57 (radius: 10)
[880:02:01 -10782089.037641] SLOW spr round 58 (radius: 5)
[890:53:07 -10782058.361742] SLOW spr round 59 (radius: 5)
[900:56:55 -10782053.848090] SLOW spr round 60 (radius: 5)
[910:36:35 -10782053.847896] SLOW spr round 61 (radius: 10)
[920:32:54 -10782053.847896] SLOW spr round 62 (radius: 15)
[935:57:11 -10782038.437947] SLOW spr round 63 (radius: 5)
[947:02:05 -10782010.639652] SLOW spr round 64 (radius: 5)
[957:16:33 -10782010.639652] SLOW spr round 65 (radius: 10)
[967:32:52 -10782010.639652] SLOW spr round 66 (radius: 15)
[982:26:50 -10782006.792999] SLOW spr round 67 (radius: 5)
[993:31:42 -10782006.684045] SLOW spr round 68 (radius: 5)
[1003:45:53 -10782006.684045] SLOW spr round 69 (radius: 10)
[1014:01:44 -10782006.684045] SLOW spr round 70 (radius: 15)
[1028:54:11 -10782005.482323] SLOW spr round 71 (radius: 5)
[1040:01:15 -10782004.058443] SLOW spr round 72 (radius: 5)
[1050:15:40 -10782004.058443] SLOW spr round 73 (radius: 10)
[1060:32:01 -10782004.058443] SLOW spr round 74 (radius: 15)
[1075:24:50 -10782003.320979] SLOW spr round 75 (radius: 5)
[1086:29:52 -10782003.320979] SLOW spr round 76 (radius: 10)
[1097:44:30 -10782003.320979] SLOW spr round 77 (radius: 15)
[1112:03:34 -10782002.589903] SLOW spr round 78 (radius: 5)
[1123:09:28 -10782002.589903] SLOW spr round 79 (radius: 10)
[1134:26:34 -10782002.589903] SLOW spr round 80 (radius: 15)
[1148:43:59 -10782002.589903] SLOW spr round 81 (radius: 20)
[1176:17:33 -10780111.340019] SLOW spr round 82 (radius: 5)
[1187:57:24 -10778481.517401] SLOW spr round 83 (radius: 5)
[1198:30:53 -10778386.490948] SLOW spr round 84 (radius: 5)
[1208:20:38 -10778385.792395] SLOW spr round 85 (radius: 5)
[1217:56:26 -10778385.792395] SLOW spr round 86 (radius: 10)
[1227:51:06 -10778385.792395] SLOW spr round 87 (radius: 15)
[1243:17:22 -10778385.005615] SLOW spr round 88 (radius: 5)
[1254:20:13 -10778377.064104] SLOW spr round 89 (radius: 5)
[1264:33:38 -10778377.064104] SLOW spr round 90 (radius: 10)
[1274:47:52 -10778377.064104] SLOW spr round 91 (radius: 15)
[1289:34:38 -10778376.050245] SLOW spr round 92 (radius: 5)
[1300:34:46 -10778373.543558] SLOW spr round 93 (radius: 5)
[1310:47:23 -10778373.543558] SLOW spr round 94 (radius: 10)
[1321:01:37 -10778373.543558] SLOW spr round 95 (radius: 15)
[1335:45:38 -10778373.543558] SLOW spr round 96 (radius: 20)
[1361:47:00 -10778123.945683] SLOW spr round 97 (radius: 5)
[1373:16:06 -10777558.955704] SLOW spr round 98 (radius: 5)
[1383:44:34 -10777552.987721] SLOW spr round 99 (radius: 5)
[1393:40:00 -10777551.158357] SLOW spr round 100 (radius: 5)
[1403:13:10 -10777551.158352] SLOW spr round 101 (radius: 10)
[1413:05:41 -10777551.158352] SLOW spr round 102 (radius: 15)

RAxML-NG v. 0.9.0 released on 20.05.2019 by The Exelixis Lab.
Developed by: Alexey M. Kozlov and Alexandros Stamatakis.
Contributors: Diego Darriba, Tomas Flouri, Benoit Morel, Sarah Lutteropp, Ben Bettisworth.
Latest version: https://github.com/amkozlov/raxml-ng
Questions/problems/suggestions? Please visit: https://groups.google.com/forum/#!forum/raxml

RAxML-NG was called at 10-Jan-2022 08:22:39 as follows:

raxml-ng --msa T1.raxml.rba --model BIN+G --threads 12 --tree pars{1} --seed 12345 --outgroup ERR2512533 --prefix T5

Analysis options:
  run mode: ML tree search
  start tree(s): parsimony (1)
  random seed: 12345
  tip-inner: OFF
  pattern compression: ON
  per-rate scalers: OFF
  site repeats: ON
  fast spr radius: AUTO
  spr subtree cutoff: 1.000000
  branch lengths: proportional (ML estimate, algorithm: NR-FAST)
  SIMD kernels: AVX2
  parallelization: PTHREADS (12 threads), thread pinning: OFF

WARNING: The model you specified on the command line (BIN+G) will be ignored
         since the binary MSA file already contains a model definition.
         If you want to change the model, please re-run RAxML-NG
         with the original PHYLIP/FASTA alignment and --redo option.

[00:00:00] Loading binary alignment from file: T1.raxml.rba
[00:00:01] Alignment comprises 15903 taxa, 1 partitions and 80043 patterns

Partition 0: noname
Model: BIN+FO+G4m
Alignment sites / patterns: 662598 / 80043
Gaps: 0.00 %
Invariant sites: 0.00 %

NOTE: Per-rate scalers were automatically enabled to prevent numerical issues on taxa-rich alignments.
NOTE: You can use --force switch to skip this check and fall back to per-site scalers.

[00:00:01] NOTE: Resuming execution from checkpoint (logLH: -10777551.16, ML trees: 0, bootstraps: 0)
[00:00:01] Data distribution: max. partitions/sites/weight per thread: 1 / 6671 / 53368

Starting ML tree search with 1 distinct starting trees

[00:00:12 -10777551.158352] SPR radius for FAST iterations: 25 (autodetect)
[00:00:12 -10777551.158352] SLOW spr round 102 (radius: 15)

amkozlov commented 2 years ago

OK, I see. You can enforce faster convergence by increasing the epsilon, e.g. --lh-epsilon 1.

Also, please update to the latest raxml-ng at your earliest convenience, since v0.9 is very old.
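
To judge how much a larger epsilon might shorten the search, one can look at how much each SPR round actually gained. The sketch below is a hypothetical helper, not a raxml-ng feature: it parses the round lines of a log like the one above and reports the log-likelihood gain per round. It assumes the value printed on each line is the score at the start of that round (as the resume line suggests), so the gain of round N is the difference to the next line. How exactly raxml-ng applies --lh-epsilon internally is not shown in this thread, so the threshold comparison is only indicative.

```python
import re

# Three round lines copied from the log in this thread.
SAMPLE_LOG = """\
[375:41:18 -10782495.488195] SLOW spr round 10 (radius: 5)
[385:26:13 -10782495.477390] SLOW spr round 11 (radius: 10)
[395:18:23 -10782481.074738] SLOW spr round 12 (radius: 5)
"""

# Matches lines such as "[HHH:MM:SS -LOGLH] SLOW spr round N (radius: R)".
ROUND_RE = re.compile(
    r"^\[(\d+):(\d{2}):(\d{2}) (-?\d+\.\d+)\] "
    r"(AUTODETECT|FAST|SLOW) spr round (\d+) \(radius: (\d+)\)"
)


def parse_rounds(log_text):
    """Return one dict per SPR round line found in the log text."""
    rounds = []
    for line in log_text.splitlines():
        match = ROUND_RE.match(line.strip())
        if match is None:
            continue  # skip headers, notes and other non-round lines
        hours, minutes, seconds, loglh, phase, number, radius = match.groups()
        rounds.append({
            "hours": int(hours) + int(minutes) / 60 + int(seconds) / 3600,
            "loglh": float(loglh),
            "phase": phase,
            "round": int(number),
            "radius": int(radius),
        })
    return rounds


def round_gains(rounds, phase="SLOW"):
    """Return (round, radius, gain) using consecutive lines of one phase."""
    selected = [r for r in rounds if r["phase"] == phase]
    return [
        (cur["round"], cur["radius"], nxt["loglh"] - cur["loglh"])
        for cur, nxt in zip(selected, selected[1:])
    ]


def small_gain_rounds(rounds, threshold, phase="SLOW"):
    """Rounds whose gain stayed below the chosen (illustrative) threshold."""
    return [g for g in round_gains(rounds, phase) if g[2] < threshold]


if __name__ == "__main__":
    for number, radius, gain in round_gains(parse_rounds(SAMPLE_LOG)):
        print(f"round {number} (radius {radius}): gain {gain:.6f}")
```

Running small_gain_rounds over the full log with a few candidate thresholds gives a rough picture of how many rounds produced only marginal gains.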

gaetansnl commented 2 years ago

@amkozlov Thank you for your help. I'm trying this solution.

gaetansnl commented 2 years ago

@amkozlov Is it possible to open 0.9 backups with version 1.1.0? It seems I can't successfully restart with 0.9.

amkozlov commented 2 years ago

Hm, I'm not 100% sure it works, so why don't you just try it out? :)

gaetansnl commented 2 years ago

I tried, but it seems to start computing a new tree :/

amkozlov commented 2 years ago

OK, then the checkpoint file format has probably changed. You can finish this run with v0.9 and use v1.1 for your next analysis.

gaetansnl commented 2 years ago

The problem is that when I rerun with 0.9, it crashes after 1 or 2 hours. Is there another way to convert the checkpoint file to a tree? This little file is worth 2 months of computing 😄

amkozlov commented 2 years ago

Hm, does it fail after restarting with --lh-epsilon 1?

In order to extract the tree from a checkpoint, you would need to modify the source code of v0.9 and add something like

    NewickStream ns1("currentTree.nw");
    ns1 << _checkp.tree;

in here:

https://github.com/amkozlov/raxml-ng/blob/0.9.0/src/Checkpoint.cpp#L45

gaetansnl commented 2 years ago

@amkozlov Thank you for your answer. It fails with any config (it seems to be at the end of the computation), and it also fails with versions >v1. I only work on this project on Mondays, sorry for the late response. I will try to debug why it is failing.

I'm trying to compile 0.9.0 and I get the error below. I also tried to compile the latest version and that works, but 0.9.0 does not.

/home/gaetan/raxmlng-build/raxml-ng/src/TreeInfo.cpp: In member function ‘double TreeInfo::optimize_params(int, double)’:
/home/gaetan/raxmlng-build/raxml-ng/src/TreeInfo.cpp:341:78: error: too few arguments to function ‘double pllmod_algo_opt_rates_weights_treeinfo(pllmod_treeinfo_t*, double, double, double, double, double, double)’
  341 |                                                           RAXML_PARAM_EPSILON);
      |                                                                              ^
In file included from /home/gaetan/raxmlng-build/raxml-ng/src/common.h:18,
                 from /home/gaetan/raxmlng-build/raxml-ng/src/TreeInfo.hpp:4,
                 from /home/gaetan/raxmlng-build/raxml-ng/src/TreeInfo.cpp:3:
/home/gaetan/raxmlng-build/raxml-ng/build/localdeps/include/libpll/pllmod_algorithm.h:168:8: note: declared here
  168 | double pllmod_algo_opt_rates_weights_treeinfo (pllmod_treeinfo_t * treeinfo,
      |        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/gaetan/raxmlng-build/raxml-ng/src/TreeInfo.cpp: In member function ‘void TreeInfo::set_topology_constraint(const Tree&)’:
/home/gaetan/raxmlng-build/raxml-ng/src/TreeInfo.cpp:387:91: error: too few arguments to function ‘int pllmod_treeinfo_set_constraint_tree(pllmod_treeinfo_t*, const pll_utree_t*, int)’
  387 |     int retval = pllmod_treeinfo_set_constraint_tree(_pll_treeinfo, &cons_tree.pll_utree());
      |                                                                                           ^
In file included from /home/gaetan/raxmlng-build/raxml-ng/src/common.h:16,
                 from /home/gaetan/raxmlng-build/raxml-ng/src/TreeInfo.hpp:4,
                 from /home/gaetan/raxmlng-build/raxml-ng/src/TreeInfo.cpp:3:
/home/gaetan/raxmlng-build/raxml-ng/build/localdeps/include/libpll/pll_tree.h:739:16: note: declared here
  739 | PLL_EXPORT int pllmod_treeinfo_set_constraint_tree(pllmod_treeinfo_t * treeinfo,
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[2]: *** [src/CMakeFiles/raxml_module.dir/build.make:245: src/CMakeFiles/raxml_module.dir/TreeInfo.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:786: src/CMakeFiles/raxml_module.dir/all] Error 2
make: *** [Makefile:141: all] Error 2

amkozlov commented 2 years ago

It seems like you forgot to update the submodules after switching to v0.9 (git submodule update --recursive).

Even better, just clone the v0.9 tag into a separate directory:

git clone --recursive -b 0.9.0 https://github.com/amkozlov/raxml-ng raxml-ng-v09

gaetansnl commented 2 years ago

@amkozlov Thank you so much! I managed to get the tree! Here is the error I got. I thought maybe it's because the outgroup I use isn't in the tree (I forgot it)? [image]

amkozlov commented 2 years ago

OK, great! So can we close this one, since your original problem (getting a tree from a checkpoint) has been solved?

gaetansnl commented 2 years ago

Yes, thank you!