amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0

Core dumped when run restarted after all bootstraps were done #57

Closed by terrycojones 2 years ago

terrycojones commented 5 years ago

I did a raxml-ng (0.7.0 BETA running on Linux) run tonight that I had to restart a couple of times. On the second restart the run got almost all the way through but ran out of memory and was killed by the (SLURM) job control system. The run used --threads 1 --bs-trees 1000 and the output log ended with:

[03:21:46] Bootstrap tree #997, logLikelihood: -81089.162390
[03:21:54] Bootstrap tree #998, logLikelihood: -81794.630050
[03:22:03] Bootstrap tree #999, logLikelihood: -82436.916214
[03:22:13] Bootstrap tree #1000, logLikelihood: -81751.234543
/var/spool/slurm/slurmd/job7162547/slurm_script: line 9: 31189 Killed                  raxml-ng --threads 1 --bs-trees 1000 --all --msa L-realigned-by-translation-2.fasta --model GTR --data-type DNA
slurmstepd: error: Exceeded step memory limit at some point.
slurmstepd: error: Exceeded job memory limit at some point.
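
Before restarting a killed run, it can help to confirm how far the bootstrap phase got according to the log. The following is a minimal sketch, assuming the log uses the `Bootstrap tree #N` line format shown above; `last_bootstrap` is a hypothetical helper name, not a raxml-ng command.

```shell
# Hypothetical helper (not part of raxml-ng): print the highest bootstrap
# replicate number recorded in a log file, assuming lines of the form
# "[hh:mm:ss] Bootstrap tree #N, logLikelihood: ...".
last_bootstrap() {
    # Extract every "Bootstrap tree #N" match, keep only N, and take the maximum.
    grep -o 'Bootstrap tree #[0-9]*' "$1" | sed 's/.*#//' | sort -n | tail -n 1
}
```

For the run described here, this would be called as `last_bootstrap L-realigned-by-translation-2.fasta.raxml.log`. It prints nothing if the log contains no bootstrap lines.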

There was a .ckp file in place, so I just restarted the job. That resulted in a core dump. The log file shows the SEGV:

[00:00:01] NOTE: Resuming execution from checkpoint (logLH: -81751.23, ML trees: 20, bootstraps: 1000)
[00:00:01] Data distribution: max. partitions/sites/weight per thread: 1 / 3583 / 14332
/var/spool/slurm/slurmd/job7166485/slurm_script: line 9:  6950 Segmentation fault      (core dumped) raxml-ng --threads 1 --bs-trees 1000 --all --msa L-realigned-by-translation-2.fasta --model GTR --data-type DNA

I ran gdb with the core file:

$ gdb /usr/local/bin/raxml-ng core.6950
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /rds/project/djs200/rds-djs200-acorg/bt/packages/raxml-ng_v0.7.0/raxml-ng...(no debugging symbols found)...done.
[New LWP 6950]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `raxml-ng --threads 1 --bs-trees 1000 --all --msa L-realigned-by-translation-2.f'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000049f1ae in cb_get_splits ()
(gdb) bt
#0  0x000000000049f1ae in cb_get_splits ()
#1  0x0000000000499326 in utree_traverse_apply.part.6 ()
#2  0x000000000049944e in utree_traverse_apply.part.6 ()
#3  0x000000000049944e in utree_traverse_apply.part.6 ()
#4  0x000000000049944e in utree_traverse_apply.part.6 ()
#5  0x0000000000499566 in pllmod_utree_traverse_apply ()
#6  0x000000000049feb5 in pllmod_utree_split_create ()
#7  0x0000000000442f03 in BootstrapTree::add_splits_to_hashtable(pll_unode_s const&, bool) ()
#8  0x0000000000443294 in BootstrapTree::add_bootstrap_tree(Tree const&) ()
#9  0x0000000000462b2d in draw_bootstrap_support(RaxmlInstance&, Tree&, TreeCollection const&) ()
#10 0x000000000046d66f in master_main(RaxmlInstance&, CheckpointManager&) ()
#11 0x000000000046ebd9 in internal_main(int, char**, void*) ()
#12 0x00000000005bed24 in generic_start_main ()
#13 0x00000000005bee62 in __libc_start_main ()
#14 0x00000000004055f6 in _start ()
(gdb)

I'm not sure if I can provide the input alignment file or the checkpoint file (I don't know what it contains), as it's not my data. But I could probably compile raxml-ng with -g and run it again to get more info on the crash, in case that's needed/wanted. But maybe the above is enough to go on?

Any suggestions on whether I can convince raxml-ng to finish the job? I end up with the following output files:

$ ls -1 L-realigned-by-translation-2.fasta*
L-realigned-by-translation-2.fasta
L-realigned-by-translation-2.fasta.raxml.ckp
L-realigned-by-translation-2.fasta.raxml.log
L-realigned-by-translation-2.fasta.raxml.rba
L-realigned-by-translation-2.fasta.raxml.startTree

Thanks!

amkozlov commented 5 years ago

Thanks for reporting, I'll look into it.

For now, please try restarting your job with --bootstrap and --search (instead of --all), and then run raxml-ng again with --support to manually map bootstrap support values onto the best ML tree, i.e. something along these lines:

$ cp L-realigned-by-translation-2.fasta.raxml.ckp backup.ckp
$ cp backup.ckp mysearch.raxml.ckp
$ cp backup.ckp myboot.raxml.ckp

$ raxml-ng --bootstrap --threads 1 --bs-trees 1000  --msa L-realigned-by-translation-2.fasta --model GTR --data-type DNA --prefix myboot

$ raxml-ng --search --threads 1 --tree L-realigned-by-translation-2.fasta.raxml.startTree   --msa L-realigned-by-translation-2.fasta --model GTR --data-type DNA --prefix mysearch

$ raxml-ng --support --tree mysearch.raxml.bestTree --bs-trees myboot.raxml.bootstrap --prefix support

hope this works :)
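
The checkpoint-copying step in the commands above can be wrapped so that it never overwrites an existing checkpoint by accident. This is only a sketch: `clone_ckp` is a hypothetical helper, and the `<prefix>.raxml.ckp` naming is assumed from the file names used in this thread.

```shell
# Hypothetical helper (not part of raxml-ng): copy one checkpoint file to
# <prefix>.raxml.ckp for each prefix given, refusing to overwrite existing files.
clone_ckp() {
    src=$1
    shift
    # Stop early if the source checkpoint does not exist.
    [ -f "$src" ] || { echo "missing checkpoint: $src" >&2; return 1; }
    for prefix in "$@"; do
        dest="$prefix.raxml.ckp"
        if [ -e "$dest" ]; then
            echo "refusing to overwrite $dest" >&2
            return 1
        fi
        cp -- "$src" "$dest"
    done
}
```

With the file names from this thread, the call would be `clone_ckp backup.ckp mysearch myboot`, run after making the `backup.ckp` copy.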

terrycojones commented 5 years ago

OK, will do - thanks @amkozlov

amkozlov commented 5 years ago

@terrycojones : unfortunately, I cannot reproduce the error on my small test dataset, so it would be very helpful if you could provide the alignment (no need to post it here, you can send it to my e-mail).

how many taxa does it have?

terrycojones commented 5 years ago

Hi @amkozlov. The alignment has 25 sequences of length 12162. I re-ran it from scratch last night and there was no issue. So I don't think it's due to the alignment, but rather to the point at which the earlier run got interrupted before the restart. But I'm just making things up, based on it completing the 1000 bootstraps and then wondering if there might be some unanticipated condition in the code where for some reason it can't continue a run that was so close to being done. If I build raxml-ng with debugging turned on I'm not sure if the core file (which I still have) will be valid. Do you know?

amkozlov commented 5 years ago

@terrycojones thanks for the info. I'm not sure if the core file will still be valid. It could be that the checkpoint got corrupted. Have you tried to resume the run with --search or --bootstrap as I suggested above?

terrycojones commented 5 years ago

Hi @amkozlov I've not gotten back to this yet, sorry! Will try.

amkozlov commented 2 years ago

please feel free to reopen if this still happens with the latest raxml-ng