BenoitMorel / GeneRax

GNU Affero General Public License v3.0

Checkpoints to restart an optimization step if it gets interrupted #66

Open AAleotti opened 1 year ago

AAleotti commented 1 year ago

Hi!

I have been encountering the following issue:

I am running GeneRax on my university's high-performance computing platform, where there is a maximum time a job is allowed to run. Since my jobs run for a very long time, they often get interrupted. When I start a new job, I notice that if an optimization step had managed to finish before the job was interrupted, then GeneRax "saves" that work. However, if an optimization step was interrupted, the entire step starts from scratch.

I noticed, for example, that if the job is interrupted while GeneRax is in the middle of gene_optimization_3, then even though /gene_optimization_3/running_jobs/_families_out.txt seems to contain a lot of saved rounds of optimization, the file is rewritten from scratch when the new job starts.

I was wondering if I have missed something about how to restart a GeneRax run after an interruption so that it picks up where it left off, and not at the beginning of the latest optimization step? Or is there any workaround for this problem?

I am using GeneRax v1.2.3 installed with conda.

When I re-start a job I use the same command as previously used.

This is an example job submission script with the generax command that I use:

#!/bin/bash
#PBS -N Generax_CYP_rad3
#PBS -l walltime=500:00:00
#PBS -l vmem=78gb
#PBS -m bea
#PBS -M aa1176@le.ac.uk
#PBS -l nodes=1:ppn=28

# Set OMP_NUM_THREADS for OpenMP jobs
export OMP_NUM_THREADS=$PBS_NUM_PPN

cd $PBS_O_WORKDIR

. /data/evassvis/software/anaconda3/etc/profile.d/conda.sh
conda activate generax_env

mpiexec -np 28 generax -f Families_CYP.txt -s species_tree.newick --unrooted-gene-tree -r UndatedDL --max-spr-radius 3 -p results_CYP

Many thanks!! Alessandra

BenoitMorel commented 1 year ago

Hi Alessandra,

I just gave it a try, because I implemented this a long time ago :-) The checkpoint is automatic, so you're not doing anything wrong. For a given step (e.g. gene_optimization_3), GeneRax won't reprocess a family that was already processed. But if a family was being processed at the time of the interruption, GeneRax will restart the computations from scratch for this family. This might be problematic for you if you have a few very large families. If that is your case, let me know and I'll try to find the time to add one additional level of checkpointing.

Best, Benoit
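The family-level behaviour described above can be illustrated with a small shell sketch. This is not GeneRax's actual code: the function names `process_family` and `run_step` and the `.done` marker files are hypothetical, chosen only to show why work on a family that was in progress at interruption time is redone while finished families are skipped.

```shell
#!/bin/bash
# Illustrative per-family checkpointing. NOT GeneRax's implementation:
# process_family, run_step and the ".done" markers are hypothetical names.

# Stand-in for the expensive per-family optimization: writes a result file.
process_family() {
    local family="$1" outdir="$2"
    echo "optimized $family" > "$outdir/$family.result"
}

# Process every family given as an argument, skipping those marked as done.
run_step() {
    local outdir="$1"
    shift
    mkdir -p "$outdir"
    local family
    for family in "$@"; do
        if [ -e "$outdir/$family.done" ]; then
            echo "skip $family (already processed)"
            continue
        fi
        # If the job is killed inside this call, no marker exists yet,
        # so the next run redoes this family from scratch.
        process_family "$family" "$outdir" || return 1
        touch "$outdir/$family.done"
        echo "done $family"
    done
}
```

Calling `run_step some_dir famA famB` twice processes both families the first time and skips both the second time. With a single very large family, this level of checkpointing saves nothing until that family finishes, which is the situation discussed in this thread.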


AAleotti commented 1 year ago

Hi Benoit!

First of all, thanks so much for your reply :) Yes exactly, what you say seems to correspond to what I have noticed. The problem for me is that I am trying to run GeneRax on a very large family (>4000 sequences), and I am running only that one family, so I cannot, for example, reduce the number of families. Therefore, if by any chance you were to find the time to add extra checkpoints for this, it would greatly help me and I would be extremely grateful. :) :)

Thanks again, All best, Alessandra

BenoitMorel commented 1 year ago

Ok. I hope I can work on this in the next two weeks. I'll let you know!


AAleotti commented 1 year ago

Thanks a lot!

BenoitMorel commented 1 year ago

Hi Alessandra,

I have added one level of checkpoint: the current state is saved after each round of SPR moves, which is every time you see this kind of log:

"SPR Search with radius 1: trying 1494 prune nodes"

Let me know if that's enough. I can also try to save the checkpoint more often, but I don't want to save them too frequently (there could be some problems if someone stops the run exactly when the checkpoint is being saved, which is quite unlikely in the current state).

You have to update generax with:

./gitpull.sh
./install.sh

Also, you might have to restart your current run from scratch... (I'm not sure, you can also try to continue your current run, it will crash if that's not compatible with the new checkpoint system :))

Just let me know how it goes, because it's not that easy to test extensively...

Best, Benoit
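Since the checkpoint is tied to the "SPR Search with radius ..." log lines, one rough way to see how far a family has progressed is to count those lines in its log. The sketch below is only a convenience helper, not part of GeneRax: `count_spr_rounds` is a hypothetical name, and the pattern is copied from the log line quoted above. It counts SPR rounds that were started, which is only an approximation of the number of checkpoints written.

```shell
#!/bin/bash
# Count how many SPR rounds were announced in a per-family log file.
# count_spr_rounds is a hypothetical helper; the pattern comes from the
# log line quoted in this thread. grep -o prints one line per match, so
# several matches on a single physical line are all counted.
count_spr_rounds() {
    local logfile="$1"
    grep -o 'SPR Search with radius [0-9][0-9]*: trying [0-9][0-9]* prune nodes' "$logfile" \
        | wc -l | tr -d '[:space:]'
}
```

Usage would look like `count_spr_rounds "$log"`, where `$log` is the path to the family's `*families_out.txt` file inside the results directory.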

AAleotti commented 1 year ago

Hi Benoit - thanks a lot for working on this!

I will give it a go and let you know - a checkpoint after each round of SPR moves sounds like it should definitely be enough in my case but will update you.

Just a quick ignorant question: I had originally downloaded generax with conda - by any chance would it work if I update to the latest version through conda? Or is it better if I re-install generax through git clone and then do the ./gitpull.sh and ./install.sh steps?

Thanks! Alessandra

BenoitMorel commented 1 year ago

That's not an ignorant question :-) No, updating with conda won't work, because the fix is not in Bioconda yet. I would rather install GeneRax with git. It should work quite well if you follow the instructions from the readme, but I'd be happy to help if you encounter any issues.


AAleotti commented 1 year ago

Hi Benoit, apologies for the late reply.

Here are my updates:

I am now using a clone of the latest generax instead of conda.

I restarted GeneRax for the first time in the same directory where it was running previously. It restarted the latest optimization step from scratch, as you had mentioned might happen (previous optimizations that had finished were saved). Then I interrupted the run after it had gone through the first few SPR searches and started it again, and from my understanding of the output file it is indeed starting from a saved checkpoint.

Here is how the *families_out.txt file looked before I interrupted the run:

mpi-scheduler optimizeGeneTrees results_CYP/results/CYP_families/geneTree.newick NONE CYP.fasta.cdhit.mafft.trimal results_CYP/species_trees/inferred_species_tree.newick LG+F+R10 results_CYP/gene_optimization_4/dtl_rates.txt 0 0 1 UNIFORM 0 0 0 0 PARENTS 0 0 0.000001 NONE 0 2 -1 1 1 1 5 results_CYP/results/CYP_families/geneTree.newick results_CYP/results/CYP_families/stats.txt 0 results_CYP/gene_optimization_4/checkpoints/CYP_families
[00:02:53] Starting optimizeGeneTreesSlave
LibpllModel LG+F+R10
[00:02:53] Starting optimizing gene tree
Number of ranks 19
(0.711408, 0.779363, score = 0)
Taxa number: 4499
joint: -1.97192e+06 libpll: -1.95907e+06 reconciliation: -12850.8
Initial ll = -1.97192e+06
[05:13:47] SPR Search with radius 5: trying 13491 prune nodes
[49:29:21] Found 12 potential better moves
[49:29:21] GeneRax will now test and apply all the 12 potential good moves one after each other
Applying move, ll = -1.97191e+06, 5206 5387
Applying move, ll = -1.97191e+06, 7536 1396
Applying move, ll = -1.97191e+06, 10678 2070
Applying move, ll = -1.9719e+06, 11367 11423
Applying move, ll = -1.9719e+06, 12214 12017
Applying move, ll = -1.97189e+06, 13635 13520
Applying move, ll = -1.97189e+06, 13953 3211
Applying move, ll = -1.97189e+06, 14383 14348
Applying move, ll = -1.97189e+06, 15172 15320
Applying move, ll = -1.97189e+06, 15672 17609
Applying move, ll = -1.97189e+06, 17716 17987
[49:34:09] SPR Search with radius 5: trying 13491 prune nodes
[90:27:33] Found 46 potential better moves
[90:27:33] GeneRax will now test and apply all the 46 potential good moves one after each other
Applying move, ll = -1.97188e+06, 8680 14378
Applying move, ll = -1.97188e+06, 10674 2070
Applying move, ll = -1.97188e+06, 11355 11365
Applying move, ll = -1.97188e+06, 11366 11420
Worse likelihood, Move rejected
Applying move, ll = -1.97188e+06, 11423 11399
Worse likelihood, Move rejected
Applying move, ll = -1.97188e+06, 12034 12374
Applying move, ll = -1.97188e+06, 12294 12362
Applying move, ll = -1.97188e+06, 13518 13625
Applying move, ll = -1.97188e+06, 13630 13511
Worse likelihood, Move rejected
Applying move, ll = -1.97187e+06, 13733 13661
Worse likelihood, Move rejected
Worse likelihood, Move rejected
Applying move, ll = -1.97187e+06, 13761 14356
Worse likelihood, Move rejected
Applying move, ll = -1.97187e+06, 14350 3303
Applying move, ll = -1.97187e+06, 15637 15515
Applying move, ll = -1.97187e+06, 15671 17606
Applying move, ll = -1.97187e+06, 17716 17777
Applying move, ll = -1.97187e+06, 17718 4412
[90:38:06] SPR Search with radius 5: trying 13491 prune nodes

And here it is now, just after restarting the run:

mpi-scheduler optimizeGeneTrees results_CYP/results/CYP_families/geneTree.newick NONE CYP.fasta.cdhit.mafft.trimal results_CYP/species_trees/inferred_species_tree.newick LG+F+R10 results_CYP/gene_optimization_4/dtl_rates.txt 0 0 1 UNIFORM 0 0 0 0 PARENTS 0 0 0.000001 NONE 0 2 -1 1 1 1 5 results_CYP/results/CYP_families/geneTree.newick results_CYP/results/CYP_families/stats.txt 0 results_CYP/gene_optimization_4/checkpoints/CYP_families
checkpoint exists
optimizeGeneTreesSlave
LibpllModel LG+F+R10
[00:02:40] Starting optimizing gene tree
Number of ranks 29
checkpoint exists
using model LG+FC+R10{0.0478958/0.150934/0.266915/0.399573/0.554533/0.740872/0.974702/1.28916/1.77339/3.0563}{0.0265746/0.0180415/0.0187476/0.0716185/0.123774/0.181127/0.163649/0.223024/0.169587/0.00385699}
(0.711408, 0.779363, score = 0)
Taxa number: 4499
Checkpoint exists, skipping parameter optimization and starting from last best gene tree
joint: -1.97187e+06 libpll: -1.95905e+06 reconciliation: -12818.9
Initial ll = -1.97187e+06
[00:03:33] SPR Search with radius 5: trying 13491 prune nodes

Is this how it should look?

Thanks! Alessandra

BenoitMorel commented 1 year ago

Yes exactly! It restarts from a tree with the same likelihood so that looks great :-)
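The check described here (same likelihood before the interruption and after the restart) can be sketched in shell by comparing two saved copies of the log. The helper names `last_applied_ll`, `initial_ll` and `same_ll` are hypothetical, and the patterns are taken from the log excerpts in this thread. The values are compared as printed, so a match only means agreement to the precision GeneRax writes to the log.

```shell
#!/bin/bash
# Compare the last likelihood logged before an interruption with the
# "Initial ll" logged after the restart. Hypothetical helpers; patterns
# are copied from the log lines shown in this thread.

# Last "Applying move, ll = X" value found in a log file.
last_applied_ll() {
    grep -o 'Applying move, ll = [^,]*' "$1" | tail -n 1 | sed 's/.*ll = //'
}

# Last "Initial ll = X" value found in a log file.
initial_ll() {
    grep -o 'Initial ll = [^ ]*' "$1" | tail -n 1 | sed 's/.*ll = //'
}

# Print both values; succeed only if they are non-empty and identical strings.
same_ll() {
    local before after
    before=$(last_applied_ll "$1")
    after=$(initial_ll "$2")
    echo "before=$before after=$after"
    [ -n "$before" ] && [ "$before" = "$after" ]
}
```

With the two logs posted above saved to separate files, `same_ll before.txt after.txt` would print `before=-1.97187e+06 after=-1.97187e+06` and return success. The file names `before.txt` and `after.txt` are placeholders.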


AAleotti commented 1 year ago

Perfect!

Thanks a lot for solving this issue!

Have a great day :)