davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

Inferring remaining multiple sequence alignments and gene trees: this step has been stuck for ten days #921

Open YuJiang0121 opened 3 months ago

YuJiang0121 commented 3 months ago

Hi David, I'm using OrthoFinder to analyze 19 species, and the run is stuck. The supercomputer I rented has been running for about twelve days and still hasn't finished the 19 species. The command I used is: "orthofinder -f Dataset -M msa -S diamond -T iqtree -t 64 -a 1 > orthofinder.out 2> orthofinder.err". No errors were reported during the run. How can I speed up the analysis?

My analysis progress log is as follows:

OrthoFinder version 2.5.5 Copyright (C) 2014 David Emms

2024-07-28 16:20:56 : Starting OrthoFinder 2.5.5
64 thread(s) for highly parallel tasks (BLAST searches etc.)
1 thread(s) for OrthoFinder algorithm

Checking required programs are installed

Test can run "mcl -h" - ok
Test can run "mafft" - ok
Test can run "iqtree" - ok

Dividing up work for BLAST for parallel processing

2024-07-28 16:21:05 : Creating diamond database 1 of 19
2024-07-28 16:21:05 : Creating diamond database 2 of 19
2024-07-28 16:21:05 : Creating diamond database 3 of 19
2024-07-28 16:21:06 : Creating diamond database 4 of 19
2024-07-28 16:21:06 : Creating diamond database 5 of 19
2024-07-28 16:21:06 : Creating diamond database 6 of 19
2024-07-28 16:21:06 : Creating diamond database 7 of 19
2024-07-28 16:21:06 : Creating diamond database 8 of 19
2024-07-28 16:21:06 : Creating diamond database 9 of 19
2024-07-28 16:21:07 : Creating diamond database 10 of 19
2024-07-28 16:21:07 : Creating diamond database 11 of 19
2024-07-28 16:21:07 : Creating diamond database 12 of 19
2024-07-28 16:21:07 : Creating diamond database 13 of 19
2024-07-28 16:21:07 : Creating diamond database 14 of 19
2024-07-28 16:21:07 : Creating diamond database 15 of 19
2024-07-28 16:21:07 : Creating diamond database 16 of 19
2024-07-28 16:21:08 : Creating diamond database 17 of 19
2024-07-28 16:21:08 : Creating diamond database 18 of 19
2024-07-28 16:21:08 : Creating diamond database 19 of 19

Running diamond all-versus-all

Using 64 thread(s)
2024-07-28 16:21:08 : This may take some time....
2024-07-28 16:21:08 : Done 0 of 361
2024-07-28 16:23:32 : Done 100 of 361
2024-07-28 16:25:05 : Done 200 of 361
2024-07-28 16:27:30 : Done all-versus-all sequence search

Running OrthoFinder algorithm

2024-07-28 16:27:30 : Initial processing of each species
2024-07-28 16:27:48 : Initial processing of species 0 complete
2024-07-28 16:28:04 : Initial processing of species 1 complete
2024-07-28 16:28:15 : Initial processing of species 2 complete
2024-07-28 16:28:28 : Initial processing of species 3 complete
2024-07-28 16:28:47 : Initial processing of species 4 complete
2024-07-28 16:29:05 : Initial processing of species 5 complete
2024-07-28 16:29:18 : Initial processing of species 6 complete
2024-07-28 16:29:34 : Initial processing of species 7 complete
2024-07-28 16:29:48 : Initial processing of species 8 complete
2024-07-28 16:30:08 : Initial processing of species 9 complete
2024-07-28 16:30:28 : Initial processing of species 10 complete
2024-07-28 16:30:44 : Initial processing of species 11 complete
2024-07-28 16:31:03 : Initial processing of species 12 complete
2024-07-28 16:31:18 : Initial processing of species 13 complete
2024-07-28 16:31:30 : Initial processing of species 14 complete
2024-07-28 16:31:45 : Initial processing of species 15 complete
2024-07-28 16:31:57 : Initial processing of species 16 complete
2024-07-28 16:32:21 : Initial processing of species 17 complete
2024-07-28 16:32:38 : Initial processing of species 18 complete
2024-07-28 16:33:09 : Connected putative homologues
2024-07-28 16:33:12 : Written final scores for species 0 to graph file
2024-07-28 16:33:17 : Written final scores for species 1 to graph file
2024-07-28 16:33:20 : Written final scores for species 2 to graph file
2024-07-28 16:33:23 : Written final scores for species 3 to graph file
2024-07-28 16:33:28 : Written final scores for species 4 to graph file
2024-07-28 16:33:32 : Written final scores for species 5 to graph file
2024-07-28 16:33:36 : Written final scores for species 6 to graph file
2024-07-28 16:33:40 : Written final scores for species 7 to graph file
2024-07-28 16:33:44 : Written final scores for species 8 to graph file
2024-07-28 16:33:49 : Written final scores for species 9 to graph file
2024-07-28 16:33:54 : Written final scores for species 10 to graph file
2024-07-28 16:33:58 : Written final scores for species 11 to graph file
2024-07-28 16:34:03 : Written final scores for species 12 to graph file
2024-07-28 16:34:07 : Written final scores for species 13 to graph file
2024-07-28 16:34:10 : Written final scores for species 14 to graph file
2024-07-28 16:34:15 : Written final scores for species 15 to graph file
2024-07-28 16:34:18 : Written final scores for species 16 to graph file
2024-07-28 16:34:24 : Written final scores for species 17 to graph file
2024-07-28 16:34:28 : Written final scores for species 18 to graph file
2024-07-28 16:38:46 : Ran MCL

Writing orthogroups to file

OrthoFinder assigned 253909 genes (90.6% of total) to 20994 orthogroups. Fifty percent of all genes were in orthogroups with 19 or more genes (G50 was 19) and were contained in the largest 4440 orthogroups (O50 was 4440). There were 2949 orthogroups with all species present and 728 of these consisted entirely of single-copy genes.

2024-07-28 16:40:04 : Done orthogroups

Analysing Orthogroups

2024-07-28 16:40:04 : Starting MSA/Trees
Species tree: Using 1998 orthogroups with minimum of 94.7% of species having single-copy genes in any orthogroup

Inferring multiple sequence alignments for species tree

2024-07-28 16:41:20 : Done 0 of 1998
2024-07-28 16:47:42 : Done 100 of 1998
2024-07-28 16:52:48 : Done 200 of 1998
2024-07-28 16:57:26 : Done 300 of 1998
2024-07-28 17:02:55 : Done 400 of 1998
2024-07-28 17:08:46 : Done 500 of 1998
2024-07-28 17:13:56 : Done 600 of 1998
2024-07-28 17:19:06 : Done 700 of 1998
2024-07-28 17:24:12 : Done 800 of 1998
2024-07-28 17:28:47 : Done 900 of 1998
2024-07-28 17:33:15 : Done 1000 of 1998
2024-07-28 17:37:29 : Done 1100 of 1998
2024-07-28 17:41:08 : Done 1200 of 1998
2024-07-28 17:45:32 : Done 1300 of 1998
2024-07-28 17:50:04 : Done 1400 of 1998
2024-07-28 17:54:30 : Done 1500 of 1998
2024-07-28 17:58:40 : Done 1600 of 1998
2024-07-28 18:02:51 : Done 1700 of 1998
2024-07-28 18:06:21 : Done 1800 of 1998
2024-07-28 18:09:34 : Done 1900 of 1998

Inferring remaining multiple sequence alignments and gene trees

2024-07-28 18:21:09 : Done 0 of 18997
2024-07-29 16:11:05 : Done 1000 of 18997
2024-07-29 19:02:22 : Done 2000 of 18997
2024-07-29 20:43:32 : Done 3000 of 18997
2024-07-29 22:03:48 : Done 4000 of 18997
2024-07-29 23:04:07 : Done 5000 of 18997
2024-07-29 23:45:59 : Done 6000 of 18997
2024-07-30 00:10:57 : Done 7000 of 18997
2024-07-30 00:23:25 : Done 8000 of 18997
2024-07-30 00:30:18 : Done 9000 of 18997
2024-07-30 00:34:55 : Done 10000 of 18997
2024-07-30 00:37:31 : Done 11000 of 18997
2024-07-30 00:39:25 : Done 12000 of 18997
2024-07-30 00:40:23 : Done 13000 of 18997
2024-07-30 00:41:17 : Done 14000 of 18997
2024-07-30 00:41:40 : Done 15000 of 18997
2024-07-30 00:41:48 : Done 16000 of 18997
2024-07-30 00:41:57 : Done 17000 of 18997
2024-07-30 00:42:06 : Done 18000 of 18997

Using the "top" command, I noticed that iqtree has been using only a single CPU for seven or eight days. I requested 64 cores on this supercomputer, but only one core is busy, which wastes a lot of resources. Should I interrupt the run or let it continue? Could you please give me some advice?
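To see where the time goes in a log like the one above, the "Done N of M" checkpoints can be turned into elapsed intervals. The following is only a minimal sketch: the helper names `checkpoints` and `intervals` are hypothetical, and the `LOG` string contains just a few entries copied from the log in this post.

```python
import re
from datetime import datetime

# A few progress entries copied from the log above (not the full log).
LOG = """\
2024-07-28 18:21:09 : Done 0 of 18997
2024-07-29 16:11:05 : Done 1000 of 18997
2024-07-29 19:02:22 : Done 2000 of 18997
2024-07-30 00:42:06 : Done 18000 of 18997
"""

# Matches "<timestamp> : Done <n> of <total>" anywhere in the text,
# so it also works if the entries are run together on one line.
PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) : Done (\d+) of (\d+)")


def checkpoints(text):
    """Return (timestamp, done, total) tuples for every 'Done N of M' entry."""
    return [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), int(done), int(total))
        for ts, done, total in PATTERN.findall(text)
    ]


def intervals(points):
    """Yield (start_count, end_count, elapsed_seconds) for consecutive checkpoints."""
    for (t0, n0, _), (t1, n1, _) in zip(points, points[1:]):
        yield n0, n1, (t1 - t0).total_seconds()


if __name__ == "__main__":
    for n0, n1, secs in intervals(checkpoints(LOG)):
        print(f"{n0:>6} -> {n1:>6}: {secs / 3600:.2f} h")
```

On these entries the first 1000 orthogroups account for roughly 21.8 hours, while later blocks of 1000 finish in minutes or seconds, so the per-orthogroup cost is very uneven across the run.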

Thanks, YuJiang

lauriebelch commented 3 months ago

Hi YuJiang,

You can definitely try increasing the value of the -a option. From the documentation:

'-a number_of_orthofinder_threads' In addition to the above, all of the critical internal steps of the OrthoFinder algorithm have been parallelised. The number of threads for these steps is controlled using the '-a' option. These steps typically have larger RAM requirements and so using a value 4-8x smaller than that used for the '-t' option is usually a good choice.
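As a worked example of the "4-8x smaller" guidance quoted above, the arithmetic can be sketched in a few lines. The function name `analysis_thread_range` is hypothetical and not part of OrthoFinder; it only divides the `-t` value by 8 and by 4.

```python
def analysis_thread_range(search_threads):
    """Return (low, high) candidate values for -a: the -t value divided by 8 and by 4.

    Both values are clamped to at least 1 so small -t values still give a usable answer.
    """
    low = max(1, search_threads // 8)
    high = max(1, search_threads // 4)
    return low, high


if __name__ == "__main__":
    print(analysis_thread_range(64))  # (8, 16)
```

For `-t 64` this gives a range of 8 to 16 for `-a`; how much RAM each analysis thread needs on a given dataset still has to be checked on the machine itself.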

You could also consider an alternative to iqtree, such as raxml or fasttree.

Hope this helps!

Laurie

YuJiang0121 commented 3 months ago

Hi Laurie, I tried changing the -a option, running with "-t 64 -a 4", but the result was the same: the task was stuck for seven days. After eight days of it being stuck, I checked the software documentation, which explains that the value of -a can be 4-8 times smaller than -t, or that "-1" can be specified. The task is still running now, and it is very slow!

Looking forward to your reply. Thank you very much. Best!

lauriebelch commented 3 months ago

Hi YuJiang,

Setting -a to 1 will make it run much slower than if you didn't set the -a option at all (with -t 64, it would default to -a 8). I would recommend trying 16.

iqtree can also be quite slow. I would honestly consider just using fasttree.

"you should be careful using any other tree inference programs, such as IQTREE or RAxML, since inferring the gene trees for the complete set of orthogroups using anything that is not as quick as FastTree will require significant computational resources/time."
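Putting the two suggestions in this reply together (a larger `-a` and a faster tree program), a revised invocation could be assembled as below. This is a sketch, not a tested run: the flags `-f`, `-M`, `-S`, `-T`, `-t` and `-a` are the ones used in the original command, while the value `fasttree` for `-T` and the helper name `orthofinder_command` are assumptions that should be checked against `orthofinder -h` for the installed version.

```python
import shlex


def orthofinder_command(dataset, search_threads, analysis_threads, tree_program="fasttree"):
    """Build the argument list for an OrthoFinder run (hypothetical helper).

    The tree program name passed to -T is an assumption; verify it with `orthofinder -h`.
    """
    return [
        "orthofinder",
        "-f", dataset,
        "-M", "msa",
        "-S", "diamond",
        "-T", tree_program,
        "-t", str(search_threads),
        "-a", str(analysis_threads),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(orthofinder_command("Dataset", 64, 16)))
```

Building the command as a list keeps each argument separate, which avoids quoting problems if the dataset path contains spaces.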

Thanks,

Laurie

Neato-Nick commented 1 month ago

@lauriebelch do you know if it's safe to increase the number of threads for phylogeny estimation? My hesitation is that I don't know whether OrthoFinder already estimates gene phylogenies in parallel, in which case any increase in the threads allotted to raxml / iqtree would be multiplicative.

In the config file I see the entry for raxml-ng, where I could edit this, so I'm mostly asking if it is a bad idea to do so.

    "raxml-ng":{
    "program_type": "tree",
    "cmd_line": "raxml-ng --msa INPUT --model LG+G4 --seed 12345 --threads 1",
    "ouput_filename": "INPUT.raxml.bestTree"
    },

If it's safe to do this, a "feature request" I'd have is to multi-thread the phylogeny estimation according to the number of analysis threads set with -a. At the very least I'd like to do that for the species trees, which have many thousands of sites.
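The multiplicative concern can be made concrete with a small sketch. It assumes, without confirmation from this thread, that some number of tree jobs run at the same time; `parallel_jobs` is a hypothetical input, and `threads_per_tree` and `set_raxml_threads` are illustrative helpers, not OrthoFinder functions. The JSON excerpt mirrors the raxml-ng entry quoted above, with its key names copied verbatim.

```python
import json
import re

# Excerpt mirroring the raxml-ng config entry quoted above (keys copied verbatim).
CONFIG_TEXT = """
{
  "raxml-ng": {
    "program_type": "tree",
    "cmd_line": "raxml-ng --msa INPUT --model LG+G4 --seed 12345 --threads 1",
    "ouput_filename": "INPUT.raxml.bestTree"
  }
}
"""


def threads_per_tree(total_cores, parallel_jobs):
    """Largest per-job thread count such that parallel_jobs * threads <= total_cores.

    parallel_jobs is an assumption: how many tree inferences run concurrently.
    """
    return max(1, total_cores // max(1, parallel_jobs))


def set_raxml_threads(config, n_threads):
    """Return a copy of config with the '--threads N' value in cmd_line replaced."""
    entry = dict(config["raxml-ng"])
    entry["cmd_line"] = re.sub(
        r"--threads\s+\S+", f"--threads {n_threads}", entry["cmd_line"]
    )
    return {**config, "raxml-ng": entry}


if __name__ == "__main__":
    config = json.loads(CONFIG_TEXT)
    per_tree = threads_per_tree(total_cores=64, parallel_jobs=16)
    print(per_tree)
    print(set_raxml_threads(config, per_tree)["raxml-ng"]["cmd_line"])
```

With 64 cores and 16 concurrent jobs the budget is 4 threads per tree; if the jobs are not actually run in parallel, the same arithmetic with `parallel_jobs=1` gives the whole machine to a single tree.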