davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0
673 stars 186 forks source link

Interrupted Orthofinder in Gene tree construction step #467

Open Tdanis opened 3 years ago

Tdanis commented 3 years ago

Hello again,

I am running 90 days now Orthofinder for 30 species. Yesterday my job killed in the gene constuction step (raxm-ng). So, is there any solution to rerun orthofinder again from this step to avoid the reculculation of all the current Gene Trees? From ~25000 orthogroups calculated 4030 Trees.

And one more question, how many orthogroups can we remove from the start of the Orthogroups.GeneCount.tsv in order to speed up the calculation of the Species Tree?

Thank you in advance

davidemms commented 3 years ago

Hi

Another user gives a good method for doing this: https://github.com/davidemms/OrthoFinder/issues/461

I may have misunderstood the second question, but deleting orthogroups from the start of Orthogroups.GeneCount.tsv won't affect the species tree inference. If, however, you have a way of inferring the species tree faster by only using a smaller subset of alignments then you could do that yourself and then provide the species tree as input using the -s option rather than have OrthoFinder infer it. E.g. by using a smaller set of orthogroups for the concatenated alignment.

All the best David

Tdanis commented 3 years ago

Thank you David Emms.

Yes, the second part of my issue isn't so clear. So yes , how can we run orthofinder for a subset of orthogroups? I am asking because I do not have any alternative method. If we remove the orthofroups that we do not want to our analysis from Orthogroups.GeneCount.tsv, Orthogroups.txt and Orthogroups.tsv and rerun the orthofinder with -fg /home/Results/ option? is it correct?

Thank you again

davidemms commented 3 years ago

Hi

Another user has just asked a similar question which I gave some advice for here: https://github.com/davidemms/OrthoFinder/issues/465 . But I don't think hacking it this way and also in the way described in issue #461 would work together. The problem is really from using raxml-ng, which is too slow for this kind of analysis.

I think a better work-around which would let you use raxml-ng for the orthogroups you're interested in would be to use a fast method to perform a complete OrthoFinder MSA run and then use raxml-ng to replace the trees for whichever orthogroups you want:

All the best David

sknaack commented 3 years ago

Dear Dr. Emms,

First, let me say I'm very appreciative of the work that has gone into OrthoFinder2 to make gene orthology (especially gene tree) analysis computationally tractable. I'm finding it to be a fantastic improvement from previous efforts with varying software tools.

I'm doing an analysis of 3.6 Million selected protein sequences for 93 species and also wondering how best to constrain the set of ortho-groups in the gene tree/ortholog analysis of OrthoFinder 2.3.11 (as indicated with the -fg argument).

I hoped to confirm the following based on your responses to issues 461, 465 and 467:

  1. Are the trees in the WorkingDirectory/Trees_ids directory "resolved" or reconciled to the species tree? Could one convert those trees to proper sequence names with the convert_orthofinder_tree_ids.py script and consider them final? What about groups with three genes that still need to be rooted/sorted (assuming group with two are fairly trivial to write out as trees)?

  2. Is your main point that one can restart the ortholog assessment and generation of the final gene trees with the -ft argument from a directory of trees with the numerical ids (a WorkingDirectory/Trees_ids directory)? Will that be disrupted if, say, a tree file for group 0 is not found? I may need to confirm a few details of precisely how this argument configuration works (what paths should exactly be). I’ve copied a test results directory containing my trees, species tree species and sequences IDs and tried this option, but errors have been generated and I’m not sure it’s finding all needed inputs/information in the right order just yet, that may be user error.

Between these two points I hopefully have the jist of what needs to be done, but I wanted to make sure this was accurate and possibly follow up depending on your reply.

Sincerely,

Sara Knaack

P.S. To clarify, in this analysis ~78k ortho-group clusters were determined with MCL clustering using Blast all-vs-all comparison (and I=1.5), i.e., standard/default methodology. I ran the default gene tree analysis (fastme) for as many ortho-groups as possible naively using the "-fg" argument. With 15 parallel processes (using the mentioned species tree as input) the fastme gene tree analysis completed for of all but the single largest (a group with ~42k genes) in a week. The largest ortho-group ran another two weeks but ultimately there was a segfault/core dump before completion. A tree with numeric IDs is present for all ~37k ortho groups in the WorkingDirectory/Trees_ids subdirectory except the largest, but in particular neither the final lists of ortholog genes nor the final "resolved" gene trees (with species/sequence names) appeared in the main results directory, and I don't have complete knowledge of why. Due to the computational time involved (really, for the 10 largest ortho groups) I've not yet been able to confidently restart this analysis to troubleshoot without a modification to the inputs/arguments/setup.

davidemms commented 3 years ago

Hi Sara

Thanks for the comments on OrthoFinder, it's always great to hear people finding it useful!

  1. The Trees_ids/ are not reconciled, they are direct from the tree inference software and so are also unrooted. Converting them to proper sequence names would match those in Gene_Trees/ except that those in Gene_Trees have also been rooted using the rooted species tree. Groups with three genes are assumed to consist of only orthologs and within-species paralogs--this isn't guaranteed but it's the most parsimonious and the best we can do. To give a bit more detail: Arguably "(a1,a2,b1);" could be in fact correspond to "(a1,(a2,b1));" rather than "((a1,a2),b1);". In the first case a1 is a paralog of b1 whereas in the second case they are orthologs. The tree inference doesn't give us any strong information to decide between these two and so OrthoFinder doesn't bother trying to do a tree. Given the ambiguous situation, OrthoFinder picks the second interpretation since this requires one duplication whereas the first case requires both a duplication and a loss event. Chosing not to infer the tree also avoids issues if a the tree inference software being used returns an error if only three taxa are provided.

  2. Yes, that's the idea. It will be fine if some of the trees aren't present, so this is a way of skipping large/problematic trees. The ExampleDataset is great for testing out any 'exotic' workflows you might want to do as it's extremely fast to test things out on and gives you confidence that everything will work as expected.

Let me know if anything is unclear or if you have any further questions.

All the best David

sknaack commented 3 years ago

Hi David,

Thanks much for the feed back and confirmation. My next step will be to test the example data set and confirm if I can use the -ft option smoothly with such trees for that data set. Then when I'm confident that works I'll try again with our (full) data set. i'll likely get to that next week and follow up if I have any difficulty. Thanks!

Sara

sknaack commented 3 years ago

Hi David,

Quick follow up, I did successfully run the example data set in a few minutes and troubleshooted running it again using the -ft option with the largest tree/group removed. I have successfully done that and implemented it for my own (larger) analysis.

I'm getting an error (previously reported) that too many files were opened, i.e., the number of species pair ortholog gene list files is larger than my default user limit allows. I'm addressing that with my computing administrators now and I believe I'll be successful in running the default gene tree/ortholog analysis through to the final stages for all but that largest tree with this modification once that is addressed. Thanks very much!

Sara

davidemms commented 3 years ago

Hi Sara

That's good to hear, thanks for letting me know. Sorry you had an issue with the number of open files. I have looked to see if there's a way of maintaining performance while not opening so many files at once but there doesn't appear to be. Hopefully it'll be fine no you've increased the limit.

All the best David

sknaack commented 3 years ago

Thanks David!

No worries, I fully get how challenging addressing an issue like that can be. My open file limit is now expanded (thanks to fairly incredible computing resources and swift acting IT folks) and fingers crossed now it will proceed expediently. It has run stably for >30 min now on my full set of trees (minus one).

Regarding the need to have so many files open at once, a strategy to consider would be to pass the ortholog pair information to a global ("map" or map-like) object in memory while looping over trees and identifying the said pairs. That "map" could even maintain which sequence pairs belong to each species pair. Then, after the analysis is complete, write out the ortholog pair files at the end (in parallel even). It could be varyingly challenging to implement this depending how the code is organized, of course. I can mainly envision how it could be done in C/C++. Perhaps an approach to consider for v3? =-) Am being presumptuous here... Have a good day.

Best,

Sara

davidemms commented 3 years ago

Hi Sara

I agree that that's definitely an approach worth exploring. I'd been a bit nervous about the RAM requirements and the possibility that they could grow too rapidly (such a map is quadratic in the number of species whereas I've tried to keep OrthoFinder's RAM requirements linear). That said, it may be small enough that it doesn't really become an issue in practice, I should probably give it a try!

All the best David