davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

Orthofinder vs Agalma #585

Closed jasminelmah closed 3 years ago

jasminelmah commented 3 years ago

Hi David! Thanks for writing such a great program!

Previously I have been working with a different program, Agalma's treeprune function (see paper here), to identify orthologs shared across species, but I am now trying to switch to OrthoFinder. Like OrthoFinder, Agalma identifies speciation and duplication nodes on the gene tree to distinguish orthologs from paralogs. For gene trees with paralogs, Agalma then cuts the tree into maximally inclusive subtrees in which each species is represented only once (no paralogs). So, in a simple example, take a gene tree with 3 species (A, B, C) and two clades, each clade containing one paralog from each of the three species (if A1 and A2 are paralogs for species A, then e.g. A1, B1, C1 in clade 1 and A2, B2, C2 in clade 2). That gene tree would be cut into the two corresponding subtrees where each species is represented only once: subtree 1 - A1, B1, C1 and subtree 2 - A2, B2, C2. The genes in each subtree are then considered orthologs of each other and 1:1 homologs.
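The cutting step described above can be sketched roughly as follows. This is a toy illustration of the idea as described in this comment, not Agalma's actual treeprune code: the nested-tuple tree encoding, the `species_of` naming rule and the `prune` function are all assumptions made for the example.

```python
from collections import Counter

def leaves(tree):
    """Return the leaf (gene) names of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def species_of(gene):
    # Assumption: the species label is the gene name minus its trailing digits,
    # e.g. "A1" -> "A". Real gene IDs would need a proper gene-to-species map.
    return gene.rstrip("0123456789")

def prune(tree):
    """Cut a gene tree into maximally inclusive subtrees with one gene per species."""
    genes = leaves(tree)
    counts = Counter(species_of(g) for g in genes)
    if all(n == 1 for n in counts.values()):
        # No species is repeated, so this whole subtree is kept as one unit.
        return [sorted(genes)]
    # Some species is repeated: descend and try each child separately.
    subtrees = []
    for child in tree:
        subtrees.extend(prune(child))
    return subtrees

# The example from the text: two clades, each with one paralog per species.
gene_tree = ((("A1", "B1"), "C1"), (("A2", "B2"), "C2"))
subtrees = prune(gene_tree)
print(subtrees)  # [['A1', 'B1', 'C1'], ['A2', 'B2', 'C2']]
```

Because the recursion stops at the first node whose leaves have no repeated species, each returned subtree is as large as it can be without containing paralogs.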

Similar to Agalma, OrthoFinder also divides each gene tree/orthogroup into subtrees/subclades - hierarchical orthogroups (HOGs). This occurs when a duplication node maps onto a particular species node (N0, N1...) in the species tree. Unlike in Agalma, however, the species encompassed by each HOG are determined by the chosen node on the species tree - so HOGs defined by N1 consist of sequences from only those species that descend from N1 on the species tree. Furthermore, HOGs can still contain paralogs - is that a correct interpretation of the entries where more than one sequence per species is given per HOG?

I have two questions! Why do HOGs contain paralogs? Is this because these paralogs are many-to-1 orthologs for other sequences in the HOG? Second, would the N0 HOGs that contain only 1 sequence per species be equivalent to the 1:1 homologs identified by each subtree in Agalma?

Thanks for your help! Greatly appreciated

davidemms commented 3 years ago

Hi

Yes, where there is more than one gene from a species in a HOG then they are paralogs. The paralogs are implicitly defined in the OrthoFinder results files, in that all orthologs are listed in the orthologs files and any pairs of genes from the same orthogroup that are not listed as orthologs are, by implication, paralogs.
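As a rough sketch of that "by implication" rule, the snippet below takes the genes of one orthogroup and a list of ortholog pairs and returns every pair that is not listed. The gene names and the `implied_paralogs` helper are made up for illustration, and reading the real orthologs files is left out.

```python
from itertools import combinations

def implied_paralogs(orthogroup_genes, ortholog_pairs):
    """Return gene pairs from one orthogroup that are not listed as orthologs."""
    # Store pairs as frozensets so (a, b) and (b, a) count as the same pair.
    listed = {frozenset(pair) for pair in ortholog_pairs}
    return [
        (a, b)
        for a, b in combinations(sorted(orthogroup_genes), 2)
        if frozenset((a, b)) not in listed
    ]

# Hypothetical orthogroup with a duplication before the A/B speciation.
genes = ["A1", "A2", "B1", "B2"]
orthologs = [("A1", "B1"), ("A2", "B2")]
paralogs = implied_paralogs(genes, orthologs)
print(paralogs)  # [('A1', 'A2'), ('A1', 'B2'), ('A2', 'B1'), ('B1', 'B2')]
```

Note that the result includes pairs from different species (A1 and B2, A2 and B1) as well as pairs from the same species.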

The HOGs contain paralogs because the HOGs are the orthogroups defined at a particular level in the species tree, and any orthogroup will contain paralogs if there has been a gene duplication event after the root of the orthogroup, i.e. after the node in the species tree at which the HOG is defined. There's an explanation of orthogroups, and paralogs in orthogroups, here: https://github.com/davidemms/OrthoFinder#orthogroups-orthologs--paralogs. So yes, it is exactly as you say: it's because there are genes that are many-to-1 orthologs for other sequences in the HOG.

For the last question, I don't think so, from your description I think they'd only be the subset of ones in Agalma that are 1:1 in all species. I have written a tool that might be helpful to you for 1:1 sets of orthologs, but I've not yet had time to write a preprint for it before putting it online. If you send me an email I can share it with you and you can test to see if it's helpful.

Best wishes,
David

jasminelmah commented 3 years ago

Thanks for the reply David.

I think I understand HOGs more deeply now. Thanks for the explanation! That clears up a lot of things in my mind.

"I don't think so, from your description I think they'd only be the subset of ones in Agalma that are 1:1 in all species"

Were you referring to the single copy ortholog sequences? I did notice that each FASTA file in the single copy ortholog directory features a single sequence from every species, not from a subset of species. In that sense, I'd agree that only the Agalma 1:1 orthologs shared across all species would be equivalent. But what about the HOGs that contain a single sequence per species, but only from a subset of species? Would these be equivalent to subclades of the gene tree in which each species in that subset is represented by exactly one sequence?

Again, thanks for your help!

davidemms commented 3 years ago

That's a good point. Yes, I think these would be equivalent.
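For anyone wanting to pull those HOGs out, here is a minimal sketch. It assumes a tab-separated HOG table with `HOG`, `OG` and `Gene Tree Parent Clade` columns followed by one column per species, with comma-separated gene names and empty cells for absent species; check this against your own output before relying on it. The toy table and the `single_copy_hogs` helper are made up for illustration.

```python
import csv
import io

# Assumed non-species columns in the HOG table.
META_COLUMNS = {"HOG", "OG", "Gene Tree Parent Clade"}

def single_copy_hogs(tsv_handle):
    """Return HOGs with exactly one gene in every species that is present."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    result = {}
    for row in reader:
        members = {}
        for column, cell in row.items():
            if column is None or column in META_COLUMNS:
                continue
            genes = [g.strip() for g in (cell or "").split(",") if g.strip()]
            if genes:
                members[column] = genes
        # Keep HOGs spanning at least two species, with one gene in each.
        if len(members) >= 2 and all(len(g) == 1 for g in members.values()):
            result[row["HOG"]] = {sp: g[0] for sp, g in members.items()}
    return result

# Toy table in the assumed layout (not real OrthoFinder output).
toy = (
    "HOG\tOG\tGene Tree Parent Clade\tA\tB\tC\n"
    "N0.HOG0000000\tOG0000000\tn0\tA1\tB1\tC1\n"
    "N0.HOG0000001\tOG0000001\tn0\tA2, A3\tB2\tC2\n"
    "N0.HOG0000002\tOG0000002\tn0\tA4\t\tC4\n"
)
hits = single_copy_hogs(io.StringIO(toy))
print(hits)
```

In the toy table the first HOG is single copy in all three species, the second is dropped because species A has two genes, and the third is kept as a single-copy HOG over the subset A and C.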

jasminelmah commented 3 years ago

Thanks, that makes a lot of sense! Much appreciated.