davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

Orthogroup not split at duplication node #820

Open ens-sb opened 1 year ago

ens-sb commented 1 year ago

Hi David,

I have run OrthoFinder 2.5.4 on the example data with the -y option and otherwise default parameters. The HOGs in the file Phylogenetic_Hierarchical_Orthogroups/N0.tsv for the orthogroup OG0000002 look like this:

N0.HOG0000002   OG0000002       n1      gi|290752801|emb|CBH40776.1|    gi|31541628|gb|AAP56928.1|      gi|3844776|gb|AAC71399.1|       gi|71851739|gb|AAZ44347.1|
N0.HOG0000003   OG0000002       n6      gi|290752802|emb|CBH40777.1|    gi|31541629|gb|AAP56929.1|      gi|3844775|gb|AAC71398.1|, gi|84626133|gb|AAC71526.2|   gi|71851738|gb|AAZ44346.1|
N0.HOG0000004   OG0000002       n11             gi|284811859|gb|AAP56391.2|     gi|3845003|gb|AAC71638.1|
N0.HOG0000005   OG0000002       n13     gi|290752894|emb|CBH40869.1|, gi|290752592|emb|CBH40564.1|      gi|284812058|gb|AAP56715.2|     gi|1045683|gb|AAC71230.1|
N0.HOG0000006   OG0000002       n17     gi|290752526|emb|CBH40498.1|    gi|31541268|gb|AAP56569.1|      gi|1045987|gb|AAC71511.1|       gi|71851837|gb|AAZ44445.1|
N0.HOG0000007   OG0000002       n20     gi|290753081|emb|CBH41057.1|    gi|31541070|gb|AAP56372.1|      gi|3845062|gb|AAC72487.1|, gi|3845064|gb|AAC72489.1|, gi|1045740|gb|AAC71283.1| gi|144227640|gb|AAZ44535.2|

In this output I noticed that the gene tree parent clade of HOG N0.HOG0000007 is a duplication node (n20). Can you shed light on why this HOG was not split by the -y option?

Many thanks, Botond

davidemms commented 1 year ago

Hi Botond

I had to check the code here to remind myself how it works. Basically, when a duplication is observed, OrthoFinder requires evidence that both child nodes correspond to the N0 species tree node in order to split it into two HOGs at the N0 level. Since one of the children here is just a single gene, it doesn't get split off.
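The rule described above can be illustrated with a toy model. This is only a sketch and not OrthoFinder's actual code: the function names are hypothetical, and it assumes that a child clade "corresponds to N0" when its genes come from species on both sides of the root split of the species tree.

```python
def maps_to_root(clade_species, root_left, root_right):
    """Toy criterion (an assumption): a clade corresponds to the species-tree
    root N0 if it contains species from both sides of the root split."""
    species = set(clade_species)
    return bool(species & root_left) and bool(species & root_right)


def split_at_duplication(child_a, child_b, root_left, root_right):
    """Split a duplication node into two N0-level HOGs only if BOTH child
    clades independently correspond to N0."""
    return (maps_to_root(child_a, root_left, root_right)
            and maps_to_root(child_b, root_left, root_right))


# Hypothetical four-species tree whose root separates {A, B} from {C, D}.
ROOT_LEFT = {"A", "B"}
ROOT_RIGHT = {"C", "D"}

# One child spans all species, the other is a single gene from species C:
# under this rule the single gene is not split off.
single_gene_case = split_at_duplication(
    ["A", "B", "C", "D"], ["C"], ROOT_LEFT, ROOT_RIGHT)

# Both children span the root: the node would be split into two HOGs.
both_span_case = split_at_duplication(
    ["A", "C"], ["B", "D"], ROOT_LEFT, ROOT_RIGHT)
```

Under this model, a duplication where one child is a lone gene never produces a split, which matches the behaviour seen for N0.HOG0000007.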

My thinking when I developed this was, I think, that it was a convenience to users -- they probably didn't want single genes split off from orthogroups when those genes could have ended up there just because of tree inference inaccuracies, although technically, if we believe the tree, we should be splitting. How do you feel about that, and is the behaviour a problem for your use case?

The "-y" option corresponds to a slightly different situation, but I think it might be appropriate here. I will be creating a new major-version release soon and could potentially include this case under the "-y" option.

Best wishes David

ens-sb commented 1 year ago

Hi David,

Thank you very much for the quick response! Our use case is that we would like to parse out the pairwise orthology relationships from the HOG TSV files. For this, it would indeed be less confusing if the orthogroup were split by the -y option in cases like the one discussed above, so that we do not end up with pairs of "orthologous" genes whose common ancestor in the gene tree is a duplication node.
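For illustration, the parsing we have in mind looks roughly like the sketch below. It assumes the N0.tsv layout is three metadata columns (HOG ID, orthogroup ID, gene tree parent clade) followed by one column per species, with multiple genes in a cell separated by commas; the function name and the sample gene and species names are made up for the example.

```python
import csv
import io
from itertools import combinations


def cross_species_pairs(tsv_text, n_meta_cols=3):
    """Yield (hog_id, species_a, gene_a, species_b, gene_b) for every pair of
    genes from different species that share a HOG row.

    Assumes a header row, n_meta_cols leading metadata columns, and
    comma-separated genes within each species cell.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    species = header[n_meta_cols:]
    for row in reader:
        hog_id = row[0]
        genes_by_species = []
        for name, cell in zip(species, row[n_meta_cols:]):
            genes = [g.strip() for g in cell.split(",") if g.strip()]
            genes_by_species.append((name, genes))
        # Pair genes only across species, never within one species.
        for (sp_a, genes_a), (sp_b, genes_b) in combinations(genes_by_species, 2):
            for gene_a in genes_a:
                for gene_b in genes_b:
                    yield hog_id, sp_a, gene_a, sp_b, gene_b


# Made-up two-row example; the third species has no gene in this HOG.
sample = (
    "HOG\tOG\tGene Tree Parent Clade\tsp1\tsp2\tsp3\n"
    "N0.HOG1\tOG1\tn1\ta1\tb1, b2\t\n"
)
pairs = list(cross_species_pairs(sample))
```

Note that a parser like this treats every cross-species pair in a row as orthologous, which is exactly why an unsplit duplication node in a HOG is a problem for us: the TSV alone does not reveal which pairs are separated by the duplication.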

It would be nice if you could include this in the next release of OrthoFinder. I also have a couple of other requests for that release, which I will submit as separate issues for your consideration.

Many thanks, Botond