Similarities between orthologue sequences

dariober commented 4 years ago

Hello - I would like to obtain a similarity metric for each pair of orthologue proteins detected by OrthoFinder or, even better, all the pairwise similarities in the input set of sequences. Is this data somewhere in the OrthoFinder output?

Basically, I'm looking for a score that tells me how confidently two sequences can be considered orthologues.

Would the matrix WorkingDirectory/OrthoFinder_graph.txt be the right file to parse?

If this metric is not readily available, I would appreciate any suggestion about a reasonable way to get it.

In the meantime, many thanks for this great program! Dario

davidemms commented 4 years ago

Hi Dario

I think that perhaps such a metric comes from thinking about orthology from the point of view of methods that use blast similarity scores for orthology assignment. In reality, similarity scores aren't a very reliable method of determining orthology and that's why it's necessary to examine the gene tree.

As long as you have sufficient species sampling to avoid problems of hidden paralogy (>6-8 species) I would say that because each orthology assignment in OrthoFinder has a tree as evidence for orthology you should have a fairly high confidence that genes identified as orthologs really are orthologs. I think the place where you might have lower confidence is in the other direction: genes that aren't identified as orthologs but might actually be orthologs.

With that in mind though, if you wanted to develop a confidence score I'd suggest there are two possibilities. The best option would be a tree based metric, perhaps looking at the support values in the tree that the orthologs come from. Alternatively, a score based on pairwise similarities could be interesting. The score would be less reliable than the original ortholog assignments based on the trees, but it would provide an alternative viewpoint.
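As a rough illustration of the pairwise-similarity option, here is a minimal sketch that computes percent identity between two sequences that have already been aligned (for example, two rows taken from the same multiple sequence alignment). The function name `percent_identity` and the choice to ignore gapped columns are assumptions made for this example, not something OrthoFinder computes or reports.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Assumes both strings are rows of the same alignment (equal length).
    Ignoring gapped columns is one convention among several.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    # Return 0.0 when there is no ungapped column to compare
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned sequences (made up for illustration)
print(percent_identity("MKT-LV", "MKSALV"))
```

As noted above, a score like this would be less reliable than the tree-based ortholog assignments; it only offers a second viewpoint.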

All the best David

dariober commented 4 years ago

Hi David - Thanks a lot for that!

"looking at support values in the tree that the orthologs come from"

Sorry... where can I get these support values?

davidemms commented 4 years ago

The orthologs files list the gene tree that the genes come from. If you look at the corresponding tree in the "Gene_Trees/" directory (rather than the Resolved_Gene_Trees directory) then you can see the support values. This assumes you use either the defaults for tree inference or you use "-M msa" with default (fasttree) tree inference. If you specify your own choice of tree inference program using "-T ..." then you'd need to make sure this gives you support values.

All the best David

dariober commented 4 years ago

Thanks for your patience...! Here's the output of one of the files in Gene_Trees as an example. Are the support values just the number after each colon (:)? (I assumed that is the branch length)

This is with orthofinder 2.4.0 with defaults.

cat Gene_Trees/OG0005824_tree.txt

((CveliaCCMP2878_Cvel_15175.t1-p1:0.561646,VbrassicaformisCCMP3155_Vbra_23085.t1-p1:0.689225):0.0760533,(CcayetanensisCHN_HEN01_cyc_04586-t31_1-p1:0.878521,((((HhammondiHH34_HHA_288850-t26_1-p1:0.012279,TgondiiGT1_TGGT1_288850-t26_1-p1:0.014348):0.224667,(SneuronaSN3_SN3_02900210-mRNA-1-p1:0.54163,NcaninumLIV_NCLIV_041260-t26_1-p1:0.212119):0.0255484):0.0831396,CsuisWienI_CSUI_010279-t36_1-p1:0.509704):0.174971,((((EbrunettiHoughton_EBH_0002720-t26_1-p1:0.0325406,EpraecoxHoughton_EPH_0051460-t26_1-p1:0.207264):0.0213542,(EmitisHoughton_EMH_0022820-t26_1-p1:0.0209922,EacervulinaHoughton_EAH_00020700-t26_1-p1:0.0306458):0.0133518):0.0364865,(EtenellaHoughton_ETH_00010540-t26_1-p1:0,EnecatrixHoughton_ENH_00066310.1-p1:0):0.0716972):0.0463027,(EfalciformisBH_EfaB_MINUS_23210.g2012.t1-p1:0.195127,EmaximaWeybridge_EMWEY_00002240-t26_1-p1:0.357334):0.0162796):0.301135):0.191798):0.0760533);

davidemms commented 4 years ago

Sorry, my mistake: the default doesn't give support values. You'd actually have to run with "-M msa"; this will use FastTree to infer the gene trees from multiple sequence alignments and will give you Shimodaira-Hasegawa support values. They'll be the numbers immediately after the closing brackets and before the colon. You can rerun your analysis starting after the orthogroups stage using the -fg (--from-groups) option:

orthofinder -fg PREVIOUS_RESULTS_DIR -M msa
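To make the position of the support values concrete, here is a minimal sketch that pulls out every number sitting between a closing bracket and a colon in a Newick string. It assumes unquoted labels and numeric support values; the helper name `support_values` and the toy trees are made up for this example.

```python
import re

# A number directly after ")" and directly before ":" is an internal-node
# label, which in a tree with support values is the support for that branch.
SUPPORT_RE = re.compile(r"\)(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?):")


def support_values(newick):
    """Return the support values found in a Newick string, in file order."""
    return [float(value) for value in SUPPORT_RE.findall(newick)]


# Toy tree with support values (invented for illustration)
with_support = "((A:0.1,B:0.2)0.95:0.05,(C:0.3,D:0.4)0.80:0.05);"
# Toy tree without them: each ")" is followed directly by ":"
without_support = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.05);"

print(support_values(with_support))     # two values
print(support_values(without_support))  # empty list
```

Run on the default-settings tree posted above, this would return an empty list, because every closing bracket there is followed directly by a colon: the numbers after the colons are branch lengths.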

It will probably still be a difficult question to algorithmically decide which support values are relevant in deciding the level of confidence in a pair of orthologs, and how to quantify the support given these support values. The considerations are complex for deciding which bipartitions in the tree are relevant in any particular case.
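Purely as a starting point, here is a sketch of one simple and fairly arbitrary choice: take the internal branches on the path connecting the two genes and report their lowest support. This is not something OrthoFinder does. The names `Node`, `parse_newick`, `path_supports` and `min_path_support` are hypothetical, and the parser assumes unquoted Newick in which every internal-node label, if present, is a numeric support value.

```python
import re


class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.label = ""     # gene name on leaves, support on internal nodes
        self.length = None  # branch length, if given


def parse_newick(text):
    """Parse an unquoted Newick string into a tree of Node objects."""
    root = node = Node()
    for token in re.findall(r"[(),;]|[^(),;]+", text.strip()):
        if token == "(":
            child = Node(node)
            node.children.append(child)
            node = child
        elif token == ",":
            sibling = Node(node.parent)
            node.parent.children.append(sibling)
            node = sibling
        elif token == ")":
            node = node.parent
        elif token == ";":
            break
        else:  # "label:length", either part may be missing
            label, _, length = token.partition(":")
            node.label = label.strip()
            node.length = float(length) if length.strip() else None
    return root


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def path_supports(root, gene_a, gene_b):
    """Supports of the internal branches on the path between two leaves."""
    leaves = {n.label: n for n in iter_nodes(root) if not n.children}
    a, b = leaves[gene_a], leaves[gene_b]
    ancestors_a = []
    n = a.parent
    while n is not None:
        ancestors_a.append(n)
        n = n.parent
    seen = {id(x) for x in ancestors_a}
    side_b = []
    n = b.parent
    while id(n) not in seen:  # climb from b to the last common ancestor
        side_b.append(n)
        n = n.parent
    side_a = ancestors_a[:ancestors_a.index(n)]
    # A node's label is the support of the branch above that node, so the
    # common ancestor's own label is not on the path and is left out.
    return [float(x.label) for x in side_a + side_b if x.label]


def min_path_support(root, gene_a, gene_b):
    values = path_supports(root, gene_a, gene_b)
    return min(values) if values else None  # None: no supported branch between them


# Toy tree (invented for illustration)
tree = parse_newick("(((A:0.1,B:0.1)0.9:0.1,C:0.2)0.6:0.1,(D:0.1,E:0.1)0.99:0.1);")
print(min_path_support(tree, "A", "D"))
```

Two genes that are direct sisters have no internal branch between them, so this returns `None` for them; whether the branch above their common ancestor should count instead is exactly the kind of judgment call described above.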

All the best David

sunjiahe-hub commented 1 year ago

@dariober @davidemms I have a similar question: what does the '/WorkingDirectory/OrthoFinder_graph.txt' file contain? Does the number before the colon represent a protein ID and the number after the colon represent a score?

I look forward to hearing from you, thanks!

davidemms / OrthoFinder

Similarities between orthologue sequences #435