davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

How Accurate are Species-Specific Orthogroups? #819

Open Cronzo opened 1 year ago

Cronzo commented 1 year ago

Hi David,

I was recently using OrthoFinder to look for species-specific orthogroups in a new genome that we've assembled. However, I noticed that several of the orthogroups classified as species-specific contained genes that were highly similar (based on BLAST) to genes from other species that were also included in the run. In other words, OrthoFinder appears to have split some gene families too finely, placing members of the same family in separate orthogroups. In another case, we see OrthoFinder splitting an enzyme family into three separate orthogroups.

This is particularly strange to me as I assumed that DIAMOND/BLAST would pick up these highly similar sequence pairs. Would this be a problem with the clustering and do you think adjusting the MCL inflation parameter might help? Alternatively, would you recommend digging into the raw BLAST output for my use case instead (looking at species-specific genes) so I can avoid dealing with the clustering algorithm altogether? I recall reading that you don't recommend adjusting the inflation value at all.

Many thanks in advance!

kullrich commented 1 year ago

Hi @Cronzo,

In your case, I would take the members of the "species-specific" orthogroups and search for them in one or more sister species of your query species.

You could even use https://github.com/lh3/miniprot to align the proteins against a sister species' genome if that species has no annotation, since sequences that are currently labelled non-coding or left unannotated might in fact encode such orphan genes or gene families.
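To make the cross-check concrete, here is a minimal sketch in Python. It assumes you have already searched the "species-specific" genes against a sister species and saved the hits in the standard 12-column BLAST/DIAMOND tabular format (outfmt 6). The function name, the thresholds and the toy gene IDs are made up for illustration and are not part of OrthoFinder.

```python
import csv
import io

# Column order of the standard BLAST/DIAMOND tabular output (outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def genes_with_strong_hits(blast_tsv, candidate_genes,
                           max_evalue=1e-10, min_identity=50.0):
    """Return {gene: (subject, pident, evalue)} for every candidate gene
    that has at least one hit passing both thresholds (best e-value kept).

    The thresholds are arbitrary illustrative defaults, not recommendations.
    """
    best = {}
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if len(row) < len(BLAST_COLUMNS):
            continue  # skip comment or malformed lines
        hit = dict(zip(BLAST_COLUMNS, row))
        query = hit["qseqid"]
        if query not in candidate_genes:
            continue
        pident = float(hit["pident"])
        evalue = float(hit["evalue"])
        if evalue > max_evalue or pident < min_identity:
            continue
        if query not in best or evalue < best[query][2]:
            best[query] = (hit["sseqid"], pident, evalue)
    return best


# Toy hits table; in practice you would pass an open file handle instead.
example = io.StringIO(
    "geneA\tsis1\t92.5\t200\t15\t0\t1\t200\t1\t200\t1e-80\t350\n"
    "geneA\tsis2\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-30\t150\n"
    "geneB\tsis3\t35.0\t90\t58\t1\t1\t90\t3\t92\t0.002\t40\n"
)
flagged = genes_with_strong_hits(example, {"geneA", "geneB", "geneC"})
for gene, (subject, pident, evalue) in sorted(flagged.items()):
    print(gene, subject, pident, evalue)
```

Any gene that ends up in `flagged` has a close relative in the sister species, so its orthogroup is probably not truly species-specific and is worth a closer look.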

As an alternative, you could extract a centroid sequence for each orthogroup and compare or cluster the centroids against each other, merging orthogroups back into larger gene families wherever their centroids exceed an identity threshold of your choice.
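A rough sketch of that merging step in Python could look like this. It assumes you have already picked one centroid per orthogroup and computed pairwise percent identities between the centroids by some other means (for example, from a BLAST or DIAMOND run). The function name, the 70% threshold and the toy IDs are hypothetical. Note that this is plain single-linkage merging, so groups can chain together through intermediate centroids.

```python
def merge_orthogroups(orthogroups, centroid_identity, min_identity=70.0):
    """Single-linkage merge of orthogroups whose centroids share at least
    `min_identity` percent identity.

    orthogroups:        {orthogroup_id: [gene_id, ...]}
    centroid_identity:  {(orthogroup_id_a, orthogroup_id_b): percent_identity}
    Returns a sorted list of merged families, each a sorted list of gene IDs.
    """
    parent = {og: og for og in orthogroups}

    def find(og):
        # Union-find lookup with path halving.
        while parent[og] != og:
            parent[og] = parent[parent[og]]
            og = parent[og]
        return og

    for (og_a, og_b), identity in centroid_identity.items():
        if identity >= min_identity and og_a in parent and og_b in parent:
            parent[find(og_a)] = find(og_b)

    families = {}
    for og, genes in orthogroups.items():
        families.setdefault(find(og), set()).update(genes)
    return sorted(sorted(genes) for genes in families.values())


# Toy input: OG1 and OG2 have similar centroids, OG3 is unrelated.
orthogroups = {
    "OG1": ["sp1_g1", "sp1_g2"],
    "OG2": ["sp2_g7", "sp3_g4"],
    "OG3": ["sp1_g9"],
}
centroid_identity = {
    ("OG1", "OG2"): 85.0,
    ("OG1", "OG3"): 30.0,
    ("OG2", "OG3"): 28.0,
}
families = merge_orthogroups(orthogroups, centroid_identity)
for family in families:
    print(family)
```

Here the apparently species-specific OG1 is merged with OG2, while OG3 stays on its own. The threshold decides how aggressive the merging is, so it is worth trying a few values and inspecting the resulting families.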

Best regards

Kristian

Cronzo commented 1 year ago

Hi Kristian,

Many thanks for your input! I will definitely try some of your suggestions. One more question: would you say that this outcome is normal for OrthoFinder, given the algorithms it uses? I'm still curious why such similar proteins from sister species (with very low e-values) are apparently not picked up by the all-versus-all BLAST search, and why grouping them would depend on the MCL clustering instead.

Cheers