davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

How accurate can OrthoFinder be with small proteome files? #931

Closed MauriAndresMU1313 closed 1 month ago

MauriAndresMU1313 commented 1 month ago

I have been using the tool for a long time, thank you for your contribution to the community!

To give you an idea of what I'm doing: I downloaded 12 mammalian proteomes, all reference proteomes with good annotation levels. I then ran the analysis and extracted the orthogroup file, which I parsed to keep the information I need later: orthogroup, protein-id, and species.
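The parsing step described above could look roughly like the following. This is a minimal sketch, assuming the orthogroup file is a tab-separated table with an `Orthogroup` column followed by one column per species, and comma-separated protein IDs in each cell. The function name, species names, and protein IDs are hypothetical.

```python
import csv
import io


def orthogroups_to_long(tsv_text):
    """Flatten an orthogroup table into (orthogroup, protein_id, species) rows."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    species_names = header[1:]  # first column holds the orthogroup ID
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        orthogroup = record[0]
        for species, cell in zip(species_names, record[1:]):
            # Assumed layout: genes within a cell are separated by commas.
            for protein_id in cell.split(","):
                protein_id = protein_id.strip()
                if protein_id:
                    rows.append((orthogroup, protein_id, species))
    return rows


# Hypothetical two-species table, for illustration only.
example = (
    "Orthogroup\tHomo_sapiens\tMus_musculus\n"
    "OG0000000\tNP_1.1, NP_2.1\tNP_3.1\n"
    "OG0000001\tNP_4.1\t\n"
)
long_rows = orthogroups_to_long(example)
for row in long_rows:
    print(row)
```

Each protein ends up on its own row, which makes it easier to see how many proteins per species fall into a given orthogroup before any merge.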

In parallel, I ran the NCBI-dataset tool. In short, it lets me download orthologs from an accession list if I specify the taxa of interest. In this case, I'm trying to annotate unknown genes from my species of interest. Everything worked as expected, and one of the useful outputs is the metadata associated with each ortholog, plus the protein and RNA sequences for those orthologs. My idea was to validate the unknown genes (obtained from RNA-seq analysis) using both tools. The logic was: if I can find the protein-id in an orthogroup and merge it with the metadata, I have better evidence that two different tools grouped the same proteins together. However, this was not the case: when I performed the merge, I got many duplicated rows. This may be because of many-to-many relationships between proteins within one orthogroup, or because of some other aspect of the protein-orthogroup relationship.
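One way to investigate the duplicated rows is to count how often each join key appears on each side before merging. The sketch below assumes both tables are lists of tuples keyed by protein-id. The table contents and function names are hypothetical, and a repeated key in the metadata (for example, one row per transcript) is only one possible cause.

```python
from collections import Counter


def duplicated_keys(rows, key_index=0):
    """Return {key: count} for keys that occur more than once in rows."""
    counts = Counter(row[key_index] for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def expected_merge_rows(left_rows, right_rows, left_key=0, right_key=0):
    """Number of rows an inner join on the key would produce.

    A key seen m times on the left and n times on the right yields m * n rows.
    """
    left_counts = Counter(row[left_key] for row in left_rows)
    right_counts = Counter(row[right_key] for row in right_rows)
    return sum(n * right_counts[key] for key, n in left_counts.items())


# Hypothetical data: (protein_id, orthogroup) and (protein_id, transcript_id).
orthogroup_rows = [("NP_1.1", "OG0000000"), ("NP_2.1", "OG0000000")]
metadata_rows = [
    ("NP_1.1", "NM_1.1"),
    ("NP_1.1", "NM_1.2"),  # same protein listed twice in the metadata
    ("NP_2.1", "NM_2.1"),
]

print(duplicated_keys(metadata_rows))
print(expected_merge_rows(orthogroup_rows, metadata_rows))
```

If the expected row count exceeds the number of rows in the orthogroup table, the extra rows come from repeated keys rather than from the orthogroup assignment itself. Deduplicating the repeated side on protein-id before merging is one way to check this.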

Then I thought I could split the faa file from the NCBI-dataset tool by the 12 species. That would give me small "proteomes" containing the orthologs related to my unknown genes, which in theory I could run through OrthoFinder to validate the orthogroups. However, I suspect this idea is not reliable, because these sequence sets are far too small compared with a full proteome. What do you think about this approach?
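Splitting the faa file by species could be sketched as follows. This assumes each FASTA header ends with the organism name in square brackets, for example `[Homo sapiens]`. The real header layout of the downloaded file may differ, in which case the regular expression needs adjusting. The function name and sequences are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the species name is the last bracketed field in the header line.
SPECIES_TAG = re.compile(r"\[([^\[\]]+)\]\s*$")


def split_fasta_by_species(fasta_text):
    """Group FASTA records by species; returns {species: fasta_text}."""
    groups = defaultdict(list)
    species = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = SPECIES_TAG.search(line)
            species = match.group(1) if match else "unknown"
            groups[species].append(line)
        elif line.strip() and species is not None:
            # Sequence line: attach it to the most recent header's species.
            groups[species].append(line)
    return {name: "\n".join(lines) + "\n" for name, lines in groups.items()}


# Hypothetical records, for illustration only.
example_faa = (
    ">XP_1.1 hypothetical protein [Homo sapiens]\nMKT\nAAA\n"
    ">XP_2.1 hypothetical protein [Mus musculus]\nMKV\n"
    ">XP_3.1 hypothetical protein [Homo sapiens]\nMQQ\n"
)
per_species = split_fasta_by_species(example_faa)
for name, text in per_species.items():
    print(name, text.count(">"))
```

Each value can then be written to its own file, one per species, to serve as input for a separate run.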

Do you have any advice on a good way to validate orthologs from these two tools?

davidemms commented 1 month ago

Two suggestions: In the version 3 beta release you can add the option --scores-v2, which is a lot more robust when there are only a few genes per species.

I haven't looked at the NCBI dataset tool, but please keep in mind that two genes from different species in the same orthogroup are not necessarily orthologs. This follows from the definition of an orthogroup. There is some info on this in the Readme file where it discusses an example tree. This may not be the cause of the issue you're encountering, but I mention it here in case it is.

Best wishes,
David
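The point about orthogroups versus orthologs can be illustrated with a toy gene tree. This sketch uses the standard rule that two genes are orthologs when their last common ancestor in the gene tree is a speciation node, and paralogs when it is a duplication node. The tree, gene names, and function names are hypothetical, and this is not OrthoFinder's implementation.

```python
# Toy gene tree: internal nodes are (event, left, right); leaves are gene names.
# "S" marks a speciation node, "D" a duplication node. All names are made up.
TREE = (
    "D",
    ("S", "human_A", "mouse_A"),
    ("S", "human_B", "mouse_B"),
)


def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def lca_event(node, g1, g2):
    """Return the event label at the last common ancestor of g1 and g2."""
    if isinstance(node, str):
        raise ValueError("genes must be two distinct leaves of the tree")
    event, left, right = node
    for child in (left, right):
        if {g1, g2} <= leaves(child):
            return lca_event(child, g1, g2)  # both genes sit deeper in the tree
    if not {g1, g2} <= leaves(node):
        raise ValueError("gene not found in tree")
    return event


def are_orthologs(tree, g1, g2):
    """Two genes are orthologs if they diverged at a speciation event."""
    return lca_event(tree, g1, g2) == "S"


print(are_orthologs(TREE, "human_A", "mouse_A"))
print(are_orthologs(TREE, "human_A", "mouse_B"))
```

All four genes descend from one ancestral gene, so they would sit in a single orthogroup. Even so, `human_A` and `mouse_B` diverged at the duplication and are paralogs, which is why sharing an orthogroup is weaker evidence than an explicit ortholog call.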