How accurate can OrthoFinder be with small proteome files?

I have been using the tool for a long time, thank you for your contribution to the community!

Let me give you an idea of what I'm doing. I downloaded 12 mammalian proteomes, all of them well-annotated reference proteomes. I then ran OrthoFinder and extracted the orthogroup file, which I parsed to keep the information I need later: orthogroup, protein ID, and species.
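For reference, the parsing step can be sketched roughly as follows. This is a minimal sketch, assuming the usual `Orthogroups.tsv` layout (a tab-separated table with an `Orthogroup` column followed by one column per species, each cell holding comma-separated protein IDs). The function name `orthogroups_to_long` and the example IDs are hypothetical.

```python
import csv
import io


def orthogroups_to_long(handle):
    """Yield (orthogroup, protein_id, species) tuples from an
    Orthogroups.tsv-style table (assumed layout, see note above)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    species = header[1:]  # first column is the orthogroup ID
    for row in reader:
        if not row:
            continue
        orthogroup = row[0]
        for sp, cell in zip(species, row[1:]):
            # A cell lists zero or more protein IDs separated by commas.
            for protein_id in cell.split(","):
                protein_id = protein_id.strip()
                if protein_id:
                    yield orthogroup, protein_id, sp


# Tiny made-up table standing in for the real file.
example = (
    "Orthogroup\tHomo_sapiens\tMus_musculus\n"
    "OG0000000\tNP_1.1, NP_2.1\tNP_3.1\n"
    "OG0000001\t\tNP_4.1\n"
)
rows = list(orthogroups_to_long(io.StringIO(example)))
for row in rows:
    print(row)
```

With a real run, `io.StringIO(example)` would be replaced by an open file handle on the orthogroup table.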

In parallel, I ran the NCBI Datasets tool. In short, it lets me download orthologs for a list of accessions if I specify the taxa of interest. In this case, I'm trying to annotate unknown genes from my species of interest. Everything worked as expected, and one of the useful outputs is the metadata associated with each ortholog, along with the protein and RNA sequences for those orthologs.

My idea was to validate the unknown genes (obtained from an RNA-seq analysis) using both tools. The logic was: if the protein IDs that NCBI reports as orthologs fall in the same OrthoFinder orthogroup, I can merge them with the metadata and have better evidence, because two different tools grouped the same proteins together. However, this did not work: when I performed the merge, I got many duplicated rows. I suspect this is due to many-to-many relationships between proteins within an orthogroup, or to some other aspect of the protein-orthogroup relationship.
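To make the merge problem concrete, here is a minimal sketch of two checks that could be run before merging. The first predicts how many rows a join on protein ID will produce, which shows whether repeated keys on either side explain the duplicates. The second groups the NCBI protein IDs by ortholog set and records which orthogroups they fall in, so no row-level merge is needed. The function names, the `(ortholog_set_id, protein_id)` record shape, and the example IDs are all assumptions for illustration, not the actual NCBI metadata schema.

```python
from collections import Counter, defaultdict


def expected_merge_rows(left_keys, right_keys):
    """Number of rows an inner join on these keys would produce.
    Each key contributes (count on the left) * (count on the right)."""
    left, right = Counter(left_keys), Counter(right_keys)
    return sum(n * right[key] for key, n in left.items())


def orthogroups_per_set(protein_to_og, ncbi_records):
    """Map each NCBI ortholog set to the set of orthogroups its proteins
    were assigned to. None marks a protein missing from the orthogroup table."""
    result = defaultdict(set)
    for set_id, protein_id in ncbi_records:
        result[set_id].add(protein_to_og.get(protein_id))
    return dict(result)


# Made-up data: the metadata lists NP_1.1 twice (e.g. one row per transcript).
orthofinder_ids = ["NP_1.1", "NP_2.1", "NP_3.1"]
metadata_ids = ["NP_1.1", "NP_1.1", "NP_3.1"]
n_rows = expected_merge_rows(orthofinder_ids, metadata_ids)

protein_to_og = {"NP_1.1": "OG0000000", "NP_2.1": "OG0000000", "NP_3.1": "OG0000001"}
ncbi_records = [("geneA", "NP_1.1"), ("geneA", "NP_2.1"),
                ("geneB", "NP_3.1"), ("geneB", "XP_9.1")]
per_set = orthogroups_per_set(protein_to_og, ncbi_records)

# A set is concordant when all of its proteins sit in one known orthogroup.
concordant = sorted(s for s, ogs in per_set.items()
                    if len(ogs) == 1 and None not in ogs)
print(n_rows, per_set, concordant)
```

If the predicted row count is larger than either input, at least one side has repeated protein IDs, and deduplicating that side (or summarising per ortholog set as above) should remove the duplicated rows.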

Then I thought I could split the .faa file from the NCBI Datasets tool by species, for the 12 species. That would give me small "proteomes" containing only the orthologs related to my unknown genes, which in theory I could run through OrthoFinder to validate the orthogroups. However, I doubt this is reliable, because these files contain very few sequences compared with a full proteome. What do you think about this approach?
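The split itself would look something like the sketch below. It assumes each FASTA header carries the organism either as an `[organism=...]` tag or as a trailing `[Genus species]` bracket; that header layout is an assumption and should be checked against the actual .faa file. The names `species_of` and `split_fasta_by_species` and the example records are hypothetical.

```python
import re
from collections import defaultdict

# Assumed header styles: "[organism=Homo sapiens]" or a trailing "[Homo sapiens]".
ORGANISM_TAG = re.compile(r"\[organism=([^\]]+)\]")
TRAILING_BRACKET = re.compile(r"\[([^\[\]]+)\]\s*$")


def species_of(header):
    """Extract the organism name from a FASTA header, or 'unknown'."""
    match = ORGANISM_TAG.search(header) or TRAILING_BRACKET.search(header)
    return match.group(1).strip() if match else "unknown"


def split_fasta_by_species(lines):
    """Group FASTA lines (headers and sequence lines) by organism name."""
    records = defaultdict(list)
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            current = species_of(line)
        if current is not None and line:
            records[current].append(line)
    return dict(records)


# Made-up records standing in for the downloaded .faa file.
fasta = [
    ">NP_1.1 TP53 [organism=Homo sapiens] [GeneID=7157]",
    "MEEPQ",
    ">NP_2.1 tumor protein [Mus musculus]",
    "MTAME",
    "QSDIS",
]
by_species = split_fasta_by_species(fasta)
for name, record_lines in by_species.items():
    print(name, len(record_lines))
```

Each value in `by_species` could then be written to its own file, one per species, to serve as the reduced input.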

Do you have any advice on a good way to validate orthologs from these two tools?
