davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

Benchmark QUEST FOR ORTHOLOGS custom settings #232

Closed HENdavid closed 5 years ago

HENdavid commented 5 years ago

Dear David,

I recently started testing OrthoFinder for large-scale comparative genomics in bacteria. So far so good; nevertheless, I got a few orthogroups that contain more than 20 proteins per strain, so I got suspicious. Looking at the alignments, I realized they were quite messy, so there is a low chance that the members of those orthogroups are all true orthologs. I started testing several parameters to improve the prediction and ended up using mmseqs and an inflation value of 2. I am now happy with the results, since I think the predictions in general improved, but I would like to confirm the accuracy of this setting using the dataset from QUEST FOR ORTHOLOGS. Would you mind guiding me through submitting the results of such a comparison? (I would like to use the 2018 and 2011 reference proteomes.)

davidemms commented 5 years ago

Hi

In terms of the accuracy, I think you may have confused orthogroups and orthologues: https://github.com/davidemms/OrthoFinder#orthogroups-orthologs--paralogs. As explained in the link, not all members of an orthogroup are necessarily orthologues of one another. They are different things. To get orthologues, OrthoFinder infers a gene tree for each orthogroup and infers orthologues and gene duplication events from these trees. The latest paper on bioRxiv explains the whole process if you want all the details: https://www.biorxiv.org/content/early/2018/11/08/466201

To test using Quest for Orthologues, the first step is to download the reference proteomes and run OrthoFinder on these. For each species pair there will be a file giving the orthologues between these species. To submit them to QfO you'll need to create a file in a format that QfO understands. I think the easiest format is their text file format. It requires one line for each orthologue pair, with the two genes separated by a tab:

Sp1_gene1<tab>Sp2_gene7
Sp1_gene23<tab>Sp2_gene2
etc.

You should be able to find an explanation of this format on their webpage.

The most important thing is to make sure you correctly translate the orthologues from OrthoFinder into the QfO format. Specifically, for co-orthologues, OrthoFinder will give you lines like this:

OG0000007<tab>Sp1A, Sp1B, Sp1C<tab>Sp2D, Sp2E

Meaning that each of the genes (Sp1A, Sp1B, Sp1C) is an orthologue of each of the genes (Sp2D, Sp2E). I.e. the orthologues diverged from one another at a speciation event.

You need to put every (Sp1, Sp2) gene pair into the QfO file, one pair per line, as they are all orthologues:

Sp1A<tab>Sp2D
Sp1A<tab>Sp2E
Sp1B<tab>Sp2D
Sp1B<tab>Sp2E
Sp1C<tab>Sp2D
Sp1C<tab>Sp2E
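
The expansion above can be scripted. The following is only a minimal Python sketch, assuming each input line has exactly the layout of the example (orthogroup ID, then the Sp1 genes, then the Sp2 genes, tab-separated, with genes separated by commas). The function names `expand_orthologue_line` and `to_qfo_lines` are made up for illustration and are not part of OrthoFinder or QfO. If your files have a header line, it would need to be skipped first.

```python
from itertools import product


def expand_orthologue_line(line):
    """Return every (Sp1 gene, Sp2 gene) pair from one orthologue line.

    Assumed layout (taken from the example above, not from any spec):
    orthogroup<TAB>gene, gene, ...<TAB>gene, gene, ...
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated columns, got {len(fields)}")
    _orthogroup, sp1_field, sp2_field = fields
    sp1_genes = [g.strip() for g in sp1_field.split(",") if g.strip()]
    sp2_genes = [g.strip() for g in sp2_field.split(",") if g.strip()]
    # Every Sp1 gene is an orthologue of every Sp2 gene on the same line.
    return list(product(sp1_genes, sp2_genes))


def to_qfo_lines(lines):
    """Turn orthologue lines into 'geneA<TAB>geneB' strings, one per pair."""
    out = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        out.extend(f"{a}\t{b}" for a, b in expand_orthologue_line(line))
    return out


example = ["OG0000007\tSp1A, Sp1B, Sp1C\tSp2D, Sp2E\n"]
qfo_lines = to_qfo_lines(example)
print("\n".join(qfo_lines))
```

Run on the example line, this should print the same six pairs listed above. Gene identifiers may also need mapping to the IDs used in the QfO reference proteomes, which this sketch does not attempt.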

All the best, David