Ortholog groups against a known set of proteins

davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics

https://davidemms.github.io/

GNU General Public License v3.0


Ortholog groups against a known set of proteins #27

Open anuj2054 opened 8 years ago

anuj2054 commented 8 years ago

Hello, I have a set of 100 UniRef proteins. I want to find orthologs of this set of proteins in a set of 44 transcriptomes. Would OrthoFinder help me with this? I know OrthoFinder can find ortholog groups among the 44 transcriptomes themselves, but I need orthologs of the 100 UniRef proteins in particular for a phylogenetics study. Thanks, Anuj, University of Oklahoma

j-kominek commented 7 years ago

Hi, I would like to second that proposal, since I'm in an identical situation, although my dataset goes into ~200 genomes and 50 reference genes. Running a full all-vs-all with BLAST takes forever. DIAMOND works better, but it still takes a lot of time. Right now I have resorted to scripting pairwise OrthoFinder analyses between my reference sequences and each of the genomes and then collating the results, which feels somewhat like a dirty hack. I hope such a feature could be incorporated into the program relatively easily. Thank you for your consideration!

Cheers, -Jacek

davidemms commented 6 years ago

Hi Anuj and Jacek

I plan to add something that should help you do this. I'm currently working on the paper that will describe the OrthoFinder functionality added since the first version (trees and orthologues), so I won't be able to start development until later this year. Unfortunately this may be too late for you, but it will be coming!

All the best David

faguil commented 6 years ago

Hi everyone and David,

Given that this issue is not closed and I have to do a similar job to the one described by Anuj and Jacek, I am wondering if anyone (David) has an idea about best practices when you have a short list of sequences (proteins) and would like to look for orthologs across hundreds of proteomes (gene models from genomes and predicted proteins from transcriptomes). I would appreciate any help on this matter.

Thanks in advance, Felipe

davidemms commented 6 years ago

I think I would take one of two approaches:

1. If at all possible, I would do just a normal OrthoFinder analysis on the proteomes and then identify the orthogroups corresponding to my reference genes using BLAST.
2. If this is not possible because the number of proteomes (or the number of sequences in the transcriptomes) is too large, then I would first use DIAMOND to reduce each proteome to only those sequences that have statistically significant similarity to my reference genes, and then run OrthoFinder on these reduced proteomes. With 100 reference genes I think you will probably get enough sequences in each proteome for OrthoFinder to infer good parameters for each proteome and infer orthogroups effectively. The initial screening should make the input data small enough that OrthoFinder runs quickly on it using DIAMOND.

I hope this helps,

All the best David
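
For anyone trying the first approach, here is a minimal sketch of the "identify the orthogroups corresponding to my reference genes" step. It is not part of OrthoFinder. It assumes you have BLAST-style 12-column tabular output from searching the reference proteins (queries) against the combined proteomes, plus the text of an `Orthogroups.tsv`-style file (header row, one orthogroup per row, genes within a cell separated by commas). The function names and the e-value cutoff are illustrative choices, not anything confirmed in this thread.

```python
import csv
import io


def load_gene_to_orthogroup(orthogroups_tsv_text):
    """Map each gene ID to its orthogroup ID (assumed Orthogroups.tsv layout)."""
    gene_to_og = {}
    reader = csv.reader(io.StringIO(orthogroups_tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row: Orthogroup, species1, species2, ...
    for row in reader:
        if not row:
            continue
        orthogroup = row[0]
        for cell in row[1:]:
            for gene in cell.split(","):
                gene = gene.strip()
                if gene:
                    gene_to_og[gene] = orthogroup
    return gene_to_og


def best_hits(blast_tab_text, max_evalue=1e-5):
    """Return {query: subject}, keeping the highest-bitscore hit per query."""
    best = {}
    for line in blast_tab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hypothetical cutoff; tune for your data
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def assign_orthogroups(blast_tab_text, orthogroups_tsv_text, max_evalue=1e-5):
    """Map each reference protein to the orthogroup of its best hit (or None)."""
    gene_to_og = load_gene_to_orthogroup(orthogroups_tsv_text)
    hits = best_hits(blast_tab_text, max_evalue)
    return {query: gene_to_og.get(subject) for query, subject in hits.items()}
```

A reference protein maps to `None` when its best hit is not in any listed orthogroup, and it is absent from the result when no hit passes the cutoff. Taking only the single best hit is a simplification; checking that the top few hits agree on the orthogroup would be more cautious.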

faguil commented 6 years ago

Hi David,

Thanks for sharing your thoughts on how to do this kind of analysis. I will try both approaches.

To share with you (and others), I am blasting each proteome (more than 150 species, using diamond) against my reference proteins and will use the blast output(s) in OrthoFinder. I have not finished the blasting step but it should be done soon.

Regards,

Felipe
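
A rough sketch of the screening step described above, for readers who want to reproduce it. It assumes each proteome was searched as the query against the reference proteins and that the hits are in 12-column tabular format (for example from `diamond blastp --outfmt 6`), so the proteome sequence ID is in column 1 and the e-value in column 11. The function names and the cutoff are hypothetical. The filtered FASTA files would then be given to OrthoFinder as its input proteomes.

```python
def ids_with_hits(diamond_tab_text, max_evalue=1e-5):
    """Collect query IDs that have at least one hit at or below the e-value cutoff."""
    keep = set()
    for line in diamond_tab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            keep.add(fields[0])
    return keep


def filter_fasta(fasta_text, keep_ids):
    """Return FASTA text containing only records whose ID is in keep_ids."""
    kept_lines = []
    writing = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The ID is the header text up to the first whitespace.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            writing = seq_id in keep_ids
        if writing:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")
```

Usage would be one call per proteome: read the hit table, build the ID set, and write the filtered FASTA to a new directory that holds only the reduced proteomes.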

caonetto commented 6 years ago

I am not sure if this comment belongs in this topic, but is there a way to perform a Bi-directional Best Hit analysis against a reference genome using OrthoFinder? I am trying to do a constraint analysis of many genomes against a reference in order to obtain a data set that shows the presence or absence of a given protein (belonging to the reference) in each of the genomes.

Thanks
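
For illustration only, a bi-directional best hit check can be sketched outside OrthoFinder from two 12-column tabular search outputs: reference vs. genome and genome vs. reference. This is a generic sketch under that assumption, not an OrthoFinder feature, and the function names are hypothetical.

```python
def best_hits(tab_text):
    """Return {query: subject} using the highest-bitscore hit for each query."""
    best = {}
    for line in tab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(ref_vs_genome_text, genome_vs_ref_text):
    """Return {reference_protein: genome_protein} for pairs that are each other's best hit."""
    forward = best_hits(ref_vs_genome_text)
    reverse = best_hits(genome_vs_ref_text)
    return {
        ref: gene
        for ref, gene in forward.items()
        if reverse.get(gene) == ref
    }
```

Running `reciprocal_best_hits` once per genome and recording which reference proteins appear as keys gives one column of a presence/absence table. Ties in bitscore are resolved by first occurrence here, which is an arbitrary choice.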

KristinaGagalova commented 4 years ago

Hi, I am not sure if this issue is still open. Another possible implementation would be to use OrthoDB, where the relationships between proteins and orthogroups are already known, and then align the proteomes against the OGs. Is anyone willing to implement something similar? My genome annotations are still at a draft stage, so this approach would be quite useful for me.

vragh commented 3 years ago

@faguil how did @davidemms's suggestion work out for you?

I'm dealing with a similar situation (~20 transcriptomes -> proteomes) and a bunch of proteins of interest. What I've done is just add the proteomes that contain these proteins of interest (e.g., D. melanogaster) as an input alongside my sample proteomes to OrthoFinder. Then I simply pulled out all pairwise orthologs corresponding to the proteins of interest from the Orthologues/ directory.

Can anyone comment if this is an acceptable solution?
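
To make the "pulled out all pairwise orthologs" step concrete, here is a small sketch. It assumes each pairwise file under `Orthologues/` is tab-separated with three columns (orthogroup, genes from the first species, genes from the second species) and that genes within a cell are comma-separated. That layout is an assumption about the output format, and `orthologs_of` is a hypothetical helper name.

```python
import csv
import io


def orthologs_of(pairwise_tsv_text, genes_of_interest):
    """Return {gene_of_interest: sorted orthologs in the other species}."""
    wanted = set(genes_of_interest)
    found = {}
    reader = csv.reader(io.StringIO(pairwise_tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 3:
            continue
        ref_genes = [g.strip() for g in row[1].split(",") if g.strip()]
        other_genes = [g.strip() for g in row[2].split(",") if g.strip()]
        for gene in ref_genes:
            if gene in wanted:
                found.setdefault(gene, set()).update(other_genes)
    return {gene: sorted(orthologs) for gene, orthologs in found.items()}
```

Genes of interest with no row in the file are simply missing from the result, which can be read as "no ortholog inferred in that species".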

faguil commented 3 years ago

Hi @vragh,

I ended up taking the same approach you describe, and after some parsing of the OrthoFinder output it worked for me. This was done a while ago, so I no longer have the script on hand. Sorry about that.

Best