pappewaio closed 11 years ago
To begin with, it should do the following. Pseudocode:

    if (ortholog) {
        keep the one with the shortest branch
    } else {
        tell the user that we have a potential paralog
    }
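The decision above can be sketched in a few lines. This is only an illustration in Python, not the pipeline's Perl code: the function name `resolve_duplicates`, the `(name, branch_length)` tuples and the `is_ortholog` flag are all assumptions made for the example, and how `is_ortholog` is determined is left out.

```python
def resolve_duplicates(copies, is_ortholog):
    """Decide what to do with several copies from one species.

    copies      -- list of (name, branch_length) tuples (hypothetical format)
    is_ortholog -- True if the copies were judged to be orthologs

    Returns (kept_name, warning); exactly one of the two is None.
    """
    if is_ortholog:
        # Keep the copy with the shortest branch.
        kept = min(copies, key=lambda c: c[1])
        return kept[0], None
    # Otherwise keep nothing automatically and tell the user.
    names = ", ".join(name for name, _ in copies)
    return None, f"potential paralog among: {names}"


# Made-up example values.
example = [("human_A", 0.12), ("human_B", 0.05)]
kept, warning = resolve_duplicates(example, is_ortholog=True)
```

Returning the warning instead of printing it leaves the choice of how to report it to the caller.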
It gets a bit more complicated when there are more than 2 copies of the gene/protein in one species. If, for example, two of them end up together in the tree while a third one is "off", the two are probably orthologs of the genes in the other species and the third is not. This should be reported, but how to handle it should be left to the user.
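One possible way to report the case with more than two copies is to group the copies of a species by the largest subtree that contains only that species. The sketch below is a standalone Python illustration under that assumption; the `Node` class, `leaves` and `group_copies` are hypothetical helpers, not BioPerl functions, and the tree is made up.

```python
class Node:
    """Minimal tree node; leaves carry a name and a species."""
    def __init__(self, name=None, species=None, children=()):
        self.name = name
        self.species = species
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def leaves(node):
    """Return all leaf nodes below (or equal to) node, left to right."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def group_copies(root, species):
    """Group the copies of `species` that sit together in the tree.

    For each copy, climb towards the root while the parent's subtree
    contains only `species`; copies that reach the same node form a group.
    """
    groups = {}
    for leaf in leaves(root):
        if leaf.species != species:
            continue
        top = leaf
        while top.parent is not None and all(
            l.species == species for l in leaves(top.parent)
        ):
            top = top.parent
        groups.setdefault(id(top), []).append(leaf.name)
    # Largest group first.
    return sorted(groups.values(), key=len, reverse=True)


# Two human copies together, a third one "off" next to a mouse gene.
tree = Node(children=[
    Node(children=[
        Node(children=[Node("human_1", "human"), Node("human_2", "human")]),
        Node("mouse_1", "mouse"),
    ]),
    Node(children=[Node("human_3", "human"), Node("mouse_2", "mouse")]),
])
groups = group_copies(tree, "human")
# groups == [["human_1", "human_2"], ["human_3"]]
```

The function only reports the groups; deciding which group to keep stays with the user.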
The sub findParalogs detects paralogs in the easy case, when there are 2 copies from one species in the tree. It returns an array with the names of the paralogs.
The two BioPerl functions get_lca (http://www.bioperl.org/wiki/Least_common_ancestor) and get_all_Descendents (http://www.bioperl.org/wiki/HOWTO:Trees) should be enough to solve it.
If, for instance, there are n human homologs of the gene of interest in the tree, pseudocode:

    @potParalogs = all human nodes
    LCA = get_lca(@potParalogs)
    @children = get_all_Descendents(LCA)   # only the leaf nodes should be compared; internal nodes have no species

    if (all elements in @children == "human") {
        HURRA!
    } else {
        PARALOG!
    }

This assumes that the two functions work as we hope :)
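The LCA test in the pseudocode can be tried out without BioPerl. The following is a minimal standalone sketch in Python, not the actual findParalogs sub: `Node`, `leaves`, `ancestors`, `lca` and `find_paralogs` are hypothetical stand-ins written for this example, and only leaf nodes are compared against the species.

```python
class Node:
    """Minimal tree node; leaves carry a name and a species."""
    def __init__(self, name=None, species=None, children=()):
        self.name = name
        self.species = species
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def leaves(node):
    """Return all leaf nodes below (or equal to) node."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def ancestors(node):
    """Return node followed by its ancestors, ending at the root."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def lca(nodes):
    """Lowest common ancestor of a non-empty list of nodes."""
    common = ancestors(nodes[0])
    for node in nodes[1:]:
        seen = {id(a) for a in ancestors(node)}
        common = [a for a in common if id(a) in seen]
    return common[0]  # deepest node shared by all paths


def find_paralogs(root, species):
    """Return names of the species' copies if they look like paralogs."""
    copies = [l for l in leaves(root) if l.species == species]
    if len(copies) < 2:
        return []
    clade = leaves(lca(copies))
    if all(l.species == species for l in clade):
        return []  # "HURRA!": the copies form a clade of their own
    return [l.name for l in copies]  # "PARALOG!"


# Human copies together: no paralog reported.
together = Node(children=[
    Node(children=[Node("human_A", "human"), Node("human_B", "human")]),
    Node("mouse_A", "mouse"),
])
# Human copies separated by mouse genes: both are reported.
apart = Node(children=[
    Node(children=[Node("human_A", "human"), Node("mouse_A", "mouse")]),
    Node(children=[Node("human_B", "human"), Node("mouse_B", "mouse")]),
])
```

Here `find_paralogs(together, "human")` returns an empty list, while `find_paralogs(apart, "human")` returns both human names, matching the two branches of the pseudocode.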
The pseudo-code above was implemented and the resulting function verified to work (with the help of the test findParalogs.t). What remains is to throw a warning in Main.pl when paralogs are detected.
One step in our curation pipeline is to find probable paralogs. One tool that could be useful is BioPerl, see http://www.bioperl.org/wiki/HOWTO:Trees