DessimozLab / pyham

MIT License
9 stars 5 forks source link

Extract genes from specific nodes #20

Closed eli1199 closed 2 years ago

eli1199 commented 2 years ago

Hi there,

I used OMA to infer Hierarchical Orthologous Groups (HOGs) and then ran pyham, see code and result below:

pyham_analysis = pyham.Ham(newick_path, orthoxml_path, use_internal_name=False, tree_format='newick')
treeprofile = pyham_analysis.create_tree_profile(outfile="my_tree.html")

image

I don't know how to extract the genes at that particular node that is circled. For example how to get those 299 gained genes by name? I see that lateral and vertical comparisons can be done but how do I get the genes from a specific node?

Thank you.

F4llis commented 2 years ago

Hello,

Nothing more simple !

The first step is to do a vertical comparaison between a taxon and its parent taxon like following: human = ham_analysis.get_extant_genome_by_name("HUMAN") vertebrates = ham_analysis.get_ancestral_genome_by_name("Vertebrata") vertical_human_vertebrates = ham_analysis.compare_genomes_vertically(human, vertebrates)

Then, you can retreived any of the categories plotted (gained, lost, retained, duplicated) from the vertical_human_vertebrates object previously created like following: vertical_human_vertebrates.get_gained()

If you need more details on this, please have a look at https://zoo.cs.ucl.ac.uk/tutorials/tutorial_pyHam_get_started.html#Vertical-comparison

If you need more help, please contact us again :)

eli1199 commented 2 years ago

Thank you for such a quick reply @F4llis . I think I understand how to indicate the taxon but not the parent taxon. For example in the image above, if I want to get the list of 202 gained genes in the un-circled node, I can use "E" or "S" as the taxon, but i'm not sure how I should indicate the other taxon so I can retrieve those 202 genes

specie_E = ham_analysis.get_extant_genome_byname("E") specie = ham_analysis.get_ancestral_genome_by_name("A??")

F4llis commented 2 years ago

If you want to retreived an ancestral genome you can do as document here: https://zoo.cs.ucl.ac.uk/tutorials/tutorial_pyHam_get_started.html#AncestralGenome-queries

There is several method to do so: by name, by mrca of extent genome object, etc..

You have a bit more information here about the object in pyham https://zoo.cs.ucl.ac.uk/doc/pyham/api.html. You would discover that each genome object you can get have a taxon attribute that is part of the a master taxonomy object build from the reference tree inputted in pyham. These objects are ete3.TreeNode, this means that if you have a genome X you can do

X taxonomy object  => X.taxon
X parent taxonomy object  => X.taxon.up
X parent genome => X.taxon.up.genome

Then, there is several ways of addressing your needs if you have named ancestral genome or not, etc..

F4llis commented 2 years ago

PS: you need to imagine that pyham objects are organizing and linked like in real life. You have a taxonomy object that represent the phylogeny, where attached to each nodes you have ancestral genomes and to each leaves extant species. Each of these Genome object have respectively ancestral genes (HOGs) or Extant genes attach. And each of these relationship are encoded in the object: you can go from a gene to its genome to the parent genome to etc...

eli1199 commented 2 years ago

Thank you @F4llis ! I was able to get the list of HOGs using your suggestion. The list came out like this: [<HOG(708)>, <HOG(709)>, <HOG(710)>, <HOG(711)>, etc etc... But I still need the actual gene names included in those HOGs. Is there any atribute for that?

F4llis commented 2 years ago

It means you get HOG Objects which is correct ! Then, I invite you to look at the tutorials i send you. Here https://zoo.cs.ucl.ac.uk/tutorials/tutorial_pyHam_get_started.html#HOG you can see that you can get all descendants (extant) genes for a given HOG.

eli1199 commented 2 years ago

Hi @F4llis , the vertical_human_vertebrates.get_gained() works perfectly, and I get output like: [<HOG(708)>, <HOG(709)>, <HOG(710)>, <HOG(711)>, etc etc. I wanted to try to get the genes retained at the same node, so the 1089 genes in the image above which are 'retained'. I used vertical_human_vertebrates.get_retained() but the output looks different and there are a lot of HOGs without a # (see below):

<HOG(11239)>: <HOG()>, <HOG(11240)>: <HOG()>, <HOG(11241)>: <HOG()>, <HOG(11242)>: <HOG()>, <HOG(11243)>: <HOG()>, <HOG(11244)>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, <HOG()>: <HOG()>, etc etc etc.

If you count the total number of "HOGS" outputted, there is 2178 which is exactly double of 1089. Why does that happen? I just need the 1089 and their ids so I can get the descendant genes. Thank you

lwadi commented 2 years ago

Hi @eli1199 / @F4llis , did you end up figuring this out? I get the expected number of HOGs for lost category, but I also see more HOGs than expected when I query for 'retained', and many of them have no number/id: "HOG()"

How can we get the descendant genes for those HOGs without ids? get_hog_by_id will not work for these cases.

Thanks!

eli1199 commented 2 years ago

Hi @lwadi , no I have not yet figured out why there are more retained HOGs than expected however I found out that HOGS with numbers (example: HOG(709)) are 'root HOGs' and those without numbers (HOG()) are 'sub HOGs' so maybe this could help. I think we should still be able to get genes from sub HOGs, but not sure how.

Hoping @F4llis / @alpae can shed some light here!

F4llis commented 2 years ago

Hello,

@eli1199 is right regarding root/sub HOGs: If your orthoxml doesn't have internal hogs ids only the root hogs will have IDs (that is shown between the brackets when printed).

Regarding the descendants genes, once again, I invite you to read the documentation in shared few messages ago (https://zoo.cs.ucl.ac.uk/tutorials/tutorial_pyHam_get_started.html#HOG):

desc_genes = hog1.get_all_descendant_genes()

Regarding the retained HOGs it is presented as following <HOG(11239)>: <HOG()> where the first one is the HOG at the upper level and the second on the HOG the lower level. This is how you can track genes behaviour in the vertical mapper otherwise you wouldn't know from where your HOG derived from.

In the documentation (https://zoo.cs.ucl.ac.uk/tutorials/tutorial_pyHam_get_started.html#Vertical-comparison) the example show a vertical map between vertebrata and human resulting in the following output:

HOG at vertebrates -> descendant gene in human
{<HOG(1)>: Gene(1)}

Regards, Clement

eli1199 commented 2 years ago

@F4llis I think the code that you mention desc_genes = hog1.get_all_descendant_genes() works if you have IDs but there are cases when all the HOGs at that specific node are sub HOGs and therefore do not have IDs

hog1 = ham_analysis.get_hog_by_id(???)
desc_genes = hog1.get_all_descendant_genes()
F4llis commented 2 years ago

No, get_all_descendant_genes work on all possible HOG objects. You can get any hog both root or sub using different methods:

All of this is catalogs here: https://zoo.cs.ucl.ac.uk/tutorials/tutorial_pyHam_get_started.html#Welcome-to-HAM-Tool-Box!