DessimozLab / pyham

MIT License
9 stars 5 forks source link

Some questions when following the tutorial of pyham #23

Closed zmz1988 closed 1 year ago

zmz1988 commented 1 year ago

Hi, thanks for providing this cool tool! I just have some questions during my trial following the tutorial.

(1) When I tried the iHam part for visualisation, I always got an error. I guess I must have don'e something wrong, or probably I didn't install iHam? Please see below the error:

>>> hog_33935 =  ham_analysis.get_hog_by_id(33935)
>>> hog33935_Arab = hog33935.get_at_level(Arab)
>>> output_name = "output/HOG{}.html".format(hog33935_Arab.hog_id)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'hog_id'

(2) Also, I have a question about extracting genes from sub HOG. I'm following the question here #20 , and I got into the probable as well in sub HOG that has no ID. In my case, I would like to know which paralogous gene in the root HOG are the closest homologs between species. But without a HOG ID, I can't tell that, right?

For example, I see my HOG39414 has two genes (318017 and 317326) from species A and two genes (359917 and 359183) from species B, and also one or two genes from other species. I had compared their proteins sequences, and it's highly likely that 318017 would be blasted to 359917, and 317326 would be to 359183, if it's in a blastp result. But now they are all in one root HOG, and their sequences are in one single HOG output fasta file in the OMA standalone output. So my question is whether there is a way to separate those genes from their sub HOG (in this case, 318017 would be put together with 359917, and 317326 would be with 359183)?

E.g. codes see below:

>>> hog33935 = ham_analysis.get_hog_by_id(33935)
>>> desc_genes_clustered = hog33935.get_all_descendant_genes_clustered_by_species()
>>> for species, genes in desc_genes_clustered.items():
...     print(species, genes)
... 
ARAAL [Gene(2106)]
BRARP [Gene(287092)]
BRANA [Gene(129916), Gene(146488), Gene(172670)]
BRAOL [Gene(203635), Gene(232910)]
A [Gene(318017), Gene(317326)]
B [Gene(359917), Gene(359183)]
ARATH [Gene(60045), Gene(59378)]
ARALY [Gene(25956)]

>>> desc_hog = hog33935.get_all_descendant_hogs()
>>> print(desc_hog)
[<HOG(33935)>, <HOG()>, <HOG()>, <HOG()>, <HOG()>, <HOG()>, <HOG()>, <HOG()>, <HOG()>, <HOG()>, <HOG()>, <HOG()>, <HOG()>, <HOG()>]

Thanks a lot in advance, and looking forward to your reply!

F4llis commented 1 year ago

Hello,

For (1) this is normal because get_at_level return a list (see documentation). Imagine you have a duplication in between the root and the level 'Arab', this function will return 2 sub hogs.

For (2) as mentioned in the answers of #20, Sub-hogs not labelled with ids are not a problem to browse. You can use HOG hierarchy to navigate into them (parent, children, duplication attribute), you can use vertical mapping to jump from a level to another lower level (bonus: all hogs are mapped to their descendants), you can use horizontal mapping to compare to species (or taxa) according to their MRCA (bonus: you get all paralogs clustered by their common ancestral genes), etc...

zmz1988 commented 1 year ago

Thank you so much for the fast reply! Ok, now I get it.

So for (1), I would just need to call the root level of HOG.

For (2), I tried the code for h, gs in rodents.get_ancestral_clustering().items(): print("HOG: {} -> genes: {}".format(h,gs)) (as in in [39] from the tutorial file), and it seems that I got a list of HOGs including gene names for one of the ancestral genome (e.g. Arab in the example above). Then I have a question here: what's the difference between the genes I got by using in [39] and by using the vertical or horizontal comparison? Is that the gene I got from in [39] are only common genes in all the leaves (no singleton in each leaf), but with vertical or horizontal comparison I could get the singleton as well?

F4llis commented 1 year ago

Both from documentation:

(A) alway return HOGs map to their extant genes (at species level then) while (B) allow to compare any two levels AND classify HOGs into above categories.

zmz1988 commented 1 year ago

Thanks! So if I understand correctly, if I would like to extract homology for genes from three species (three leaves with two nodes), let's say they have a common ancestor A, then I should use get_ancestral_clustering() for the node A, right?

In two level comparison (two leaves) cases, the number of genes output from get_ancestral_clustering() would be the same as the sum of get_loss(), get_gain(), get_duplicate() and get_retained(). Right?

F4llis commented 1 year ago

Yes ! But you should say "extract orthology" instead of " extract homology".

Yes again !