drostlab / myTAI

Evolutionary Transcriptomics with R
https://drostlab.github.io/myTAI/
GNU General Public License v2.0

Defining Divergence Stratum #5

Closed YXXEMXL closed 5 years ago

YXXEMXL commented 5 years ago

Dear Dr. Drost,

I am writing to ask how to define a Divergence Stratum. I am using myTAI to calculate the TDI of chili pepper genes. The Ka and Ks values of each gene have been prepared using tomato genes as the reference, but I don't know how to define the Divergence Stratum. In your example, you assigned a value (1-10) to each gene (Row 1). Could you let me know how these values are defined? Thank you very much.

~Xiuxu

HajkD commented 5 years ago

Dear Xiuxu,

Many thanks for contacting me and I am very grateful for your feedback.

The details are described here: https://hajkd.github.io/orthologr/articles/divergence_stratigraphy.html

In brief, a Divergence Stratum is defined as a decile (= 10% quantile) retrieved from all Ka/Ks (or dN/dS) values of all orthologs returned by the pairwise genome comparison.

In other words, imagine you have 10000 orthologous genes and their corresponding Ka/Ks values after performing a pairwise genome comparison with the dNdS() function implemented in the orthologr package. These 10000 Ka/Ks values follow a distribution between 0 and +Inf, where Ka/Ks < 1 reflects purifying selection, Ka/Ks = 1 reflects neutral evolution, and Ka/Ks > 1 reflects positive selection (in practice, the largest Ka/Ks values I have seen are around 100).

Next, you bin these 10000 Ka/Ks values by decile (= 10% quantile): the lowest 10% of Ka/Ks values fall into decile one (= Divergence Stratum 1), the values between the 10% and 20% quantiles fall into decile two (= Divergence Stratum 2), ..., and the largest values, above the 90% quantile, fall into decile ten (= Divergence Stratum 10). This is what the DivergenceMap() function in the orthologr package does. This way, each Divergence Stratum contains (almost) the same number of genes.
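
To make the binning step concrete, here is a minimal sketch in Python using only the standard library. It is not the orthologr implementation: the function name `divergence_strata` is hypothetical, the Ka/Ks values are simulated rather than real, and it bins by rank, which may treat tied values differently from a quantile cut-point approach.

```python
import random

def divergence_strata(kaks_values, n_strata=10):
    """Assign each Ka/Ks value to a rank-based quantile bin.

    Stratum 1 holds the lowest Ka/Ks values, stratum n_strata the highest.
    Hypothetical helper for illustration only.
    """
    n = len(kaks_values)
    # Indices of the genes, sorted by ascending Ka/Ks.
    order = sorted(range(n), key=lambda i: kaks_values[i])
    strata = [0] * n
    for rank, idx in enumerate(order):
        # Ranks 0..n-1 are split into n_strata blocks of (almost) equal size.
        strata[idx] = rank * n_strata // n + 1
    return strata

# Simulated Ka/Ks values for 10000 orthologs (an assumption, not real data).
random.seed(1)
kaks = [random.lognormvariate(-1.5, 1.0) for _ in range(10000)]

strata = divergence_strata(kaks)
counts = [strata.count(s) for s in range(1, 11)]
print(counts)  # each of the 10 strata contains 1000 genes
```

Because the assignment depends only on the rank of each gene, the strata have equal sizes whenever the number of genes is divisible by ten, whatever the shape of the Ka/Ks distribution.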

In contrast, the phylostrata categorization used in phylostratigraphy may lead to some phylostrata containing e.g. 30% of all genes while others contain only 1%. Since this gene number bias isn't corrected in any downstream analysis, I tried to avoid it when defining Divergence Strata.

I hope this helps.

Many thanks and best wishes, Hajk