arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0
81 stars 41 forks source link

Plotting Ks distributions #24

Closed ShuaiNIEgithub closed 4 years ago

ShuaiNIEgithub commented 4 years ago

Hi, Sorry for disturbing you again.

I tried using wgd last week, which is a great tool! And In use, I have two questions:

  1. How to plot Ks distribution of one-to-one orthologs of C. papaya and A. thaliana, as shown in Fig1. A and E in "wgd—simple command line tools for the analysis of ancient whole-genome duplications"?
  2. Can I use the result of ksd directly to plot Ks distribution?

GREETING!

arzwa commented 4 years ago

Hi; there are some tools for plotting distributions in wgd (wgd viz); and by default when you run wgd ksd you'll get an svg file with a plot; which you can further edit in programs like inkscape or so if desired.

Of course one can just use one's favorite plotting tool (e.g. matplotlib, R, gnuplot, ...) to plot the tsv output from ksd, but beware that for whole-paranome distributions you should generally plot node-weighted or node-averaged histograms (i.e. within a gene family, you should average or weight the Ks values for each node of the gene family tree) and not plot the 'raw' Ks values (see this section in the docs for more info). For instance, in R you could use the following code for a node-averaged histogram:

# read in data
df = read.csv("your_distribution.tsv", sep="\t")

# filter Ks distribution (0.001 < Ks < 5)  (try different cut-offs)
df = df[df$Ks < 5,]
df = df[df$Ks > 0.001,]

# perform node-averaging (redo when applying other filters)
dff = aggregate(df$Ks, list(df$Family, df$Node), mean)

# plot histogram (try different numbers of bins)
hist(dff$x, n=50)

You may also wish to fit a kernel density to the distribution, but in that case beware that it is advisable to use a boundary correction in order to prevent fitting a spurious peak in the lower part of the Ks distribution. This can also be done with the wgd kde command.

ShuaiNIEgithub commented 4 years ago

Thank you for the detailed answer. In my platform, I can not use any plotting tool except R. So, your R code helped me a lot! In addition, could you please tell me how to prepare the wgd input data for calculating Ks of two genomes? I did not find this in the documentation.

arzwa commented 4 years ago

I assume you mean for a one-to-one ortholog Ks distribution? Just two fasta files with CDS sequences. Note that they should be proper CDS, i.e. translatable DNA sequences (i.e. codon triplets). If you have your two fasta files you just run wgd mcl --cds --one_v_one -s sp1.fasta,sp2.fasta to get the RBH orthologs and than run wgd ksd orthologs.tsv sp1.fasta sp2.fasta to get the Ks distribution.

ShuaiNIEgithub commented 4 years ago

Thank you so much!