arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0

Extract CDS for WGD events #50

Closed by joehagmann 3 years ago

joehagmann commented 3 years ago

This tool helped my analysis a lot, Arthur. I have a question about the output files. How can I derive the CDS assigned to WGD events from the wgd mix output tsv? I see the gene families but don't see an obvious way to extract the corresponding CDS pair for each row.

Concretely, on my data, this is the content of the GMM mix output: [image]. And this is the content of the ksd output for gene family 1: [image]. Can I extract the CDS pairs in the ksd output from the rows in the mix output? (I could compare stats like alignment coverage, identity and length, but is there a less ambiguous way of doing it?)

Let me know if you need more information.

arzwa commented 3 years ago

The mixture modeling tools use the node-averaged Ks values as data, i.e. the Ks values estimated for nodes in the gene family trees. So each Family-Node combination (row) in the wgd mix output corresponds to a set of gene pairs in the relevant family that have this node as their most recent common ancestor. You can find the associated pairs in the ksd output. So the way to get the pairs for a mixture component (which I guess corresponds to a putative WGD) is to identify the relevant rows of the mixture output and then look up the gene pairs for those Family-Node combinations in the ksd output. Does that make it somewhat clearer?
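The lookup described above can be sketched with pandas. This is only an illustration on toy data, not part of wgd itself: the column names `Family`, `Node`, `Paralog1` and `Paralog2` are assumed to match the ksd output, and `component` is a hypothetical name for whatever column in the mix output holds the mixture component assignment. Adjust these to the actual headers of your files.

```python
import pandas as pd

def pairs_for_component(mix, ksd, component, component_col="component",
                        keys=("Family", "Node")):
    """Return the ksd rows (gene pairs) whose Family-Node combination is
    assigned to the given mixture component in the mix table.

    `component_col` and `keys` are assumed column names, not a wgd API.
    """
    keys = list(keys)
    # Family-Node combinations that belong to the component of interest
    selected = mix.loc[mix[component_col] == component, keys].drop_duplicates()
    # Pairs without a node assignment cannot be matched, so drop them first
    ksd_valid = ksd.dropna(subset=keys)
    # Keep only the gene pairs whose Family-Node combination was selected
    return ksd_valid.merge(selected, on=keys, how="inner")

# Toy stand-in for the mix output: one row per Family-Node combination
mix = pd.DataFrame({
    "Family": ["GF1", "GF1", "GF2"],
    "Node": [3.0, 4.0, 2.0],
    "Ks": [0.8, 2.1, 0.9],
    "component": [1, 2, 1],  # hypothetical component assignment column
})

# Toy stand-in for the ksd output: one row per gene pair
ksd = pd.DataFrame({
    "Paralog1": ["a1", "a1", "a2", "b1", "b3"],
    "Paralog2": ["a2", "a3", "a3", "b2", "b4"],
    "Family": ["GF1", "GF1", "GF1", "GF2", "GF2"],
    "Node": [3.0, 4.0, 4.0, 2.0, None],
    "Ks": [0.8, 2.0, 2.2, 0.9, None],
})

pairs = pairs_for_component(mix, ksd, component=1)
print(pairs[["Family", "Node", "Paralog1", "Paralog2"]])
```

With real data, both tables could be loaded with `pd.read_csv(path, sep="\t")` before calling the function. Rows in the ksd table with an empty `Node` value are skipped, since they cannot be tied to any row of the mix output.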

joehagmann commented 3 years ago

Thanks a lot, that clarified it. Just to let you know, in case this is not intended: for one of the genomes I am looking at, quite a few entries in the ksd output have empty values in the columns 'node' and 'distance'.