arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0

Warnings and errors during ksd analysis #38

Closed alslonik closed 2 years ago

alslonik commented 4 years ago

Hi, I am running ksd with wgd and getting loads of warnings and errors like these:

2020-06-21 11:34:49: WARNING    No ω value for PG7P017240 - PG2P020626!
2020-06-21 11:34:56: ERROR  Not all gene pairs present in /home/alex/wgd_venv/wgd/ks_tmp.3895c2cde7f7ac/GF_000303.codeml

What is the reason for this? Not sure what should be attached for you to answer, but maybe you will be able to give me a hint anyway..?

arzwa commented 4 years ago

Hi, by default codeml is run only on the gap-free sites of the alignment. For some families there will be no or very few sites after stripping gaps, and they will not contribute to the Ks distribution. Note that there is also a --pairwise mode which will run codeml on pairwise gap-stripped alignments.
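To make the difference between the two modes concrete, here is a minimal sketch of gap stripping on a toy codon alignment. It is not the actual wgd code: the function names, the gene IDs and the sequences are all made up for illustration, and gaps are assumed to come in whole codons. It only shows why stripping gaps across the whole family can leave far fewer sites than stripping them per pair.

```python
from itertools import combinations


def strip_gaps(seqs, gap="-"):
    """Keep only the alignment columns in which none of `seqs` has a gap."""
    columns = [col for col in zip(*seqs) if gap not in col]
    if not columns:
        return ["" for _ in seqs]
    return ["".join(chars) for chars in zip(*columns)]


def full_strip_length(aln):
    """Number of sites left after stripping gaps across the whole family."""
    return len(strip_gaps(list(aln.values()))[0])


def pairwise_strip_lengths(aln):
    """Number of sites left for each pair when gaps are stripped per pair."""
    return {
        (a, b): len(strip_gaps([aln[a], aln[b]])[0])
        for a, b in combinations(aln, 2)
    }


# Hypothetical three-gene family, 5 codons long
alignment = {
    "geneA": "ATGGCT---AAATTT",
    "geneB": "ATG---GGGAAATTT",
    "geneC": "---GCTGGGAAA---",
}

print(full_strip_length(alignment))       # sites shared by all three genes
print(pairwise_strip_lengths(alignment))  # sites available per gene pair
```

In this toy family only one codon survives family-wide stripping, while each pair keeps two or three codons, which is the kind of situation where a pairwise analysis can still yield an estimate.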

I introduced the ERROR "Not all gene pairs present in ..." when addressing #33; however, I guess it should really be a warning.

alslonik commented 4 years ago

Right. Got it. Thanks!

joehagmann commented 3 years ago

Hi Arthur, I stumbled upon the same error for each family actually, and started the tool now in the --pairwise mode. Does using this mode have an effect on the Ks distribution? Thanks a lot for a quick clarification.

arzwa commented 3 years ago

Both modes estimate Ks between pairs of codon-level aligned sequences.

I hope this makes it somewhat clearer. See for instance also the relevant code (which is on the dev branch, where I'm working on an improved version of wgd).

joehagmann commented 3 years ago

Perfect, thanks for this elaborate explanation!

arzwa commented 3 years ago

Oh, and BTW, concerning:

I stumbled upon the same error for each family actually

The larger the alignment, the more likely it is that there are few gap-free columns. Since wgd starts with the biggest families, one may get the impression that this 'error' occurs for almost all families, but usually it only affects the largest families, which are analyzed first (and which, in my experience, tend not to contribute much to the range of interest of the Ks distribution anyway).
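A rough back-of-the-envelope sketch of why family size matters, under a deliberately crude assumption that is not taken from wgd: every sequence has a gap at a given column independently with the same probability. Real alignments are not like this (gaps are correlated along sequences and across related genes), so the numbers are only meant to show the trend. The function name and the gap probability of 0.1 are hypothetical.

```python
def expected_gap_free_fraction(n_seqs, gap_prob):
    """Expected fraction of gap-free columns if each of `n_seqs` sequences
    independently has a gap at a column with probability `gap_prob`."""
    return (1.0 - gap_prob) ** n_seqs


# Assumed per-sequence gap probability of 0.1, for illustration only
for n in (2, 5, 20, 50):
    print(n, expected_gap_free_fraction(n, 0.1))
```

Under this toy model the gap-free fraction shrinks geometrically with the number of sequences, so a pair keeps most of its columns while a family of 50 keeps almost none.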