Warnings and errors during ksd analysis

alslonik commented 4 years ago

Hi, I am performing ksd with wgd and getting loads of warnings and errors like these:

2020-06-21 11:34:49: WARNING    No ω value for PG7P017240 - PG2P020626!
2020-06-21 11:34:56: ERROR  Not all gene pairs present in /home/alex/wgd_venv/wgd/ks_tmp.3895c2cde7f7ac/GF_000303.codeml

What is the reason for this? I'm not sure what I should attach for you to answer, but maybe you will be able to give me a hint anyway?

arzwa commented 4 years ago

Hi, by default codeml is run only on the gap-free sites of the alignment. For some families there will be no or very few sites after stripping gaps, and they will not contribute to the Ks distribution. Note that there is also a --pairwise mode which will run codeml on pairwise gap-stripped alignments.

I introduced the ERROR: Not all gene pairs present in ... when addressing #33; however, I guess it should be a warning.

alslonik commented 4 years ago

Right. Got it. Thanks!

joehagmann commented 3 years ago

Hi Arthur, I stumbled upon the same error for each family actually, and started the tool now in the --pairwise mode. Does using this mode have an effect on the Ks distribution? Thanks a lot for a quick clarification.

arzwa commented 3 years ago

Both modes estimate Ks between pairs of codon-level aligned sequences.

In normal mode the entire codon-level alignment is provided as input to codeml, the program that estimates Ks etc. by maximum likelihood. Codeml will by default remove all columns of the entire alignment that contain gaps (a somewhat weird behavior IMHO, see also this). So if the alignment matrix contains a sequence that is not well aligned, many columns may be removed, also for well-aligned pairs, and you may throw out a lot of informative data this way in some bad cases. On the other hand, some parameters for codeml (such as the base codon frequencies) may be better estimated in this mode since we're providing the entire alignment matrix.

In pairwise mode pairs of sequences are extracted from the complete family codon-level alignment and used independently as input to codeml. In this case all aligned codons between a pair of sequences will be used to estimate Ks, Ka, etc. When there are many families that have some sequences that are not very well aligned, this mode may result in considerably more pairs of sequences for which molecular distances can be estimated.
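
To make the difference between the two modes concrete, here is a minimal Python sketch on a toy alignment. It is only an illustration of whole-alignment versus per-pair gap stripping, not wgd's or codeml's actual code; the sequences and the helper name strip_gap_columns are made up for the example.

```python
from itertools import combinations

GAP = "-"

def strip_gap_columns(seqs, codon=3):
    """Keep only the codon columns that are gap-free in every sequence passed in.

    Hypothetical helper for illustration; not part of wgd or codeml.
    """
    length = len(next(iter(seqs.values())))
    keep = [
        i for i in range(0, length, codon)
        if all(GAP not in s[i:i + codon] for s in seqs.values())
    ]
    return {name: "".join(s[i:i + codon] for i in keep) for name, s in seqs.items()}

# Toy codon-level alignment (invented sequences); gene C is poorly aligned.
aln = {
    "A": "ATGGCTAAAGGTCTG",
    "B": "ATGGCAAAAGGTCTT",
    "C": "ATG---------CTG",
}

# Normal mode analogue: a gap in any sequence removes that codon for all pairs.
whole = strip_gap_columns(aln)
codons_whole = len(whole["A"]) // 3

# Pairwise mode analogue: gaps are stripped separately for each pair.
codons_pairwise = {
    (a, b): len(strip_gap_columns({a: aln[a], b: aln[b]})[a]) // 3
    for a, b in combinations(aln, 2)
}

print("codons kept for every pair (whole alignment):", codons_whole)
for pair, n in codons_pairwise.items():
    print("codons kept for pair", pair, "(pairwise):", n)
```

In this toy case the well-aligned pair A and B keeps all five codons when stripped pairwise, but only two when the poorly aligned sequence C is part of the same gap-stripped matrix.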

I hope this makes it somewhat clearer. See for instance also the relevant code (which is on the dev branch, where I'm working on an improved version of wgd).

joehagmann commented 3 years ago

Perfect, thanks for this elaborate explanation!

arzwa commented 3 years ago

Oh, and BTW, concerning:

"I stumbled upon the same error for each family actually"

The larger the alignment, the more likely it is that there are few gap-free columns. Since wgd starts with the biggest families, one may get the impression that this 'error' is occurring for almost all families, but usually it will only occur for the largest families, which are analyzed first (and which, in my experience, tend not to contribute much to the range of interest in the Ks distribution anyway).
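
As a rough back-of-the-envelope illustration of why family size matters, the Python sketch below assumes that each sequence has a gap in a given column independently with the same probability. That independence assumption is a simplification made up for this example (real gaps are correlated along and across sequences), and the 5% gap probability is an arbitrary illustrative value, not something measured with wgd.

```python
def expected_gap_free_fraction(n_seqs, gap_prob):
    """Expected fraction of columns without any gap, assuming each of the
    n_seqs sequences has a gap in a column independently with gap_prob.
    (Simplifying assumption for illustration only.)
    """
    return (1.0 - gap_prob) ** n_seqs

# Arbitrary illustrative per-sequence gap probability.
GAP_PROB = 0.05

for n in (2, 5, 10, 50):
    frac = expected_gap_free_fraction(n, GAP_PROB)
    print(f"{n:>3} sequences: {frac:.3f} of columns expected to be gap-free")
```

Under these assumptions a pair of sequences keeps most of its columns, while a family of 50 sequences is expected to keep fewer than one column in ten, which matches the intuition that the largest families are the ones most likely to trigger the message.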

arzwa / wgd

Warnings and errors during ksd analysis #38