amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0
379 stars 64 forks source link

Empirical protein models should normalize fixed and user-supplied amino-acid frequencies #46

Closed bredelings closed 5 years ago

bredelings commented 5 years ago

Hi,

The fixed amino-acid frequencies supplied by models like WAG and LG do not actually sum to 1.0. This doesn't really affect the rate matrix, since it is going to be rescaled to a fixed rate anyway. However, the frequencies are also used to compute the likelihood at the root. The difference is pretty small, about 1.0e-6, but still wrong I think. Perhaps if user-supplied frequencies are different is more than 1% you should error out instead of rescaling?

It would be nice to fix this so that raxml-ng passes the testiphy tests lg/2 and wag/1. Tests with uniform frequencies or user-supplied frequencies that sum to 1 already pass.

Since I guess this is also a libpll issue, let me know if I should file another bug there.

amkozlov commented 5 years ago

Hi @bredelings ,

sorry for the delay, long vacations :)

Not sure I understand your question. For WAG, we have (https://github.com/xflouris/libpll-2/blob/master/src/maps.c#L1015):

const double pll_aa_freqs_wag[20] =
 {
   0.0866279, 0.043972,  0.0390894, 0.0570451, 0.0193078,
   0.0367281, 0.0580589, 0.0832518, 0.0244313, 0.048466,
   0.086209,  0.0620286, 0.0195027, 0.0384319, 0.0457631,
   0.0695179, 0.0610127, 0.0143859, 0.0352742, 0.0708957
 };

which does sum to 1.0. Same for LG. Or do you mean that base freq should not sum to 1.0 because that's how the matrices were originally defined? I also noticed that IQTree also fails lg/2 and wag/1 tests, since it probably has the freqs defined in the same way. So are you sure our definition is wrong and we should change it? And if yes, then in which way?

Thanks in advance, Alexey

amkozlov commented 5 years ago

After discussing this issue with @stamatak, we agreed that re-normalizing the original base frequencies is a better way to go compared to the adhoc solution we have used previously (adding +-10e-6 to the last frequency). I have now modified (hopefully) all affected matrices in libpll, and so raxml-ng currently passes all tests in testiphy.

I will also add a suggest "1% deviation" check for user-specified frequencies in raxml-ng.

@bredelings : thanks for pointing out!

bredelings commented 5 years ago

Thanks, and sorry for not responding earlier.

After figuring out I needed the dev branch, I can see the tests pass now.

amkozlov commented 5 years ago

Great, thanks!