DT-ModSel calculations are inaccurate

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Start JModelTest and load the primate-mtDNA.nex data file from the 
"example-data" folder.
2. Compute likelihood scores using "BioNJ" base tree, "11" substitution 
schemes, and uncheck "+I" and "+G".
(the bug does not depend on these settings; I just wanted to reduce the model 
set to simplify comparisons).
3. Do DT calculations.

What is the expected output? What do you see instead?

I believe that the correct values are shown in this table (computed using the 
"automodel" command in a forthcoming version of PAUP*):

                                                                   delta
         #  Model               -lnL       K            DT            DT
    --------------------------------------------------------------------
        22  GTR           5934.14686      29      0.000729      0.000000
        20  TVM           5938.56165      28      0.002006      0.001276
        16  TIM2          5948.70508      27      0.007330      0.006600
        18  TIM3          5964.67497      27      0.008686      0.007956
        14  TIM           5970.48937      27      0.009140      0.008410
        12  TPM3uf        5967.29493      26      0.009318      0.008588
        10  TPM2uf        5953.37500      26      0.009349      0.008620
         8  K81uf         5973.23914      26      0.010403      0.009673
         6  TrN           5978.85492      26      0.014152      0.013423
         4  HKY           5981.72023      25      0.015449      0.014720
         7  K81           6132.14429      23      0.015855      0.015126
        11  TPM3          6130.17784      23      0.016391      0.015661
        13  TIMef         6075.94240      24      0.018402      0.017673
         3  K80           6142.42908      22      0.019825      0.019095
        17  TIM3ef        6073.43494      24      0.019870      0.019141
         5  TrNef         6086.21799      23      0.022145      0.021416
         2  F81           6284.99563      24      0.033963      0.033233
        19  TVMef         6045.36435      25      0.035274      0.034544
         1  JC            6424.20245      21      0.035506      0.034777
        21  SYM           5989.38158      26      0.036224      0.035494
         9  TPM2          6058.50316      23      0.037819      0.037090
        15  TIM2ef        6003.14926      24      0.038347      0.037618

Here is what is output by JModelTest:

    * DT MODEL SELECTION : Selection uncertainty

    Model             -lnL    K          DT      delta      weight cumWeight
    ------------------------------------------------------------------------ 
    GTR          5934.1469   29      0.0004     0.0000      0.6515    0.6515 
    TVM          5938.5617   28      0.0027     0.0024      0.0856    0.7371 
    TIM2         5948.7051   27      0.0085     0.0082      0.0272    0.7643 
    TIM3         5964.6750   27      0.0099     0.0095      0.0235    0.7878 
    TIM1         5970.4894   27      0.0106     0.0102      0.0220    0.8098 
    TPM3uf       5967.2950   26      0.0107     0.0103      0.0217    0.8315 
    TPM2uf       5953.3750   26      0.0109     0.0106      0.0212    0.8527 
    TPM1uf       5973.2392   26      0.0120     0.0117      0.0192    0.8720 
    TrN          5978.8549   26      0.0162     0.0159      0.0143    0.8862 
    HKY          5981.7202   25      0.0177     0.0174      0.0131    0.8993 
    TPM1         6132.1443   23      0.0180     0.0176      0.0129    0.9122 
    TPM3         6130.1779   23      0.0185     0.0182      0.0125    0.9247 
    TIM1ef       6075.9424   24      0.0206     0.0203      0.0112    0.9360 
    TIM3ef       6073.4349   24      0.0223     0.0219      0.0104    0.9464 
    K80          6142.4291   22      0.0225     0.0222      0.0103    0.9567 
    TrNef        6086.2180   23      0.0250     0.0246      0.0093    0.9660 
    F81          6284.9956   24      0.0385     0.0382      0.0060    0.9720 
    TVMef        6045.3644   25      0.0399     0.0395      0.0058    0.9778 
    JC           6424.2025   21      0.0403     0.0399      0.0058    0.9836 
    SYM          5989.3816   26      0.0409     0.0405      0.0057    0.9892 
    TPM2         6058.5032   23      0.0427     0.0424      0.0054    0.9946 
    TIM2ef       6003.1493   24      0.0433     0.0430      0.0054    1.0000
    ------------------------------------------------------------------------

What version of the product are you using? On what operating system?

JModelTest 2.1.4, Mac OSX 10.8.5

Please provide any additional information below.

I have done a thorough analysis of why I think the results from PAUP* are the 
correct ones in the attached document.  Also note that for the example above, I 
can get my program to match ModelTest by simulating the coding mistakes 
described in this document.

Original issue reported on code.google.com by dave.swo...@gmail.com on 17 Mar 2014 at 3:07

Attachments:

jmodeltest-dt-bug.pdf

GoogleCodeExporter commented 9 years ago

If you accept my arguments that the code must be changed, here is a better 
method for doing it than using sum(exp(log(dij) − bic[j] + min_bic)).  It is 
mathematically equivalent, just as numerically stable, and avoids an unncessary 
log.  The translation from C to Java should be straightforward.

    //assumptions:
    //  * branch length vectors 'b[i]' of length 'nedges' have been stored
    //    for all models 'i'
    //  * BIC values have already been stored into an array 'bic'
    //  * bic_min has already been determined

    double denom = 0.0;
    for (uint i = 0; i < nmodels; i++)
        {
        /* Note that we don't need to worry about underflow here, as at least
           one of the "bic[i] - bic_min" values will be zero, and any terms
           that underflow to zero when exponentiated would make a negligible
           contribution to the sum anyway.
        */
        bic_like[i] = exp_safe(-0.5*(bic[i] - bic_min));
        denom += bic_like[i];
        }

    for (uint i = 0; i < nmodels; i++)
        {
        double sum = 0.0;
        for (j = 0; j < nmodels; j++)
            if (j != i)
                sum += euclideanDistance(b[i], b[j], nedges)*bic_like[j];
        dt[i] = sum/denom;
        }

Original comment by dave.swo...@gmail.com on 17 Mar 2014 at 4:16

GoogleCodeExporter commented 9 years ago

Replacing the previous PDF...

Original comment by dave.swo...@gmail.com on 17 Mar 2014 at 4:59

Attachments:

jmodeltest-dt-bug.pdf

GoogleCodeExporter commented 9 years ago

That's right. There was a bug in those calculations. It was fixed in the trunk. 
Thank you very much for your detailed report!

Original comment by DiegoDL84 on 28 Mar 2014 at 10:14

Changed state: Fixed
Added labels: Component-Logic

GoogleCodeExporter commented 9 years ago

Hi Dave, we will look into this.

 It is possible I made such mistake 7 years ago when coding DT into Jmodeltest 0.1, although I did compare my output with that of DTModSel and the differences were minimal, so I thought they were due to using Phyml vs PAUP to get the likelihoods.

In any case we will fix it, thanks much!

So, is PAUP coming back with model selection, bayes, etc?? I hope so!!

Hope all is well, take care

D

Original comment by dposad...@gmail.com on 3 Apr 2014 at 1:41

Changed state: Started

GoogleCodeExporter commented 9 years ago

Original comment by DiegoDL84 on 2 Sep 2014 at 8:30

Changed state: Fixed

ddarriba / jmodeltest2

DT-ModSel calculations are inaccurate #26