Pas-Kapli / mptp

mPTP - a tool for single-locus species delimitation
GNU Affero General Public License v3.0

Mistake in 95% confidence interval computation #57

Closed: xflouris closed this issue 8 years ago

xflouris commented 8 years ago

Running the command:

delimit --bayes_multi 5000 --tree_file rooted.inferred_unique_taxa.1.txt --output_file example --seed 777 --bayes_sample 50 --bayes_startrandom --bayes_log

I get the following confidence interval output:

95% confidence interval on number of species: <-178.919800 , 977.080200>

@algomaus : Can you have a look at it when you have time? The issue is that we still (wrongly) assume the distribution to be symmetric around the mean, since at each check (for 95% coverage) we enlarge the interval by 1 on each side.

I believe the fix should be simple:

  1. Compute an array of prefix sums for the frequencies array.
  2. Find the mean.
  3. Do the same steps as in the current code (i.e. increase the interval by 1 on both sides), but at each step also check whether the 2.5% limit has been reached on each side (using the prefix-sums array). Once the 2.5% limit is reached on a side, stop enlarging the confidence interval from that side (see the sketch after this list).
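
For illustration, here is a minimal stand-alone C sketch of this fix (the function name, array names, and toy histogram are hypothetical, not the actual mPTP code):

    #include <stdio.h>
    #include <stdlib.h>

    /* freq[k] = how many samples had k species; bins = array length.
       On return, [*lo, *hi] is the ~95% interval: grow one step per side
       from the mean bin, freezing a side once its 2.5% tail is reached. */
    static void ci95(const long *freq, int bins, int *lo, int *hi)
    {
      long *prefix = malloc((size_t)bins * sizeof(long));  /* step 1 */
      long total = 0;
      double weighted = 0;
      for (int k = 0; k < bins; ++k)
      {
        total += freq[k];
        weighted += (double)k * freq[k];
        prefix[k] = total;
      }
      int mean = (int)(weighted / total + 0.5);            /* step 2 */
      double tail = 0.025 * (double)total;                 /* 2.5% mass per side */

      *lo = *hi = mean;                                    /* step 3 */
      for (;;)
      {
        long below = *lo > 0 ? prefix[*lo - 1] : 0;        /* mass < *lo */
        long above = total - prefix[*hi];                  /* mass > *hi */
        int moved = 0;
        if (*lo > 0 && (double)below > tail)        { --*lo; moved = 1; }
        if (*hi < bins - 1 && (double)above > tail) { ++*hi; moved = 1; }
        if (!moved) break;
      }
      free(prefix);
    }

    int main(void)
    {
      long freq[] = { 0, 1, 5, 20, 40, 20, 8, 4, 1, 1 };   /* skewed toy data */
      int lo, hi;
      ci95(freq, 10, &lo, &hi);
      printf("95%% interval on number of species: <%d, %d>\n", lo, hi);
      return 0;
    }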

The tree file is attached below.

rooted.inferred_unique_taxa.1.txt

lutteropp commented 8 years ago

I am not sure if this is what we want. See the following links:

http://stats.stackexchange.com/questions/16516/how-can-i-calculate-the-confidence-interval-of-a-mean-in-a-non-normally-distribu

http://stats.stackexchange.com/questions/112829/how-do-i-calculate-confidence-intervals-for-a-non-normal-distribution

stamatak commented 8 years ago

bootstrapping sounds nice :-)

you may also want to talk to people from the comp. stats. group at HITS, that's what they are here for ...

alexis

xflouris commented 8 years ago

Hi @stamatak and @algomaus , I believe we took the wrong path (conceptually) to estimate the confidence interval (a credible interval, in our case). Since we are doing an MCMC, what we need is the highest posterior density (HPD) interval, and for that we can use Chen and Shao's algorithm. It should be fairly simple to implement.

@algomaus : we will need to sum posterior probabilities for delimitations instead of frequencies. If you want, we can work on it together on Friday.

lutteropp commented 8 years ago

Hi @xflouris, just to be sure: will we still get a confidence interval telling us something about the number of species? After all, different species numbers could have the same or highly similar log-likelihoods... Do we just take the highest encountered log-likelihood there?

xflouris commented 8 years ago

Yes, exactly. We will get a credible interval for the number of species, but it must be based on the posterior probabilities, not on the frequencies as we have been doing so far.

stamatak commented 8 years ago

but the chains are supposed to sample proportionally to the PP, so the frequency should be an allowed measure,

alexis

xflouris commented 8 years ago

I might have expressed my statement too strictly. I didn't mean that frequency is not an allowed measure, but that for an MCMC-based Bayesian analysis, the HPD is the appropriate way of summarizing credible intervals when the marginal distribution is not symmetric, as stated in http://link.springer.com/chapter/10.1007/978-1-4612-1276-8_7. In the end, it all boils down to what we intend to say, i.e. a credible vs. a confidence interval.

The algorithm for estimating the HPD is similar to what I describe in the first post of this thread, except that we need to compute the cumulative density (i.e. an array of prefix sums of posterior probabilities instead of frequencies) for the number of species. Then we need to mark the <0.025, 0.975> interval as the credible interval. This way we compute a valid credible interval stating that, with 95% probability, the true number of species lies within it.
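
For illustration, a stand-alone sketch of that cumulative-sum computation, assuming the per-species-count posterior masses have already been normalised to sum to 1 (names and data are hypothetical, not mPTP's):

    #include <stdio.h>

    /* w[k] = normalised posterior mass for k species (sums to 1).
       Walk the cumulative sums once and mark the bins where the
       2.5% and 97.5% quantiles are crossed. */
    static void cci95(const double *w, int bins, int *lo, int *hi)
    {
      double cum = 0;
      *lo = 0;
      *hi = bins - 1;
      for (int k = 0; k < bins; ++k)
      {
        double prev = cum;
        cum += w[k];
        if (prev < 0.025 && cum >= 0.025) *lo = k;  /* 2.5% quantile bin  */
        if (prev < 0.975 && cum >= 0.975) *hi = k;  /* 97.5% quantile bin */
      }
    }

    int main(void)
    {
      double w[] = { 0.01, 0.04, 0.20, 0.45, 0.20, 0.07, 0.03 };
      int lo, hi;
      cci95(w, 7, &lo, &hi);
      printf("95%% credible interval: <%d, %d>\n", lo, hi);
      return 0;
    }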

In any case, we have concluded that credible intervals might not be very informative (the ML delimitation often lies outside the intervals), which is caused by the very flat likelihood landscapes. However, the support values seem to be in full agreement with the ML tree.

stamatak commented 8 years ago

hi tomas,

okay, then it all makes sense,

alexis

lutteropp commented 8 years ago

@xflouris What is the posterior probability for k species? We could sample different delimitations with different log-likelihoods but with the same number of species... Do we just take the highest encountered log-likelihood there?

xflouris commented 8 years ago

In our case the posterior is just proportional to the likelihood of each delimitation, since we use a uniform prior. What we need to do is sum the posteriors of all samples. I'll try to explain the method for computing the HPD in steps (a code sketch follows the list):

  1. At the moment we store the frequency with which a delimitation of X species appears, by increasing the corresponding counter of the frequencies array (i.e. frequencies[X]) by 1 each time we obtain a sample (delimitation) of X species. Instead of the frequencies, we'll need to store the sums of the log-likelihoods of all samples (delimitations) of X species. It doesn't matter that the delimitation structure is different. The change we need to make is that long *frequencies is renamed to something like double *logl_sums.
  2. Once the Bayesian run finishes and we have the array logl_sums, we sum up all of its elements (let's call this sum double all_logl_sum).
  3. Let's suppose C is the confidence we want for the HPD (in our case 0.95). We compute the threshold threshold = C * all_logl_sum.
  4. We sort the sums in the array logl_sums in descending order (from max to min), keeping track of the number of species each sum refers to; i.e., we end up with two arrays: logl_sums_sorted (from max to min) and species_count, where species_count[i] denotes the number of species for logl_sums_sorted[i].
  5. Iteratively add up the values in logl_sums_sorted (starting from element 0) until the threshold is reached, say at index N.
  6. Take all the numbers species_count[0] to species_count[N]; from those, get the minimum (a) and the maximum (b). Our HPD is (a, b).
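
Below is a literal, stand-alone C transcription of those six steps (hypothetical names and toy data, not the actual mPTP code; it sorts an index array instead of the two parallel arrays of step 4, which is equivalent, and it assumes the per-bin weights are non-negative, i.e. likelihood-derived weights rather than raw negative log-likelihoods):

    #include <stdio.h>
    #include <stdlib.h>

    /* Per-species-count weights being sorted; file scope because the
       standard qsort() takes no context pointer. */
    static const double *g_sums;

    static int cmp_desc(const void *a, const void *b)
    {
      double x = g_sums[*(const int *)a];
      double y = g_sums[*(const int *)b];
      return (x < y) - (x > y);               /* descending order */
    }

    /* logl_sums[k] = summed weight of all sampled delimitations with k
       species (step 1; assumed non-negative); c = confidence (e.g. 0.95).
       On return, [*a, *b] is the HPD interval of steps 2-6. */
    static void hpd(const double *logl_sums, int bins, double c,
                    int *a, int *b)
    {
      double all_logl_sum = 0;                            /* step 2 */
      for (int k = 0; k < bins; ++k) all_logl_sum += logl_sums[k];
      double threshold = c * all_logl_sum;                /* step 3 */

      int *order = malloc((size_t)bins * sizeof(int));    /* step 4 */
      for (int k = 0; k < bins; ++k) order[k] = k;
      g_sums = logl_sums;
      qsort(order, (size_t)bins, sizeof(int), cmp_desc);

      double acc = 0;
      *a = *b = order[0];
      for (int i = 0; i < bins; ++i)                      /* step 5 */
      {
        int k = order[i];              /* species count of i-th largest bin */
        acc += logl_sums[k];
        if (k < *a) *a = k;                               /* step 6 */
        if (k > *b) *b = k;
        if (acc >= threshold) break;
      }
      free(order);
    }

    int main(void)
    {
      /* toy weights: bin k holds the accumulated weight for k species */
      double logl_sums[] = { 0.0, 0.5, 4.0, 9.0, 3.5, 2.0, 1.0 };
      int a, b;
      hpd(logl_sums, 7, 0.95, &a, &b);
      printf("95%% HPD on number of species: <%d, %d>\n", a, b);
      return 0;
    }
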
xflouris commented 8 years ago

Just an update on this:

  1. I've talked with the guys from the statistics group -- they don't have experience with MCMC and credible intervals.
  2. I've just checked Jiajie's Bayesian implementation, and he also computes the HPD in the same way as described in the previous post.

@algomaus : I'll try to implement this now, so that Paschalia can run the experiments in the morning. It would be great if you could double-check my implementation afterwards.

stamatak commented 8 years ago

I think we need to differentiate:

  1. Summary statistics just for the plain species count, regardless of the "shape" of the delimitation, i.e., just numbers
  2. Summary statistics for identical delimitations, i.e., if two distinct delimitations have the same number of species, we would not count them together here

Tomas, have you talked to Fabian from the comp. stats. group about this? He is their expert on MCMC.

Alexis

xflouris commented 8 years ago

I've only talked to Evgeni; I'll try talking to Fabian tomorrow.

xflouris commented 8 years ago

We have implemented two types of credible intervals.

  1. A correct HPD (by definition) with a variable threshold (specified by option --bayes_credible).
  2. A CCI (central credible interval), also specified by --bayes_credible.

The CCI is what I explained in the previous post, with the sorting of the bins (numbers of species). If we were to obtain a bimodal distribution in the end (though we have not observed that in the datasets we tried), the CCI may include the central part, making the interval a bit wider than the HPD proper (the HPD would give a disconnected interval).

We noticed that the HPD is indeed disconnected (i.e. the credible interval is non-contiguous) during the burn-in phase (which is expected), but after stabilization, HPD = CCI.

Another thing: in our implementation, the HPD is now generated from the delimitation likelihoods and not from the frequencies. Frequencies could be used as well, since the delimitations are sampled from the posterior distribution, but using the likelihoods is just a little more informative (it puts a weight on the frequencies). On the tested datasets, the two methods (frequencies and likelihoods) give the same intervals.
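
For reference, the example command from the first post would then presumably be extended along these lines (the exact argument format of --bayes_credible is an assumption here):

delimit --bayes_multi 5000 --tree_file rooted.inferred_unique_taxa.1.txt --output_file example --seed 777 --bayes_sample 50 --bayes_startrandom --bayes_log --bayes_credible 0.95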

@stamatak : The HPD/CCI implementation now completes point (1) you mentioned. Identical delimitations (point 2) are seen rarely (almost never) during our MCMC runs. However, confidence for clades is summarized by the posterior probabilities (support values at each node), and an overall support for the ML tree is also computed (based on the Bayesian runs).

I think we are done now with this issue. If there are no objections from anyone, I'd close the ticket.

stamatak commented 8 years ago

yes sounds good, great that you talked to fabian :-)

alexis

xflouris commented 8 years ago

finished.