Closed xflouris closed 8 years ago
I am not sure if this is what we want. See the following links:
bootstrapping sounds nice :-)
you may also want to talk to people from the comp. stats. group at HITS, that's what they are here for ...
alexis
On 30.11.2015 11:06, Sarah Lutteropp wrote:
I am not sure if this is what we want. See the following links:
— Reply to this email directly or view it on GitHub https://github.com/Pas-Kapli/mptp/issues/57#issuecomment-160584485.
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson
www.exelixis-lab.org
Hi @stamatak and @algomaus , I believe we took the wrong path (conceptually)to estimate the confidence interval (credible intervals in our case). Since we are doing an MCMC, what we need is the highest posterior density (HPD) interval, and for that we can use Chen-Shao's algorithm. It should be fairly simple to implement.
@algomaus : we will need to sum posterior probabilities for delimitations instead of frequencies. If you want, we can work on it together on friday.
Hi @xflouris, just to be sure: Will we still get a confidence interval telling us something about the number of species? Since different species numbers could have the same or a highly similar loglikelihood...
Yes, exactly. We will get a credible interval for the number of species, but it must be based on the posterior probabilities, and not on the frequencies as we were doing so far.
but the chains are supposed to sample proportionally to the PP, so the frequency should be an allowed measure,
alexis
On 30.11.2015 12:32, Tomas Flouri wrote:
Yes, exactly. We will get a credible interval for the number of species, but it must be based on the posterior probabilities, and not on the frequencies as we were doing so far.
— Reply to this email directly or view it on GitHub https://github.com/Pas-Kapli/mptp/issues/57#issuecomment-160605741.
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson
www.exelixis-lab.org
I might have expressed my statement too strictly. I didn't mean that frequency is not an allowed measure, but that for MC-based bayesian analysis, the HPD is the appropriate way of summarizing credible intervals when the marginal distribution is not symmetric, as stated in http://link.springer.com/chapter/10.1007/978-1-4612-1276-8_7. In the end, it all boils down to what we intend to say, i.e. credible vs confidence interval.
The algorithm for estimating the HPD is similar to what I describe in the first post of this thread, except that we need to compute the cumulative density (i.e. array of prefix sums of posterior probabilities instead of frequencies) for the number of species. Then we need to mark the <0.25,0.975> interval as the credible interval. This way we compute a valid credible interval stating that, with a chance of 95%, the true number of species lies within the credible interval.
We have anyway concluded that credible intervals might not be very informative (the ML delimitation is often outside of the intervals) which is caused by the very flat likelihood landscapes. However, the support values seem to be in full agreement with the ML tree.
hi tomas,
okay, then it all makes sense,
alexis
On 30.11.2015 14:12, Tomas Flouri wrote:
I might have expressed my statement too strictly. I didn't mean that frequency is not an allowed measure, but that for MC-based bayesian analysis, the HPD is the appropriate way of summarizing credible intervals when the marginal distribution is not symmetric, as stated in http://link.springer.com/chapter/10.1007/978-1-4612-1276-8_7. In the end, it all boils down to what we intend to say, i.e. credible vs confidence interval.
The algorithm for estimating the HPD is similar to what I describe in the first post of this thread, except that we need to compute the cumulative density (i.e. array of prefix sums of posterior probabilities instead of frequencies) for the number of species. Then we need to mark the <0.25,0.975> interval as the credible interval. This way we compute a valid credible interval stating that, with a chance of 95%, the true number of species lies within the credible interval.
We have anyway concluded that credible intervals might not be very informative (the ML delimitation is often outside of the intervals) which is caused by the very flat likelihood landscapes. However, the support values seem to be in full agreement with the ML tree.
— Reply to this email directly or view it on GitHub https://github.com/Pas-Kapli/mptp/issues/57#issuecomment-160623499.
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson
www.exelixis-lab.org
@xflouris What is the posterior probability for k species? We could sample different delimitations with different loglikelihoods but with the same number of species... Do we just take the highest encountered loglikelihood there?
In our case the posterior is just the log-likelihood of each delimitation since we use a uniform prior. What we need to do is sum the posteriors of all samples. I'll try to explain the method for computing HPD in steps:
frequencies
array (i.e. frequencies[X]
) by 1 each time we obtain a sample (delimitation) of X species. Instead of the frequencies, we'll need to store the sums of log-likelihoods of all samples (delimitations) of X species. It doesn't matter that the delimitation structure is different. The change we need to make is that long * frequencies
is renamed to something like double * logl_sums
.logl_sums
, we sum up all its elements (let's call this sum double all_logl_sum
).C
is the confidence we want for the HPD (in our case 0.95). We compute the threshold threshold = C * all_logl_sum
.logl_sums
such that we also keep the number of species each sum refers to, i.e. we have two arrays: logl_sums_sorted
(from max to min) and species_count
, where species_count[i]
denotes the number of species for logl_sums_sorted[i]
.logl_sums_sorted
(starting from element 0) until you reach threshold
. Let's assume threshold is reached at index N.species_count[0]
to species_count[N]
. From that list, get the minimum (a
) and the maximum (b
). Our HPD is (a
,b
).Just an update on this:
@algomaus : I'll try to implement this now such that Paschalia does the experiments in the morning. It would be great if you can double-check my implementation afterwards.
I think we need to differentiate:
Tomas have you talked to Fabian from comp. stats. about this, he is their expert on MCMC.
Alexis
On 30.11.2015 23:31, Tomas Flouri wrote:
Just an update on this:
1.
I've talked with the guys from the statistics group -- they don't have experience with MCMC and credible intervals.
2.
I've just checked Jiajie's bayesian implementation and he also computes the HPD in the same way as described in the previous post.
@algomaus https://github.com/algomaus : I'll try to implement this now such that Paschalia does the experiments in the morning. It would be great if you can double-check my implementation afterwards.
— Reply to this email directly or view it on GitHub https://github.com/Pas-Kapli/mptp/issues/57#issuecomment-160782767.
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson
www.exelixis-lab.org
I've only talked to Evgeni, I'll try talking to Fabian tomorrow.
We have implemented two types of credible intervals.
--bayes_credible
).CCI is what I explained in the previous post with sorting the bins (number of species). In case we obtain a bimodal distribution in the end (though we have not observed that in the datasets we tried), the CCI may include the central part making the interval a bit wider than what would be the HPD by definition (the HPD would give a disconnected interval).
We noticed that HPD is indeed disconnected (i.e. the credible interval is non-continuous) during the burn-in phase (which is expected) but after stabilization, HPD = CCI.
Another thing is that in our implementation, HPD is now generated from the delimitation likelihoods and not by frequencies. Frequencies could be used as well, since the delimitations are taken from the posterior distribution, but using the likelihoods is just a little bit more informative (puts a weight on the frequencies). On the tested datasets, the two methods (frequencies and likelihoods) give the same intervals.
@stamatak : The HPD/CCI implementation now completes point (1) you mention. Identical delimitations (point 2) are rarely seen (almost never) during our MCMC run. However, confidence for clades is summarized by the posterior probabilities (support values at each node), and an overall support for the ML tree is also computed (based on the bayesian runs).
I think we are done now with this issue. If there are no objections from anyone, I'd close the ticket.
yes sounds good, great that you talked to fabian :-)
alexis
On 03.12.2015 16:12, Tomas Flouri wrote:
We have implemented two types of credible intervals.
- A correct HPD (by definition) with a variable threshold (specified by option |--bayes_credible|).
- CCI (central credible interval, also specified by --bayes_credible)
CCI is what I explained in the previous post with sorting the bins (number of species). In case we obtain a bimodal distribution in the end (though we have not observed that in the datasets we tried), the CCI may include the central part making the interval a bit wider than what would be the HPD by definition (the HPD would give a disconnected interval).
We noticed that HPD is indeed disconnected (i.e. the credible interval is non-continuous) during the burn-in phase (which is expected) but after stabilization, HPD = CCI.
Another thing is that in our implementation, HPD is now generated from the delimitation likelihoods and not by frequencies. Frequencies could be used as well, since the delimitations are taken from the posterior distribution, but using the likelihoods is just a little bit more informative (puts a weight on the frequencies). On the tested datasets, the two methods (frequencies and likelihoods) give the same intervals.
@stamatak https://github.com/stamatak : The HPD/CCI implementation now completes point (1) you mention. Identical delimitations (point 2) are rarely seen (almost never) during our MCMC run. However, confidence for clades is summarized by the posterior probabilities (support values at each node), and an overall support for the ML tree is also computed (based on the bayesian runs).
I think we are done now with this issue. If there are no objections from anyone, I'd close the ticket.
— Reply to this email directly or view it on GitHub https://github.com/Pas-Kapli/mptp/issues/57#issuecomment-161670433.
Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson
www.exelixis-lab.org
finished.
Running the command:
delimit --bayes_multi 5000 --tree_file rooted.inferred_unique_taxa.1.txt --output_file example --seed 777 --bayes_sample 50 --bayes_startrandom --bayes_log
I get the following confidence interval output:
95% confidence interval on number of species: <-178.919800 , 977.080200>
@algomaus : Can you have a look at it when you have time? The issue is that we still (wrongly) assume the distribution to be symmetric around the mean, since at each check (for 95%) we enlarge the interval by 1 from each side.
I believe the fix should be simple:
Below is attached the tree file.
rooted.inferred_unique_taxa.1.txt