EPRV3EvidenceChallenge / Inputs

Input Data & Model for the EPRV3 Evidence Challenge - Start Here

Finding the true evidence for n_planets=1 #11

Open JohannesBuchner opened 6 years ago

JohannesBuchner commented 6 years ago

Very fine grids or similar methods should allow us, albeit perhaps at severe computational cost, to find the true value for n_planets=1. This is a 7-dimensional parameter space and should have only a single mode (unless the phase is near the 0/2pi border?).
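For concreteness, here is a minimal sketch of what such a brute-force grid estimate looks like (shown in 2d with a toy Gaussian likelihood and a flat prior box; none of this is the challenge's radial-velocity model, and a fine grid in all 7 dimensions is what makes the real cost severe):

```python
import numpy as np
from scipy.special import logsumexp

# Minimal sketch of a brute-force grid estimate of the evidence
# Z = integral L(theta) pi(theta) dtheta, shown in 2d with a toy Gaussian
# likelihood. Nothing below is the actual RV model.

def log_likelihood(theta):
    return -0.5 * np.sum(theta**2) / 0.1**2       # toy peak of width 0.1

lo, hi, n = -1.0, 1.0, 400                        # uniform prior box, grid size
axis = np.linspace(lo, hi, n)
cell = axis[1] - axis[0]                          # grid spacing
log_prior_density = -2 * np.log(hi - lo)          # flat prior density in 2d

logL = np.array([[log_likelihood(np.array([x, y])) for y in axis] for x in axis])
log10Z = (logsumexp(logL) + log_prior_density + 2 * np.log(cell)) / np.log(10)
print(log10Z)                                     # analytic value: about -1.8
```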

To rank methods by their quality, it is important to know the true evidence. If we do not know the true value, we can only look at the scatter between methods; poor-quality methods can introduce scatter and make all methods look unreliable. If we have the true value, we can then understand the bias and scatter of each method.

I suggest we report approaches and results here?

JohannesBuchner commented 6 years ago

I just want to say that I am working on this, and I think I can produce some results. However, I have some concerns about posting them, because it may ruin the blind experiment. Some algorithms are already failing at n_planets=1, so this can be used as a test/discriminator; but if the value is known, teams may tune for the true value, which they otherwise could not. @eford, can we set a deadline for revealing the true values, so that in the publication we really have a blind experiment and only consider algorithms merged before that point?

eford commented 6 years ago

Hi Johannes,

I had proposed Sept 14 as the deadline for results to be included in the publication; see https://github.com/EPRV3EvidenceChallenge/Inputs/tree/master/tasks . So far no one has expressed a desire for an earlier or later date.

It's not clear to me if it would be helpful to "reveal" the true parameters used to generate the datasets before then, so that people could improve their estimates. If people were willing to reanalyze another set of data files, then I'd be less nervous about revealing the true properties before Sept 1. But if people feel that they won't be able to spend the time to analyze another set of datasets, then I'm inclined to keep the true values locked away on Ben's computer.

Cheers, Eric


JohannesBuchner commented 6 years ago

I am not talking about the true parameters, but the true logZ values produced by fine integration.

eford commented 6 years ago

Technically, no one knows the "true log Z values". Since anyone could try to compute these by brute force, I see no harm in anyone sharing the results of their attempt, so people don't have to reinvent the wheel.

Cheers, Eric


JohannesBuchner commented 6 years ago

Hi Eric,

I think you are saying that you would like to run the test at the level of Bayes-factor decisions, given that in the end Ben knows the true number of planets.

So one would define a procedure such as:

if logZ increases by more than logBcritical from 0->1, accept 1 planet, otherwise report no planets;
if logZ increases by more than logBcritical from 1->2, accept 2 planets, otherwise report 1 planet;
if logZ increases by more than logBcritical from 2->3, report 3 planets, otherwise report 2 planets;

and then we measure, for each algorithm, the smallest threshold logBcritical that does not give false positives, i.e. does not report too many planets (tuning for low type I error). At that threshold we can say how many planets the algorithm successfully detected, and rank algorithms by the number of true positives (i.e. by type II error). That would be a pretty standard way for statisticians to characterise estimators.

An algorithm with unreliable logZ estimates would have large scatter, so a large logBcritical would be assigned to it; it would therefore be insensitive to planets (fewer true positives) and ranked lower.
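A minimal sketch of such a procedure; the helper names, threshold grid, and logZ values below are illustrative, not results from the challenge:

```python
# Sketch of the sequential Bayes-factor decision rule and threshold calibration
# described above. All names, thresholds and logZ values are illustrative.

def n_planets_reported(log10Z, log_B_crit):
    """Accept planet k+1 only while log10 Z increases by more than log_B_crit."""
    n = 0
    while n + 1 < len(log10Z) and log10Z[n + 1] - log10Z[n] > log_B_crit:
        n += 1
    return n

def smallest_safe_threshold(runs, true_counts, grid):
    """Smallest threshold on `grid` that never reports more planets than the truth
    (i.e. no false positives / low type I error)."""
    for log_B_crit in sorted(grid):
        if all(n_planets_reported(z, log_B_crit) <= t
               for z, t in zip(runs, true_counts)):
            return log_B_crit
    return None

# Hypothetical example: log10 Z for n = 0..3 planets on one dataset, true n = 1.
runs = [[-170.0, -164.7, -164.5, -164.4]]
thr = smallest_safe_threshold(runs, true_counts=[1], grid=[0.5, 1.0, 2.0, 5.0])
print(thr, n_planets_reported(runs[0], thr))      # -> 0.5 1
```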

There are some subtleties (we need to agree on a procedure; what if logZ decreases from 0->1 but increases drastically from 0->2; what if the number of planets detected is right but the period is completely off; the number of data sets is limited), but overall I think this could work.

Is that roughly what you had in mind? I agree that the true parameters should stay locked away (and also think they would not be that useful for improving the integration).

Cheers, Johannes

eford commented 6 years ago

If every method could be applied to analyze >1000 datasets, then your proposal would make a lot of sense. With just 6 data sets, we can't precisely characterize the threshold for a given false-discovery rate.

So for the Sept 14 deadline, I suggest that we simply compare the results and perhaps the computing effort required.

Over lunch, a few of us discussed the possibility of a subsequent follow-up paper that would demand methods that are practical to run on >10^5 models. It wouldn't be necessary for every method to participate in the second challenge. Ideally, the first challenge would help us pick the most promising methods and parameters to be worth running in the more extensive tests. Would your team be up for that (obviously on a longer timescale)?

Cheers, Eric


JohannesBuchner commented 6 years ago

Would your team be up for that (obviously on a longer timescale)?

Yes. I think the method validation is the most interesting aspect of the challenge. This should be doable with our methods (or a subset).

JohannesBuchner commented 6 years ago

I am sharing results from trying to obtain the true integral for n_planets=1.

First I tried the algorithms as in #10. They tend to be very slow, and some fail. I would like to focus on data set 0005, where there is the largest discrepancy between (reasonable) algorithms and which therefore seems to be a difficult problem.

For example, I get (after many tens of millions of likelihood evaluations):

evidences_0005.txt

log10Z log10Zerr Method
-170.449154441 0.00434300722277 cuba-Cuhre
-166.050107589 0.539214602112 cuba-Divonne
-168.465776388 0.00434203830296 cuba-Suave
-164.724977486 0.00332222241542 importance-sampling

cuba-Divonne failed, and Cuhre and Suave missed a peak (I know because some multinest runs miss the same peak).

I then did a 40x40 grid in omega/chi and evaluated the integrals with those parameters fixed. I find values up to -168, and estimate a lower limit to the integral of -168.7. However, the highest log10-likelihood multinest found is -153.14.
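A sketch of how the conditional evidences from such a grid could be recombined into an estimate (or lower limit) of the full integral, assuming uniform priors on omega and chi over [0, 2pi); the per-cell values below are placeholders, not the actual output:

```python
import numpy as np
from scipy.special import logsumexp

# Combine conditional evidences Z(omega_i, chi_j) -- the remaining parameters
# integrated out with omega and chi held fixed -- into an estimate of the full
# integral. Assumes uniform priors on omega and chi over [0, 2*pi); the
# placeholder grid below stands in for the real per-cell results.

n = 40
log10Z_cond = np.random.uniform(-175.0, -168.0, size=(n, n))   # placeholder
d_omega = d_chi = 2 * np.pi / n                                # cell widths
log_prior_density = -2 * np.log(2 * np.pi)                     # flat in omega, chi

lnZ = logsumexp(log10Z_cond * np.log(10)) + log_prior_density + np.log(d_omega * d_chi)
print(lnZ / np.log(10))   # log10 of the combined estimate
```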

I then ran an expensive global parameter-space search to identify all maxima. The approach I used is more or less equivalent to running many simulated-annealing runs. On these maxima I run importance sampling for a long time, until I have 10000 effective samples (the efficiency is <1%, so the number of raw samples is much larger). This seems to give very robust results, judging from comparisons with various multinest runs (which sometimes find more, sometimes fewer maxima).
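As a rough illustration of the importance-sampling stage, here is a sketch of estimating the evidence and the effective sample size from a fixed proposal; the Gaussian proposal, the toy target, and all numbers are assumptions, not the actual code or model used:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

# Importance-sampling sketch: draw from a proposal q centred on previously
# located maxima, weight by (prior * likelihood) / q, and monitor the
# effective sample size. The toy Gaussian target stands in for the real
# 7d RV posterior.

rng = np.random.default_rng(1)
ndim, nsamp = 7, 200_000

def log_prior_times_likelihood(x):                 # toy unnormalised target
    return -0.5 * np.sum((x / 0.05) ** 2, axis=1)

proposal = multivariate_normal(mean=np.zeros(ndim), cov=0.1**2 * np.eye(ndim))
samples = proposal.rvs(size=nsamp, random_state=rng)

log_w = log_prior_times_likelihood(samples) - proposal.logpdf(samples)
log10Z = (logsumexp(log_w) - np.log(nsamp)) / np.log(10)
ess = np.exp(2 * logsumexp(log_w) - logsumexp(2 * log_w))   # Kish effective N
print(log10Z, ess)        # sampling efficiency is ess / nsamp
```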

The value I believe is therefore:

log10Z log10Zerr
-164.686379324 0.00438833434242

All data sets

Here are the evidences for all datasets with the expensive global search and long importance sampling:

log10Z log10Zerr Dataset
-193.983761917 0.00278460392468 evidences_0001.txt
-167.792101884 0.00312419249226 evidences_0002.txt
-163.607278175 0.00505747334185 evidences_0003.txt
-160.861911640 0.00567708100766 evidences_0004.txt
-164.686379324 0.00438833434242 evidences_0005.txt
-170.727330373 0.00336778670716 evidences_0006.txt

I would advise against tweaking the algorithms to reproduce these values. Instead, I think it would be interesting for the publication if each team studied which algorithm parameters are safe to use and at what point the method fails (to discover small solutions). This can be expressed in terms of a safe "number of particles", run iterations, or total samples drawn. We plan to do this with several runs, at least for MULTINEST. With that in hand, the paper can give recommendations for safe use of the methods and where not to cut corners.

vmaguirerajpaul commented 6 years ago

I suppose we'll be redoing these tests soon with the new priors, though in the meantime here are my 1-planet evidences (old priors) from a slightly longer run of my MCMC/nested-sampling code.

log10Z log10Zerror Dataset
-193.98 0.02 evidences_0001.txt
-168.15 0.05 evidences_0002.txt
-163.43 0.04 evidences_0003.txt
-160.56 0.04 evidences_0004.txt
-164.90 0.07 evidences_0005.txt
-170.43 0.03 evidences_0006.txt

All are within about a factor of 2 of the evidences given by @JohannesBuchner, though only for the first dataset is there agreement within the estimated 1σ (or even 2σ or 3σ).