anderkve / gledeli


How to plot / weight samples: significance of posterior_weight #9

Closed fzeiser closed 4 years ago

fzeiser commented 4 years ago

I have started on the plotting of the marginalized posteriors and plotting of some samples.
Can you remind me of the structure of the GAMBIT scanner output?

The question is connected to this plotting file: https://github.com/anderkve/gambit_np/blob/dev/plotting/plotting/posterior_draw_samples.py; I also attach the plots it creates, figs.zip.

My pseudo code to get the data starts with removing the invalid data points. I understand it such that they are not valid, as the calculation threw errors; here it's because the likelihood of D0 (calculated first) was below the threshold.

data = load_hdf5_file()
data.keys()
# ['p1', 'p2', 'posterior_weight', 'LogLike', 'LogLike_is_valid']

# remove invalid points
data = data[data['LogLike_is_valid'] == True]
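
For concreteness, the loading step I have in mind is roughly this (untested sketch; the file name, group name and dataset layout are just placeholders on my side):

import h5py
import pandas as pd

# Untested sketch: pull selected datasets from the GAMBIT hdf5 output into a DataFrame.
# "samples.hdf5" and the group/dataset names are placeholders, not the real layout.
with h5py.File("samples.hdf5", "r") as f:
    group = f["data"]
    data = pd.DataFrame({
        "p1": group["p1"][:],
        "p2": group["p2"][:],
        "posterior_weight": group["posterior_weight"][:],
        "LogLike": group["LogLike"][:],
        "LogLike_is_valid": group["LogLike_is_valid"][:].astype(bool),
    })

# remove invalid points
data = data[data["LogLike_is_valid"]]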

I'm not 100% sure whether I understand the posterior_weight correctly. I think I got too used to [MultiNest's](https://github.com/JohannesBuchner/MultiNest) post_equal_weights format, which

Contains the equally weighted posterior samples. Columns have parameter values followed by loglike value.

But from the plotting code we prepared for gambit_light, a histogram of the marginalized posteriors was created by weighting with the posterior_weights:

histogram(samples=data['p1'],
          weights=data['posterior_weights'])

So if I want to plot some random samples, model(θ), of the posterior distribution over θ, how should I do this? If the posterior were a Gaussian, I would just draw from a Gaussian. But how should I process the posterior_weights?

Assume I just want to plot something like:

posterior_samples = data.iterate()
for p in posterior_samples:
    plot(model(p))

If I do this rather naively in the code, I get very strange results, almost as if I don't have any constraints. So I guess I'd have to produce a posterior_samples_equally_weighted procedure, right?

Alternatively, can I sample this way?

posterior_samples = data.iterate()
for p in posterior_samples:
    plot(model(p), transparency=p['posterior_weights'])

I guess this should be fine. But for a concrete implementation, it is important to know how I iterate through the data, or how the data is sorted. So: how is it sorted when it comes directly from the HDF5 file?

Now maybe I also want to ask another question: I could be interested in finding the most likely θ's, and connected to that, plot the most likely predictions model(θ). What is the best way to do this? I could sort the data by the likelihood for each sample. But do I need to care about the posterior weight?

For some reason I don't recall, I sorted by posterior_weight in gambit_light, not by the likelihood. But looking at the problem again, I don't know what the significance of that is.

fzeiser commented 4 years ago

Some example plots from the code / zip file above (plots: gsf random samples; gsf samples sorted by likelihood; gsf samples sorted by posterior_weight).

For the nld, the "random samples" look even weirder (but the other ones are well constrained to something meaningful). (Plots: nld random samples; nld samples sorted by posterior_weight.)

anderkve commented 4 years ago

Sorry, this reply will be a bit quick -- can come back with more details later if needed.

My pseudo code to get the data starts with removing the invalid data points. I understand it such that they are not valid, as the calculation threw errors; here it's because the likelihood of D0 (calculated first) was below the threshold.

Yep, that's correct. If the calculations for a point did not succeed, or the total loglike was too low, the point ends up with LogLike_is_valid = False. But keeping such points is mostly intended for debug purposes. I'm pretty sure the GAMBIT version we use has a YAML option called print_invalid_points which we can set to false to stop these points from being saved to the hdf5 file in the first place. This option belongs to the likelihood block in the KeyValues part of the yaml file:

KeyValues:

  # Choose a lower cutoff for the likelihood
  likelihood:
    model_invalid_for_lnlike_below: -5e5
    print_invalid_points: false

So if I want to plot some random samples, model(θ), of the posterior distribution over θ, how should I do this? If the posterior were a Gaussian, I would just draw from a Gaussian. But how should I process the posterior_weights?

A simple way to draw samples proportional to the posterior weights is to do hit/miss sampling using the posterior weight:

  1. Find the max value in the posterior_weights dataset. (max_posterior_weight)
  2. Draw a random sample (uniform probability for all samples)
  3. Draw a uniform random number r scaled to the range 0 -- max_posterior_weight
  4. Check the posterior_weight for the given sample. If it is >= r then keep it, else reject it.
  5. Repeat steps 2 -- 4 until you have enough samples.

The surviving samples are now a set of "equally weighted" samples, so from this point on you can forget about the posterior weights of those samples. But note that if you pick many such samples (i.e. a significant size compared to the original size of your total data set) you will of course end up with repeated samples. That's OK.
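
In code, the hit/miss sampling could look roughly like this (just a sketch; I'm assuming your data sits in a pandas DataFrame with a 'posterior_weight' column, as in your pseudo code):

import numpy as np

# Hit/miss (rejection) sampling sketch: each sample is accepted with probability
# posterior_weight / max_posterior_weight, i.e. proportional to its weight.
rng = np.random.default_rng()
max_w = data['posterior_weight'].max()

n_wanted = 1000
kept_indices = []
while len(kept_indices) < n_wanted:
    i = rng.integers(len(data))                 # step 2: pick a sample uniformly
    r = rng.uniform(0.0, max_w)                 # step 3: uniform r in [0, max_posterior_weight]
    if data['posterior_weight'].iloc[i] >= r:   # step 4: keep or reject
        kept_indices.append(i)

# The surviving samples are "equally weighted" posterior samples (repeats are OK)
equal_weight_samples = data.iloc[kept_indices]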

The samples in the GAMBIT hdf5 file appear in the order they were written to the file. So generally you will have the worst points early in the datasets (corresponding to early in the scan), and better points later -- but the exact ordering will vary with scanning algorithm, printer buffer length etc. So you should make sure to order points in whatever way you prefer for the analysis you are doing.

Now maybe I also want to ask another question: I could be interested in finding the most likely θ's, and connected to that, plot the most likely predictions model(θ). What is the best way to do this? I could sort the data by the likelihood for each sample. But do I need to care about the posterior weight?

No need to worry about the posterior weights when you focus on the highest-likelihood points. So then you can just sort your datasets according to their LogLike value and plot in such an order that the max-LogLike sample is plotted on top.
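
For example, something along these lines (sketch only, reusing the plot/model pseudo code from your post and assuming a pandas DataFrame):

# Sort ascending in LogLike, so the best-fit sample is drawn last and ends up on top
data_sorted = data.sort_values('LogLike')
for _, p in data_sorted.iterrows():
    plot(model(p))   # `plot` and `model` as in the pseudo code above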

Sorting by posterior weight just gives you a different perspective. That way you focus on the max-posterior point, which is also an interesting point to look at.

In 1D GAMBIT plots -- typically for θ's directly, rather than of functions like model(E;θ) -- we often show both a posterior distribution and a likelihood ratio (or log-like difference) in the same plot. See e.g. (the not very pretty) Fig 4 in https://arxiv.org/pdf/1705.07931.pdf

Also, if you plot 2D profile likelihood maps or 2D posterior distributions, or even the plots like your 1D model(E;θ) plots, it might be interesting to indicate both the max-likelihood sample and the max-posterior sample. Looking at the separation between these two samples is a simple way to gauge the impact of the prior. In the GAMBIT paper above there's an example in Fig 5, which is a 2D posterior, but where the max-likelihood point is still marked by a grey star.
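
Picking out those two samples is simple if the data is in a pandas DataFrame (sketch; column names as in your pseudo code):

# Max-likelihood ("best-fit") sample and max-posterior sample
best_fit_sample = data.loc[data['LogLike'].idxmax()]
max_posterior_sample = data.loc[data['posterior_weight'].idxmax()]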

fzeiser commented 4 years ago

Thanks for the clarification. I realized that I did not plot random samples when I thought I did, but rather the first n samples. That's why they looked so bad. If I in fact draw from equally weighted samples, it looks much better. I also saw that there are convenience methods already included in pandas where I can draw simply by stating:

sample = res_gsf.sample(n=100, weights="posterior_weights")

(Plot: gsf random samples, equally weighted.)

Also, if you plot 2D profile likelihood maps or 2D posterior distributions, or even the plots like your 1D model(E;θ) plots, it might be interesting to indicate both the max-likelihood sample and the max-posterior sample

Makes sense! I can add the max likelihood and the max posterior to the same plot later on.

In 1D GAMBIT plots -- typically for θ's directly, rather than of functions like model(E;θ) -- we often show both a posterior distribution and a likelihood ratio (or log-like difference) in the same plot.

The posterior plot is fine/easy. But how do you create the likelihood ratio plot? Is it as simple as this?

histogram(samples=data['p1'],
          weights=data['LogLike'] - data['LogLike'].max())

Maybe you can help me to get the connection to the profile likelihood: I thought for that one I would, instead of integrating over the other dimensions, maximize the likelihood in the other dimensions. Now this is somewhat difficult if I only have samples and not an analytical expression... That's probably somehow the reason that one ends up with this one, data['LogLike'] - data['LogLike'].max()?

Assuming the treatment above is correct, I would get a figure like this for the "temperature" parameter of the nld. So this says that here the prior information was having a large impact. For other parameters, the distributions mostly overlapped. Here, the max. posterior and max. likelihood points are roughly at the same spot, but that varies. (Plot: nld_T_posterior_hist.)

Anyhow, I think this is very nice for a first quick scan! I can scan a little more / better with MultiNest and try to compare to DE, too.

anderkve commented 4 years ago

Thanks for the plots! This is great! :)

I also saw that there are convenience methods already included in pandas where I can draw simply by stating: sample = res_gsf.sample(n=100, weights="posterior_weights")

Cool -- good to know. (I should learn pandas some day.)

The posterior plot is fine/easy. But how do you create the likelihood ratio plot? Is it as simple as this?

histogram(samples=data['p1'], weights=data['LogLike'] - data['LogLike'].max())

Nope, the likelihood ratio plot is not actually a histogram/distribution -- as you say below it is a function maximized over all the parameters not plotted.

Maybe you can help me to get the connection to the profile likelihood: I thought for that one I would, instead of integrating over the other dimensions, maximize the likelihood in the other dimensions. Now this is somewhat difficult if I only have samples and not an analytical expression...

As you say, in principle we would need the analytical functions to get the profile likelihood correct. But the approximation to use is to do bin-wise maximization. Say you have samples from a likelihood function L(x,y,z) and want to plot the profile likelihood L_prof(x). Then you do the following:

  1. Divide the x range into bins.
  2. For each x bin, collect all the samples whose x value falls in that bin.
  3. Take the maximum likelihood value among those samples as L_prof(x) for that bin, i.e. maximize over (y,z) within the bin.

Then you can interpolate with a smooth line between the bin centers, or simply show the profile likelihood as if it was the outline of a histogram, even though it shouldn't be interpreted as any sort of distribution. (That way you are "honest" about the bin widths you used.)
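
If it helps, a rough numpy sketch of that bin-wise maximization could look like this (assuming the pandas DataFrame from before, with 'p1' as the plotted parameter and a 'LogLike' column):

import numpy as np

# Bin-wise profile likelihood for x = p1: maximize LogLike over all other
# parameters within each x bin.
n_bins = 40
bin_edges = np.linspace(data['p1'].min(), data['p1'].max(), n_bins + 1)
bin_index = np.clip(np.digitize(data['p1'], bin_edges) - 1, 0, n_bins - 1)

loglike_prof = np.full(n_bins, -np.inf)
for i in range(n_bins):
    in_bin = (bin_index == i)
    if in_bin.any():
        loglike_prof[i] = data['LogLike'][in_bin].max()

# Profile likelihood ratio L_prof / L_max -- equals 1 in the bin with the best-fit point
likelihood_ratio = np.exp(loglike_prof - data['LogLike'].max())
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])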

So getting an accurate profile likelihood plot means that you need enough samples to have ~approximately the max-likelihood point in (y,z) space for each individual x bin. This typically requires many more (x,y,z) samples compared to what you need to get the likelihood × prior integral over (y,z) space approximately correct for each x bin. So that's why profile likelihoods are more computationally demanding than posterior distributions.

For 1D plots for θ I would propose to use the profile likelihood ratio,

L_prof(θ) / max(L(θ))

where the max(L(θ)) is the global maximum-likelihood value. Since this ratio has a maximum value of 1 (for the global best-fit θ), it's easy to plot it in the same plot as the 1D posterior distribution, if you just rescale the posterior histogram p(θ) as

p(θ) / max(p(θ))

Using Wilks' theorem we can connect the likelihood ratio to n-sigma confidence regions for θ. For 1D plots the likelihood ratio values are exp(-1/2) ≈ 0.61 for 1 sigma, exp(-4/2) ≈ 0.14 for 2 sigma, and exp(-9/2) ≈ 0.011 for 3 sigma.

(The values correspond to those you would use to define n-sigma regions in a chi^2 fit, just translated from chi^2 --> log-likelihood difference --> likelihood ratio.)

So it might be nice to have some horizontal lines at those fixed values, at least 1-sigma and 2-sigma. Then by intersection with the profile likelihood graph you can read off the n-sigma confidence regions for θ. See Fig 15 of https://arxiv.org/pdf/1809.02097.pdf for an example. There you also see the profile-likelihood-as-histogram-outline.
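
And a quick matplotlib sketch of such a combined plot, building on the profile likelihood sketch above (again, the column and variable names are just assumptions):

import numpy as np
import matplotlib.pyplot as plt

# Overlay the rescaled 1D posterior and the profile likelihood ratio for p1.
post_hist, edges = np.histogram(data['p1'], bins=bin_edges, weights=data['posterior_weight'])
centers = 0.5 * (edges[:-1] + edges[1:])

plt.step(centers, post_hist / post_hist.max(), where='mid', label='posterior (rescaled)')
plt.step(bin_centers, likelihood_ratio, where='mid', label='profile likelihood ratio')

# n-sigma levels from Wilks' theorem (1 dof): intersections give the confidence regions
plt.axhline(np.exp(-0.5), color='gray', linestyle='--', label='1 sigma')
plt.axhline(np.exp(-2.0), color='gray', linestyle=':', label='2 sigma')

plt.xlabel('p1')
plt.legend()
plt.show()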

fzeiser commented 4 years ago

As you say, in principle we would need the analytical functions to get the profile likelihood correct. But the approximation to use is to do bin-wise maximization.

That makes more sense. I really couldn't grasp how the thing I tried before should have been connected to the profile likelihood. Now it makes sense ;P. And it looks fine, too. I'll stay with the display without smoothing for now. (Plot: nld_T_posterior_hist_small.)

So getting an accurate profile likelihood plot means that you need enough samples to have ~approximately the max-likelihood point in (y,z) space for each individual x bin. This typically requires many more (x,y,z) samples compared to what you need to get the likelihood × prior integral over (y,z) space approximately correct for each x bin. So that's why profile likelihoods are more computationally demanding than posterior distributions.

I've just plotted the results I get for now. I guess it might be somewhat more challenging to make sure that we meet the requirements. But I'll set this aside for now. The original issue is solved, I think, which was how I should/could plot this. The rest can be refined later.

anderkve commented 4 years ago

Awesome, this looks very good! :)

I've just plotted the results I get for now. I guess it might be somewhat more challenging to make sure that we meet the requirements.

Yep, it is. One thing to keep in mind with profile likelihoods is that the results depend very directly on our ability to accurately determine the best-fit point, since all the n-sigma confidence regions are determined relative to the likelihood value Lmax of that point. So for profile likelihood results you typically need to run at least some scans with very stringent convergence settings for your scanning algorithm.

The good thing is that you can without problems combine samples from different scans, regardless of sampling priors, since the results do not rely on an interpretation of the distribution of samples. So you can have one scan set up to sample the θ-space quite broadly, to make sure you don't miss any islands of good likelihood, and another scan really focused on maximizing the likelihood in a small best-fit region, and then simply join the datasets afterwards.
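
With pandas that last step is essentially just a concatenation (sketch; the two DataFrame names are placeholders):

import pandas as pd

# Join samples from a broad exploratory scan and a focused best-fit scan,
# for profile-likelihood analyses (not for posterior distributions).
combined = pd.concat([data_broad_scan, data_focused_scan], ignore_index=True)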