arviz-devs / arviz

Exploratory analysis of Bayesian models with Python
https://python.arviz.org
Apache License 2.0
1.62k stars 407 forks source link

Posterior mean outside posterior traces when zooming into az.plot_ppc with plt.xlim #1756

Open isosphere opened 3 years ago

isosphere commented 3 years ago

Short Description

I have a distribution that plot_ppc likes to zoom out of, possibly because of the tails. It zooms out so much by default that I can't really check my model posterior with it. So I use plt.xlim to zoom in, but the posterior mean is clearly (IMO) wrong as it lies outside of all of the posterior traces.

This is exclusively a problem in instances when I might want to use plt.xlim, though that's just in my limited experience.

Code Example or link

with model:
    ppc = pm.sample_posterior_predictive(trace)

converted = az.from_pymc3(posterior_predictive=ppc, model=model)
fig, ax = plt.subplots()
az.plot_ppc(converted, alpha=0.025, kind='kde', observed=True, ax=ax)

image

with model:
    ppc = pm.sample_posterior_predictive(trace)

converted = az.from_pymc3(posterior_predictive=ppc, model=model)
fig, ax = plt.subplots()
plt.xlim(-edge_values.max()*1.25, edge_values.max()*1.25)
az.plot_ppc(converted, alpha=0.025, kind='kde', observed=True, ax=ax)

image

ArviZ 0.11.2 pymc3 3.11.2 backend = matplotlib, 3.3.4

ahartikainen commented 3 years ago

Maybe we have off by one error in the mean calculation (or the x is on the edge of the bin, not center)

OriolAbril commented 3 years ago

IIRC, the mean is calculated using all posterior predictive samples whereas plot_ppc by default should only a subset as solid lines. What are the min and max values for edge_returns?

isosphere commented 3 years ago
min  = -1.97
max  =  1.62
size = 36,336,000
len  = 24,000
isosphere commented 3 years ago

IIRC, the mean is calculated using all posterior predictive samples whereas plot_ppc by default should only a subset as solid lines. What are the min and max values for edge_returns?

You might be right about the subset; I'm skimming through the code and found this:

https://github.com/arviz-devs/arviz/blob/928f8f4c34ae20f64bc787486f34dc0602bfae6f/arviz/plots/ppcplot.py#L267

But num_pp_samples should = total_pp_samples, which is the chain size * draw size.

I have 8 chains, and 3000 draws per chain. So num_pp_samples should be 24,000.

The length of my ppc edge_return array is 24,000, size is 36,336,000.

I think it's a binning issue. The bins are set based on the x_min and x_max, and given the min and max I reported above versus the distribution this makes a pretty low resolution binning strategy in my particular case. Unfortunately plot_ppc does not allow us to override this, currently. I could get around this by filtering data fed to plot_ppc, but I don't love manipulating my data used for diagnostic tests.

https://github.com/arviz-devs/arviz/blob/928f8f4c34ae20f64bc787486f34dc0602bfae6f/arviz/stats/density_utils.py#L856

OriolAbril commented 3 years ago

min = -1.97 max = 1.6

Then the dashed line labeled "mean" is doing what it is supposed to do, which may not be ideal though, not sure :thinking:. Moreover, you should also get the same axis with mean=False provided that all posterior predictive samples are plotted, with at least one solid line going up to these -1.97 and 1.6 limits.

The question now could be if it makes sense to do something like plot_density for example and truncate the kde of the mean and/or of the pp kdes so that they don't get to these very low probability extremes

isosphere commented 3 years ago

mean=False does indeed return the same axis.

image

I think I understand your comment re: the mean doing what it's supposed to do w.r.t. the average being pulled by outliers, but I don't think this explains the behaviour.

If you examine the mean line when zoomed in, you'll see that it is comprised of approximately only 8 line segments around the high density interval. I.E.: it is using 8 bins in this area. I believe this is because the bin size is set based on the entire span of the data series (from x min to x max), so we get low resolution binning where it counts. In other words, the binning has a uniform prior for where we need high resolution 😁

It might make more sense to base our binning to be some fraction of the HDI width? There's an analog here with finite element analysis meshing, though I think our case is easier. We could dynamically grid based on probability density.

Reviewing get_bins, it's based on Sturges and Freedman-Diaconis estimators that assume normality of the data. As you can see, my data are far from normal.

OriolAbril commented 3 years ago

I think I understand your comment re: the mean doing what it's supposed to do w.r.t. the average being pulled by outliers, but I don't think this explains the behaviour.

It's not that the mean is pulled by the outliers, the kde that is labeled mean is a kde generated from all samples instead of using one sample only. Therefore it's limits are exactly the min and the max of the whole posterior predictive array.

The solid kdes are generated from a single sample, so it's limits will be the min and max values within the sample. If all samples are plotted, there will be at least one that will have this -1.9 or 1.6, so matplotlib will set the axis there.

Reviewing get_bins, it's based on Sturges and Freedman-Diaconis estimators that assume normality of the data. As you can see, my data are far from normal.

We should expose the multiple kde algorithms available in ArviZ as well as their parameters which I think could fix this resolution problem. Maybe we could do the same with get_bins but I think it's not as critical. Note that get_bins is for histograms only, so it's not being used here with the kdes. Here the issue I think comes from the fact that the kdes are generated with a large yet finite number of points hoping that when plotting it will look smooth enough: https://github.com/arviz-devs/arviz/blob/main/arviz/stats/density_utils.py#L506

isosphere commented 3 years ago

Ah, gotcha. Yeah, that explains why it wants such a wide axis. I'm fine with that so long as I can zoom and get enough discretization around my HDI.

I have a dirty trick working to band-aid patch this, but I don't love it:

mad_thinning = 20
if mad_thinning is not None:
    for var in thinned_ppc:
        mad = st.median_absolute_deviation(thinned_ppc[var].flatten())
        for trace in range(len(thinned_ppc[var])):
            thinned_ppc[var][trace] = np.where(np.abs(thinned_ppc[var][trace]) >  mad*mad_thinning, np.NaN, thinned_ppc[var][trace])       

This replaces any value outside of 20 x the mean absolute deviation for a given trace with NaN, so it won't be included in the kdes.

However, simply pretending that my tails don't exist is probably dangerous. The probability density may be low out there, but the probability mass may not be.

I'd much rather throw too much computational power at using more points than are technically required to render the kdes.

Result:

image

isosphere commented 3 years ago

I edited my copy of density_utils.py to use a default grid_len of 51210 rather than 512. Here's the result with the same xlim zooming. Naturally, it took so much* longer to produce:

image

Using 512*2 is still not granular enough, and still takes much longer.

I think I'd be pretty happy with being able to specify this parameter when running plot_ppc. Even if it was done similar to how we set rcParams for matplotlib. Or maybe be able to use a different approach - this one does not scale nicely computationally.

shoepfl commented 2 years ago

I get the same bad behaviour by zooming in with xlim. When I select only 5 samples for visualization it works fine and the mean is inside the posterior predictive kdes but when i use 100 samples the mean is compleately off:

Code: import pymc import arviz as az `thinned_trace = trace.sel(draw=slice(None, None, 10))

pymc.sample_posterior_predictive(thinned_trace, model = model, extend_inferencedata=True)

ax1 = az.plot_ppc(thinned_trace, num_pp_samples=100, kind = 'kde', mean = True)

for i, ax in enumerate(ax1.flat): ax.set_xlim([-30,100])

grafik

I think the problem arises when the x axes limits are very large but I do not have a workaround yet.