arviz-devs / arviz

Exploratory analysis of Bayesian models with Python
https://python.arviz.org
Apache License 2.0

Unable to Plot Posterior Predictive Check #1658

Closed ghost closed 1 year ago

ghost commented 3 years ago

I get a weird bug when I try to run the following code, which just fits a model and then (should) plot the posterior predictive check:

#%%
with pm.Model() as maskingModel:

    # Hyperpriors
    # Few people use masks in Cameroon, from correspondents
    mu = pm.Normal("Mean", mu=-4.5, sigma=.5)
    var = pm.InverseGamma("Var", alpha=4, beta=3 * .3**2)
    sigma = pm.Deterministic("sd", var**.5)

    # Informative priors/regularizations on variance --
    # the intervention should be similarly effective in all villages.
    timeMu = pm.Normal("Average Time Effect", mu=.6, sigma=.3)
    timeVar = pm.InverseGamma("Var(Time Effect)", alpha=4, beta=3*.3**2)
    timeSigma = pm.Deterministic("sd(Time Effect)", timeVar**.5)

    # Regularizing prior:
    effectMu = pm.Normal("Average Campaign Effect", mu=0, sigma=3)
    effectVar = pm.InverseGamma("Effect Variance", alpha=4, beta=3*.3**2)
    effectSigma = pm.Deterministic("sd(Campaign Effect)", effectVar**.5)

    # Locals probably overestimate mask use --
    # people tend to overestimate small numbers
    estimateBias = pm.Normal("Bias of Est", mu=.5, sigma=1)
    estimateVar = pm.InverseGamma("Var(Est)", alpha=2, beta=.4**2)
    estimateSigma = pm.Deterministic("Std Err(Est)", estimateVar**.5)
    campaignBiasMu = pm.Normal("Bias from Campaign", mu=.2, sigma=1)
    campaignBiasVar = pm.InverseGamma("Var(Bias from Campaign)",
                                      alpha=2, beta=2*.3**2)
    campaignBiasSigma = pm.Deterministic("Std Err(Bias from Campaign)",
                                         campaignBiasVar**.5)
    campaignBias = pm.Normal("Local Bias Caused by Campaign",
                             mu=campaignBiasMu, sigma=campaignBiasSigma,
                             shape=numVil)

    constant = pm.Normal("Constants", mu, sigma, shape=numVil)
    randomChange = pm.Normal("Time effects", timeMu, timeSigma, shape=numVil)
    campaignEffect = pm.Normal("Campaign effects", effectMu, effectSigma,
                               shape=numVil)

    localBias = pm.Normal("Local Bias in Estimates", mu = estimateBias,
                          sd = estimateSigma, shape=numVil)

    maskRate = pm.Deterministic("Masking Rate (Logit)",
                        constant[data["vilIndex"]]
                        + randomChange[data["vilIndex"]] *
                        data["followup"].values
                        + campaignEffect[data["vilIndex"]]
                        * data["followup"].values
                        * data["campaign"].values
                        )

    estimates = pm.LogitNormal("Local Estimates",
        mu=maskRate[data["vilIndex"]] + localBias[data["vilIndex"]] +
           campaignBias[data["vilIndex"]] * data["campaign"] * data["followup"],
        sd=estimateSigma, observed=data.wearsMaskEst)

    counts = pm.Binomial("Counts", 550, pm.invlogit(maskRate), shape=len(data),
                    observed=data.maskCount)

    effect = pm.Deterministic("Increase in Mask-Wearing Proportion",
                              pm.invlogit(maskRate))

#%%

with maskingModel:

    trace = pm.sample(30_000, chains=6, cores=6, tune=20_000,
                      nuts={'target_accept':.96},
                      metropolis={'target_accept':.96, 'samples':30_000,
                                  'chains':6, 'cores':6},
                      init="advi+adapt_diag"
                      )

    params = ["Mean",
              "sd",
              "Average Campaign Effect",
              "sd(Campaign Effect)",
              "Average Time Effect",
              "sd(Time Effect)",
              "Counts_missing"
              ]

    az.concat(trace, az.from_dict(posterior_predictive=
                                  pm.fast_sample_posterior_predictive(trace)
                                  ), inplace=True)

#%%
with maskingModel:

    az.plot_trace(trace, var_names=params, compact=True)
    az.plot_mcse(trace, var_names=params)

    az.plot_posterior(trace, var_names=params)

    az.plot_ppc(trace, var_names=["Counts"])
    az.plot_ppc(trace, var_names=["Local Estimates"])

    az.plot_loo_pit(trace, "Counts")
    az.plot_loo_pit(trace, "Counts", kind="scatter")
    az.plot_loo_pit(trace, "Local Estimates")
    az.plot_loo_pit(trace, "Local Estimates", kind="scatter")

Returned error:

runcell(4, '/home/lime/Documents/healthModel2020/model.py')
/home/lime/.local/lib/python3.9/site-packages/arviz/stats/density_utils.py:962: RuntimeWarning: overflow encountered in long_scalars
  return np.arange(x_min, x_max + width + 1, width)
Traceback (most recent call last):

  File "/home/lime/Documents/healthModel2020/model.py", line 130, in <module>
    az.plot_ppc(trace, var_names=["Counts"])

  File "/home/lime/.local/lib/python3.9/site-packages/arviz/plots/ppcplot.py", line 318, in plot_ppc
    axes = plot(**ppcplot_kwargs)

  File "/home/lime/.local/lib/python3.9/site-packages/arviz/plots/backends/matplotlib/ppcplot.py", line 142, in plot_ppc
    ax_i.plot(

  File "/usr/lib/python3.9/site-packages/matplotlib/axes/_axes.py", line 1605, in plot
    lines = [*self._get_lines(*args, data=data, **kwargs)]

  File "/usr/lib/python3.9/site-packages/matplotlib/axes/_base.py", line 315, in __call__
    yield from self._plot_args(this, kwargs)

  File "/usr/lib/python3.9/site-packages/matplotlib/axes/_base.py", line 501, in _plot_args
    raise ValueError(f"x and y must have same first dimension, but "

ValueError: x and y must have same first dimension, but have shapes (1,) and (0,)
OriolAbril commented 3 years ago

Can you share the output of printing trace.posterior_predictive?

It could be an issue with data conversion and how ArviZ interprets which dimensions are chains, draws, and other variable dimensions. If so, I believe that using keep_size=True in fast_sample_posterior_predictive will solve the problem.
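
For reference, applied to the snippet above that would look roughly like this (an untested sketch reusing the maskingModel and trace objects from the report):

with maskingModel:
    # keep_size=True returns draws shaped (nchains, ndraws, *obs_shape), which
    # matches the layout az.from_dict expects for posterior_predictive groups
    ppc = pm.fast_sample_posterior_predictive(trace, keep_size=True)
    az.concat(trace, az.from_dict(posterior_predictive=ppc), inplace=True)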

ghost commented 3 years ago

> Can you share the output of printing trace.posterior_predictive?
>
> It could be an issue with data conversion and how ArviZ interprets which dimensions are chains, draws, and other variable dimensions. If so, I believe that using keep_size=True in fast_sample_posterior_predictive will solve the problem.

Here's the output:

<xarray.Dataset>
Dimensions:                (Counts_dim_0: 17, Local Estimates_dim_0: 17, chain: 6, draw: 30000)
Coordinates:
  * chain                  (chain) int64 0 1 2 3 4 5
  * draw                   (draw) int64 0 1 2 3 4 ... 29996 29997 29998 29999
  * Local Estimates_dim_0  (Local Estimates_dim_0) int64 0 1 2 3 ... 13 14 15 16
  * Counts_dim_0           (Counts_dim_0) int64 0 1 2 3 4 5 ... 12 13 14 15 16
Data variables:
    Local Estimates        (chain, draw, Local Estimates_dim_0) float64 0.007...
    Counts                 (chain, draw, Counts_dim_0) int64 4 9 12 ... 12 11 74
Attributes:
    created_at:     2021-04-10T17:54:13.515358
    arviz_version:  0.11.2

keep_size=True does not seem to make a difference. I suspect it's related to #982, since I have missing data that was imputed in PyMC3.

OriolAbril commented 3 years ago

Sounds related, yes. What is the output of trace.observed_data? Maybe it's possible to subset one to match the other so it can be plotted (I assume that using plot_ppc(..., observed=False) does work).
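
A rough sketch of that kind of manual subsetting, assuming the trace InferenceData from this thread and an ArviZ version where InferenceData has an isel method (the selection only touches groups that have the given dimension):

import numpy as np

# indices of the observations whose Counts value is not missing
finite_idx = np.flatnonzero(np.isfinite(trace.observed_data["Counts"].values))

# subset observed_data and posterior_predictive along the shared dimension
trace_sub = trace.isel(Counts_dim_0=finite_idx)
az.plot_ppc(trace_sub, var_names=["Counts"])

The same subset could then be passed to plot_loo_pit once the missing entries are excluded.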

ghost commented 3 years ago

@OriolAbril Here's the output of trace.observed_data:

<xarray.Dataset>
Dimensions:                (Counts_dim_0: 17, Local Estimates_dim_0: 17)
Coordinates:
  * Local Estimates_dim_0  (Local Estimates_dim_0) int64 0 1 2 3 ... 13 14 15 16
  * Counts_dim_0           (Counts_dim_0) int64 0 1 2 3 4 5 ... 12 13 14 15 16
Data variables:
    Local Estimates        (Local Estimates_dim_0) float64 0.01485 ... 0.2129
    Counts                 (Counts_dim_0) float64 2.0 7.0 15.0 ... nan nan nan
Attributes:
    created_at:                 2021-05-01T02:47:28.197519
    arviz_version:              0.11.2
    inference_library:          pymc3
    inference_library_version:  3.11.2

Plotting with observed=False does work.

Also notable is that plot_loo_pit doesn't work with missing data:

az.plot_loo_pit(trace, "Counts", ecdf=True)

/home/lime/.local/lib/python3.9/site-packages/arviz/stats/stats.py:876: RuntimeWarning: overflow encountered in exp
  weights = 1 / np.exp(len_scale - len_scale[:, None]).sum(axis=1)
Traceback (most recent call last):

  File "<ipython-input-21-7f5ef319e08b>", line 1, in <module>
    az.plot_loo_pit(trace, "Counts", ecdf=True)

  File "/home/lime/.local/lib/python3.9/site-packages/arviz/plots/loopitplot.py", line 130, in plot_loo_pit
    loo_pit = _loo_pit(idata=idata, y=y, y_hat=y_hat, log_weights=log_weights)

  File "/home/lime/.local/lib/python3.9/site-packages/arviz/stats/stats.py", line 1586, in loo_pit
    y, y_hat = smooth_data(y, y_hat)

  File "/home/lime/.local/lib/python3.9/site-packages/arviz/stats/stats_utils.py", line 563, in smooth_data
    csi = CubicSpline(x, obs_vals)

  File "/home/lime/micromamba/lib/python3.9/site-packages/scipy/interpolate/_cubic.py", line 629, in __init__
    x, dx, y, axis, _ = prepare_input(x, y, axis)

  File "/home/lime/micromamba/lib/python3.9/site-packages/scipy/interpolate/_cubic.py", line 54, in prepare_input
    raise ValueError("`y` must contain only finite values.")

ValueError: `y` must contain only finite values.

The LOO-PIT works when I drop the missing data, and also for variables with no missing data, which suggests missing data is the problem here as well. What's the intended behavior for missing data: plotting posterior means/medians, dropping it, or something else?

OriolAbril commented 3 years ago

So the issue here is that the kde/hist of the posterior predictive will include the missing (and therefore imputed) data as part of the distribution, while the observed kde/hist won't, so the two can end up being completely different distributions, which makes the posterior predictive check meaningless. For example, imagine a linear regression where we have y values for low and high x values, but not for middle values of x. The kde/hist of the observed data will be multimodal, whereas if we put nans for pymc3 to impute the missing values in y, it will look like a plateau (like for example the ppc in https://oriolabril.github.io/oriol_unraveled/python/arviz/pymc3/2019/07/31/loo-pit-tutorial.html#Linear-regression). With LOO-PIT, things are even worse because the calculation itself makes no sense when nans are present.

Thus, ArviZ should not plot ppc/loo_pit with missing data. What we can do is improve how these cases are handled. The two options that come to mind are failing, but with a specific check for nans so the error message is informative, or having ArviZ automatically subset the data to exclude nans from the plots.

Personally I am more inclined towards the 1st option, because ArviZ should always work on highly dimensional data, where subsetting to remove nans won't be straightforward: we'd get ragged data that would need to be flattened and things like that. So I think the intended behaviour should be an error, but not the error you are seeing right now; something that tells users to subset the data to do meaningful plots.
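
An informative check along those lines could look roughly like this (illustrative only, not existing ArviZ behaviour; it reuses the trace object from this thread):

import numpy as np

obs = trace.observed_data["Counts"].values
if not np.all(np.isfinite(obs)):
    # fail early with a message explaining what to do, instead of the
    # shape-mismatch error matplotlib raises much later
    raise ValueError(
        "Observed data for 'Counts' contains missing values; "
        "subset the data to exclude them before calling plot_ppc or plot_loo_pit."
    )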

ghost commented 3 years ago

> So the issue here is that the kde/hist of the posterior predictive will include the missing (and therefore imputed) data as part of the distribution, while the observed kde/hist won't, so the two can end up being completely different distributions, which makes the posterior predictive check meaningless. For example, imagine a linear regression where we have y values for low and high x values, but not for middle values of x. The kde/hist of the observed data will be multimodal, whereas if we put nans for pymc3 to impute the missing values in y, it will look like a plateau (like for example the ppc in https://oriolabril.github.io/oriol_unraveled/python/arviz/pymc3/2019/07/31/loo-pit-tutorial.html#Linear-regression). With LOO-PIT, things are even worse because the calculation itself makes no sense when nans are present.
>
> Thus, ArviZ should not plot ppc/loo_pit with missing data. What we can do is improve how these cases are handled. The two options that come to mind are failing, but with a specific check for nans so the error message is informative, or having ArviZ automatically subset the data to exclude nans from the plots.
>
> Personally I am more inclined towards the 1st option, because ArviZ should always work on highly dimensional data, where subsetting to remove nans won't be straightforward: we'd get ragged data that would need to be flattened and things like that. So I think the intended behaviour should be an error, but not the error you are seeing right now; something that tells users to subset the data to do meaningful plots.

I mean, I think it makes sense to allow plotting these both with and without imputation, so that you can see how imputation affects your data. In cases where it's only necessary to plot a kde and not specific points, it makes sense to plot the weighted mixture of the observed-data kde and the posterior distributions imputed for the missing points. It's definitely useful to be able to see where your imputations are adding probability mass -- if they're all stuck in one area, that might suggest censoring, for example.

OriolAbril commented 3 years ago

I am not sure I understand your point, do you have an example to clarify? Are you saying that in some cases it could make sense to have a ppc that compares the posterior predictive with imputed data to observations without imputed data? But not for loo_pit/bpv, right?

Also @junpenglao may be able to help as he is much more knowledgeable on missing data and imputation and how they work on pymc.

ghost commented 3 years ago

> I am not sure I understand your point, do you have an example to clarify? Are you saying that in some cases it could make sense to have a ppc that compares the posterior predictive with imputed data to observations without imputed data? But not for loo_pit/bpv, right?
>
> Also @junpenglao may be able to help as he is much more knowledgeable on missing data and imputation and how they work on pymc.

For both cases, I'm saying that it should be possible to plot both with and without imputed data using some option like inc_missing=True. This way, it would be possible to see whether your posterior predictive check or LOO-PIT plots look different when missing data is included. If observations are plotted on the kde plot using the point-cloud method (drawing a Gaussian or other pdf for each observation, then averaging these pdfs together), it should be possible to add the PDF of the probability integral transform for each imputed point. For example, if F is the estimated CDF after leaving out some missing data point x, then you can estimate a distribution for F(x) by evaluating F at each of the sampled (imputed) values of x. This distribution can be plotted instead of the usual probability cloud used with KDE.
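
A purely illustrative numpy sketch of that last idea (the arrays below just stand in for draws that would come out of the trace):

import numpy as np

rng = np.random.default_rng(0)
# y_hat: posterior predictive draws for one left-out (missing) observation
# x_imp: posterior draws of the imputed value for that same observation
y_hat = rng.normal(size=4000)
x_imp = rng.normal(loc=0.3, size=4000)

# Evaluate the empirical predictive CDF at each imputed draw, giving a
# distribution of PIT values for the missing point rather than a single value.
pit_samples = np.array([(y_hat <= x).mean() for x in x_imp])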

OriolAbril commented 3 years ago

I think that there are several different things to take into account in this discussion, so I'll try to separate those in different paragraphs.

The first thing to take into account is the type of imputation. If missing data is imputed before sampling, then everything should work the same as with no missing data at a code level, but the conceptual implications need to be weighed carefully. Everything I'll say here is for dealing with missing data as np.nan, which I believe is the case we have been considering all along; I just wanted to be explicit.

plot_ppc can work with that, but the question here is not whether it can work but whether it should. IIUC, the most common case of missing data in pymc is having missing data in the observed data only, which pymc handles by fitting the model on the non-missing data and generating posterior predictive samples for the missing data. I have never seen anybody doing ppc by combining the posterior predictive samples corresponding to the actual observed data with out-of-sample posterior predictive samples (generally called predictions). If I see a couple of usage examples of plot_ppc with missing data/predictions, I would not oppose adding a new argument to plot_ppc as you mentioned, but I can't work on that in the short/mid term; would you be interested in submitting a PR for this eventually?

plot_loo_pit has a similar goal to plot_ppc, but the way in which it tries to achieve this "common" goal is completely different. It combines cross validation with pointwise comparisons of posterior predictive samples. Conceptually, loo pit fits the model excluding one observation, generates the posterior predictive samples (predictions!) for this excluded observation, and compares these predictions to the actual observed value (because we have excluded the observation when doing cross validation, but we do know what was observed). In this case, I don't see how this can be compatible with missing data at a conceptual level; not sure what I am missing.