arviz-devs / arviz

Exploratory analysis of Bayesian models with Python
https://python.arviz.org
Apache License 2.0
1.56k stars 388 forks source link

Missing group `prior_predictive` in `idata` object for `ppcplot` #2281

Closed williambeordo closed 9 months ago

williambeordo commented 9 months ago

The following commands:

idata = pm.sample_prior_predictive(...)
az.plot_ppc(idata, group="prior")

produce the error: TypeError: "data" argument must have the group "prior_predictive" for ppcplot

PyMC version: 5.8.2 ArViz version: 0.16.1

OriolAbril commented 9 months ago

If you do inspect the inferencedata object manually, is the prior predictive group there?

If you don't have any observed variables in the pymc model, you can still use sample_prior_predictive because it samples both prior and prior predictive, and the result will be valid and contain prior samples. It won't be usable for plot_ppc though

williambeordo commented 9 months ago

No, the prior predictive group is missing in the inferencedata object, only the prior is present.

Yes, I wanted to do some checks of the model before putting data, because I was trying to follow the notebook "Prior and Posterior Predictive Checks" in the PyMC website. So, are you saying that using the prior group is the same as using the prior predictive?

Thank you for the answer.

OriolAbril commented 9 months ago

There are a lot of moving pieces and terminology which I think is a source of confusion. I'll try to clarify some of it.

We distinguish between the parameter and the observed space. The quantities stored in the prior (and then posterior) groups correspond to the parameter space, whereas the prior predictive and observed data (and posterior predictive or even predictions later) correspond to the observed space. You will see in the notebook you mentioned that even the models used for prior predictive checks have variables that use the observed argument. But that doesn't mean the values provided are used, sample_prior_predictive does not use those values at all; it only registers if the argument is provided to be able to assign each variable to one of these two spaces/groups.

The two first figures in the notebook are doing checks on the observed space using regression lines, but without comparing with the observations, so they are done manually both transformation wise (e.g. y = prior["a"] + prior["b"] * x) and plotting. That is because plot_ppc only takes care of a specific type of prior/posterior predictive checks: comparing the predictive distributions to the distribution of the observed data. So while you can do prior checks (i.e. imagine you knew the slope must be positive for that model, you might want to check the prior of b to ensure this happens) or prior predictive checks (like the ones done in the notebook) without technically using the observed data nor the quantities stored in the prior predictive group, you can do them with plot_ppc

williambeordo commented 9 months ago

Thank you for the explanation, but I’m not sure where the real problem lies. My model has observed variables, so I would like to see the prior prediction along with the observations. Some examples like this one use explicitly plot_ppc. The question is: Why do I get error when I try to produce a similar plot? i.e. why pymc.sample_prior_predictive doesn't store the prior_predictive group?

Sorry for my limited knowledge on this topic. Thank you again.

OriolAbril commented 9 months ago

Can you share your model and the output of printing the inferencedata result? If you have observed variables they should be stored in the prior_predictive group

williambeordo commented 9 months ago

Sure, here is my model:

with pm.Model() as self.model:

    # DEFINING PRIORS FOR THE MODEL PARAMETERS
    def set_prior(name, prior):
        if prior[0] == 'Normal':
            return pm.Normal(name=name, mu=prior[1], sigma=prior[2])
        elif prior[0] == 'Uniform':
            return pm.Uniform(name=name, lower=prior[1], upper=prior[2])
        elif prior[0] == 'LeftTruncatedNormal':
            return pm.TruncatedNormal(name=name, mu=prior[1], sigma=prior[2], lower=prior[3])

    mb = set_prior(self.param_names[0], priors[0])
    bb = set_prior(self.param_names[1], priors[1])
    mdt = set_prior(self.param_names[2], priors[2])
    adt = set_prior(self.param_names[3], priors[3])
    bdt = set_prior(self.param_names[4], priors[4])
    mdT = set_prior(self.param_names[5], priors[5])
    adT = set_prior(self.param_names[6], priors[6])
    bdT = set_prior(self.param_names[7], priors[7])
    g0 = set_prior(self.param_names[8], priors[8])

    obs_r = velocity_obs[0]
    obs_r_err = velocity_obs[1]
    obs_v = velocity_obs[2]
    obs_v_err = velocity_obs[3]

    # EXPECTED VALUES FOR VELOCITY AND DENSITY
    model = MWmond(bulge_par=[mb, bb], thin_disk_par=[mdt, adt, bdt], thick_disk_par=[mdT, adT, bdT], g0=g0)
    mu_vrot = model.vrot(r=obs_r)
    if density_obs is not None:
        mu_rho = model.rho_bar(r=density_obs[0], z=0.)

    # LIKELIHOOD
    pm.Normal(name='Velocity', mu=mu_vrot, sigma=obs_v_err, observed=obs_v)
    if density_obs is not None:
        pm.Normal(name='Density', mu=mu_rho, sigma=density_obs[2], observed=density_obs[1])

    # STARTING THE CALCULATIONS
    idata = pm.sample_prior_predictive(samples=n_samp, model=self.model, var_names=self.param_names)
    import jax
    import tensorflow_probability.substrates.jax as tfp
    jax.scipy.special.erfcx = tfp.math.erfcx
    idata.extend(pm.sampling_jax.sample_numpyro_nuts(draws=n_samp, **kwargs))
    idata.extend(pm.sample_posterior_predictive(idata, model=self.model))
    print(idata)
    az.plot_ppc(idata, group="prior", num_pp_samples=100)
    az.plot_ppc(idata, group="posterior", num_pp_samples=100)

While this is the output of print(idata):

Inference data with groups:
    > posterior
    > posterior_predictive
    > sample_stats
    > prior
    > observed_data
/opt/anaconda3/envs/MilkyWay-Gaia/lib/python3.10/site-packages/arviz/stats/density_utils.py:487: UserWarning: Your data appears to have a single value or no finite values
  warnings.warn("Your data appears to have a single value or no finite values")

(density_obs has only one datum) Thank you very much for the help.

OriolAbril commented 9 months ago
idata = pm.sample_prior_predictive(samples=n_samp, model=self.model, var_names=self.param_names)

You are manually setting var_names and it looks like "Velocity" and "Density" are not in self.param_names, so they are not being sampled. If you use var_names only the variables you provide + their dependencies will have their prior sampled. You should remove that argument if you want everything sampled. You'll see that before sample_prior_predictive, sample or sample_posterior_predictive start, the variables being sampled will be printed.

Side note, if you are inside the model context you can skip the model argument too.

williambeordo commented 9 months ago

Oh I see, thank you very much for your time!