BayesianModelingandComputationInPython / BookCode_Edition1

https://www.bayesiancomputationbook.com
GNU General Public License v2.0

Problem to reproduce Figure 1.5 #211

Open jc-barreto opened 1 year ago

jc-barreto commented 1 year ago

Hello, I'm trying to reproduce Figure 1.5 from the book, but since the current PyMC version differs from the one used in the book, I'm struggling. My problem is with the prior and posterior predictive distributions (the 2nd and 4th graphs).

I have defined the model as in the book:

import numpy as np
import pymc as pm
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import scipy.stats as stats
import arviz as az

az.style.use("arviz-grayscale")
plt.rcParams['figure.dpi'] = 300
np.random.seed(521)
viridish = [(0.2823529411764706, 0.11372549019607843, 0.43529411764705883, 1.0),
            (0.1450980392156863, 0.6705882352941176, 0.5098039215686274, 1.0),
            (0.6901960784313725, 0.8666666666666667, 0.1843137254901961, 1.0)]

Y = stats.bernoulli(0.7).rvs(20)
# Declare a model in PyMC
with pm.Model() as model:
    # Specify the prior distribution of unknown parameter
    θ = pm.Beta("θ", alpha=1, beta=1)

    # Specify the likelihood distribution and condition on the observed data
    y_obs = pm.Binomial("y_obs", n=1, p=θ, observed=Y)

    # Sample from the posterior distribution
    idata = pm.sample(1000, return_inferencedata=True)

pred_dists = (pm.sample_prior_predictive(1000, model=model),
              pm.sample_posterior_predictive(idata, model=model))

Next, I separate the variables I want to plot:

# Prior
prior_samples = pred_dists[0]['prior']['θ'].values # Prior observed
prior_pred_samples = pred_dists[0]['prior_predictive'] # Prior predictions

# Posterior
post_distribution = idata.posterior["θ"] # Posterior observed
posterior_pred_samples = pred_dists[1]['posterior_predictive'] # Posterior Predictions

posterior_pred_graph = posterior_pred_samples['y_obs'].sum(dim=['draw','chain'])
prior_pred_samples_graph = prior_pred_samples['y_obs'].sum(dim=['draw','chain'])

fig,axes =plt.subplots(4,1,gridspec_kw={'hspace': 0.1})

az.plot_dist(prior_samples, plot_kwargs={"color":"0.5"},
             fill_kwargs={'alpha':1}, ax=axes[0])
axes[0].set_title("Prior distribution", fontweight='bold',fontsize=10)
axes[0].set_xlim(0, 1)
axes[0].set_ylim(0, 4)
axes[0].tick_params(axis='both', pad=7)
axes[0].set_xlabel("θ")

az.plot_dist(prior_pred_samples_graph, plot_kwargs={"color":"0.5"},
             fill_kwargs={'alpha':1}, ax=axes[1])
axes[1].set_title("Prior predictive distribution", fontweight='bold',fontsize=10)
# axes[1].set_xlim(-1, 21)
# axes[1].set_ylim(0, 0.15)
axes[1].tick_params(axis='both', pad=7)
axes[1].set_xlabel("number of success")

az.plot_dist(post_distribution, plot_kwargs={"color":"0.5"},
             fill_kwargs={'alpha':1},ax=axes[2])
axes[2].set_title("Posterior distribution", fontweight='bold',fontsize=10)
axes[2].set_xlim(0, 1)
axes[2].set_ylim(0, 5)
axes[2].tick_params(axis='both', pad=7)
axes[2].set_xlabel("θ")

az.plot_dist(posterior_pred_graph, plot_kwargs={"color":"0.5"},
             fill_kwargs={'alpha':1}, ax=axes[3])
axes[3].set_title("Posterior predictive distribution", fontweight='bold',fontsize=10)
# axes[3].set_xlim(-1, 21)
# axes[3].set_ylim(0, 0.15)
axes[3].tick_params(axis='both', pad=7)
axes[3].set_xlabel("number of success")

![img](https://github.com/BayesianModelingandComputationInPython/BookCode_Edition1/assets/42352209/9d24af6a-2fd3-4a48-9b96-49822ae005d7)

And I guess I'm not evaluating posterior_pred_graph and prior_pred_samples_graph correctly? Shouldn't it be this sum? Could someone help me with this?

brendan-m-murphy commented 1 year ago

If your posterior_predictive is of InferenceData type, then try something like: az.plot_dist(posterior_pred_samples.sum('y_obs_dim_0').isel(chain=0).y_obs). Each observation is a Bernoulli sample, but there is a coordinate y_obs_dim_0 (or something similar) whose length is the number of trials. If you sum over this axis, then you get Binomial samples with n=20 (or however many samples were in the observed data used for the likelihood function).
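A minimal sketch of that summing step, using a fake array with PyMC-style dimensions instead of a real sampler run (the dimension names and shapes here are assumptions, chosen to mirror what PyMC returns for this model):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
# Fake posterior-predictive draws laid out like PyMC's output:
# 2 chains, 500 draws, 20 Bernoulli observations per draw.
y_obs = xr.DataArray(
    rng.binomial(1, 0.7, size=(2, 500, 20)),
    dims=("chain", "draw", "y_obs_dim_0"),
)

# Summing over the observation dimension (not over chain/draw!) turns each
# draw's 20 Bernoulli samples into a single Binomial(n=20) count of
# successes, which is what the predictive panels of Figure 1.5 plot.
counts = y_obs.sum("y_obs_dim_0")
print(counts.dims)   # ('chain', 'draw')
```

Note that summing over dim=['draw', 'chain'], as in the code above, collapses the wrong axes: it leaves one number per observation rather than one success count per draw.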

aaelony commented 12 months ago

Is there erratum for the book using the pymc library?

The book code on page 12 (Code 1.7), shown below, gives the error KeyError: 'y_obs':

pred_dists = (
    pm.sample_prior_predictive(1000, model)["y_obs"],
    pm.sample_posterior_predictive(idata, 1000, model)["y_obs"]
)
brendan-m-murphy commented 12 months ago

@aaelony The "sample_ ... _predictive" functions return ArviZ InferenceData, so you need to go through the group first, e.g. pm.sample_prior_predictive(1000, model).prior_predictive["y_obs"] (I think).

Also, for PyMC 5, it seems like you need to use pm.sample_posterior_predictive(idata, model). There is no argument to specify the number of samples (I think it just gives you the same number of posterior predictive samples as samples in the idata, so 1000 samples per chain by default).

aaelony commented 12 months ago

Thank you, @brendan-m-murphy, that was very helpful.

The following with var_names appears to work in PyMC 5:

pred_dists = (
    pm.sample_prior_predictive(1000, model).prior_predictive["y_obs"],
    pm.sample_posterior_predictive(idata, model, var_names=["y_obs"])
)

Likewise,

pm.sample_prior_predictive(1000, model, var_names=["y_obs"]).prior_predictive

with the var_names arg seems to work as well.

Cheers, Avram