bambinos / bambi

BAyesian Model-Building Interface (Bambi) in Python.
https://bambinos.github.io/bambi/
MIT License

[Help Needed] Easiest way to extract the summary data on the link and probability scales #446

Closed: timpiperseek closed this issue 10 months ago

timpiperseek commented 2 years ago

I am using a cut-down version of the example shown here.

I can calculate summary metrics for each parameter like the following:

import arviz as az
import bambi as bmb
import numpy as np
df = bmb.load_data("batting")

# Then clean some of the data
df["AB"] = df["AB"].replace(0, np.nan)
df = df.iloc[0:15]

priors = {
    "playerID": bmb.Prior("Normal", mu=0, sigma=1),
    "1|playerID": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("HalfNormal", sigma=1))
}

model_hierarchical = bmb.Model("p(H, AB) ~ 0 + playerID + (1|playerID)", df, family="binomial", priors=priors)
idata_hierarchical = model_hierarchical.fit(target_accept=0.96, chains=2, cores=2)

az.summary(idata_hierarchical)

I assume that this will give me the summary metrics of each parameter on the link scale, but how do I go about getting them on the probability scale?

tomicapretto commented 2 years ago

Hi @timpiperseek

I'm sorry but I'm not sure I understand your question. The summary contains information about the marginal posterior for each coefficient in the model. You can think of this model as a linear model for the logit of the probability of batting.

You can't take a single coefficient and put it into a probability scale.

However, what you can do is compute and interpret odds ratios. Are you familiar with those?
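
For example, a minimal sketch (assuming the idata_hierarchical object fitted above, and that the common effect is stored under the name "playerID"):

import numpy as np
import arviz as az

# odds ratios are the exponentiated coefficients, so you can transform
# the posterior draws directly and summarize those instead
odds_ratios = np.exp(idata_hierarchical.posterior["playerID"])
az.summary(odds_ratios)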

EDIT

Or what you want is the posterior for the probability of batting for each player?

timpiperseek commented 2 years ago

Thanks @tomicapretto,

I guess I am after the fixed effect coefficient with a marginal interpretation. This stack exchange question should help to understand where my head is at, and it also points to a couple of frequentist packages that have the ability to extract this information.

I can work with the log odds or odds ratio scale, but I would prefer (if possible) to report my findings on the probability scale, as I think it would be easier for my stakeholders to understand.

tomicapretto commented 2 years ago

"To report my finding in the probability scale" is not very specific to me. Do you mean your findings in terms of each coefficient in the model or in terms of each individual?

What you can interpret in terms of probabilities is the response for each individual. The .fit() method has an include_mean argument that, when set to True, computes the posterior of the mean of the response. In this case it is on the probability scale because this is a logistic regression.

Have a look at this:

import arviz as az
import bambi as bmb
import numpy as np

df = bmb.load_data("batting")
df["AB"] = df["AB"].replace(0, np.nan)
df = df.iloc[0:15]

priors = {
    "playerID": bmb.Prior("Normal", mu=0, sigma=1),
    "1|playerID": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("HalfNormal", sigma=1))
}

model_hierarchical = bmb.Model("p(H, AB) ~ 1 + (1|playerID)", df, family="binomial", priors=priors)
idata_hierarchical = model_hierarchical.fit(target_accept=0.9, include_mean=True)
az.plot_trace(idata_hierarchical, var_names=["p(H, AB)_mean"], compact=False);

Then you have the posterior of the probability of batting for each individual in the dataset.

[Trace plots of "p(H, AB)_mean", showing the posterior probability of batting for each player]
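
If you prefer a tabular summary on the probability scale rather than a plot, a minimal sketch (using the same fitted objects) is to summarize that same deterministic variable:

az.summary(idata_hierarchical, var_names=["p(H, AB)_mean"])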

@timpiperseek please note the following:

The code I attached here uses "p(H, AB) ~ 1 + (1|playerID)" instead of the original "p(H, AB) ~ 0 + playerID + (1|playerID)". I recommend you use the new version.

The original is a mistake I made that I haven't fixed yet. It does not make sense to have the same variable playerID both as a common effect and as the group in a group-specific effect. They end up modeling the same quantity, and that's why the group-specific effects end up centered around 0.

timpiperseek commented 2 years ago

Thanks, @tomicapretto. I feel like my example was a poor choice and is confusing the question, so I will choose another example that will hopefully make clearer what I am trying to do.

Suppose we have 20 classrooms, and each classroom has on average 20 students. I would like to test a new intervention on these students to find out what impact it has on whether a student will pass a test, but I can only perform the intervention at the classroom level.

So I will generate my data like the following.

#%% generate simulated dataframe
import uuid 
import numpy as np
import pandas as pd

n_classrooms = 20 # number of simulated classrooms 
mean_students_per_classroom = 20 
prob_treat = 0.5 # probability that a classroom is assigned to the treatment 
treatment_effect_magnitude = 0.25 # the impact of the treatment, but this is currently unknown to me

#create a list of ids for each classrooms
classroom_ids = [str(uuid.uuid4()) for _ in range(n_classrooms)]

# number of students assigned to each classrooms
student_in_each_class = np.random.poisson(lam = mean_students_per_classroom, size=n_classrooms)

# total number of students
n_students = np.sum(student_in_each_class)

# create a list of student ids
student_ids = [str(uuid.uuid4()) for _ in range(n_students)]

# mu_i is the affinity of student i to pass the test 
mu_i = np.random.uniform(-0.4, 0.4, n_students) 

# zeta_j is the impact of the classroom on each individual student
zeta_j = np.random.uniform(-0.4, 0.4, n_classrooms) 

# tau_j is the treatment effect on the classroom
tau_j = np.random.uniform(0,treatment_effect_magnitude, n_classrooms)

# assign classroom to a treatment or control
randomisation_unit = np.random.binomial(1, prob_treat, n_classrooms)

y = []
classroom_treatment = []
student_id_list = []
classroom_id_list = []

for classroom_j, classroom_id in enumerate(classroom_ids):
    treatment = randomisation_unit[classroom_j]
    for student_i, student_id in enumerate(student_ids):

        # generate probability that the student will pass the test
        prob_of_apply = mu_i[student_i] + zeta_j[classroom_j] + treatment * tau_j[classroom_j]

        outcome = 1 if prob_of_apply > 0.5 else 0

        # save result
        y.append(outcome)
        classroom_treatment.append('control' if treatment == 0 else 'treatment')
        classroom_id_list.append(classroom_id)
        student_id_list.append(student_id)

# generate dataframe
df = pd.DataFrame({
    'classroom': classroom_id_list,
    'treatment': classroom_treatment,
    'student_id': student_id_list,
    'y':y
})

# calculate the mean for each treatment group
df.groupby('treatment').y.mean()

# results for me are the following
# control      0.030067
# treatment    0.181292

So because I can only perform the intervention at the classroom level, I would like to fit a hierarchical model, which I do like so:

import bambi

classroom_prior = bambi.Prior("Beta", alpha=3, beta=3)
priors = {
    "1|classroom": classroom_prior,
}

formula = 'y ~ treatment + (1|classroom)'
model = bambi.Model(formula, df, family='bernoulli', priors=priors)
fitted_model = model.fit(
    draws=1000,
    tune=1000,
    cores=4,
    chains=4
)

So I can extract the coefficient for the treatment effect by doing the following.

import arviz as az
az.summary(fitted_model, var_names='treatment')

I get the following output.

| mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1.882 | 0.103 | 1.679 | 2.079 | 0.001 | 0.001 | 5424.0 | 3188.0 | 1.0 |

But what I really want to know is the average impact of the intervention for each student, preferably on the probability scale. I think I can do it like this, but I also suspect this might not be correct and probably isn't the most efficient way to do it, as I am still getting used to the idata syntax, so I would love your input on this.

My approach is to predict across every observation as if it were assigned the control, then repeat as if every observation were assigned the treatment, and take the difference. Which looks like the below:


# predict as if all classrooms were assigned the control
df['treatment'] = 'control'
control_pred = model.predict(fitted_model, data=df, inplace=False)
control_pred_y_mean = control_pred.posterior.y_mean.to_dataframe()

# predict as if all classrooms were assigned the treatment
df['treatment'] = 'treatment'
treatment_pred = model.predict(fitted_model, data=df, inplace=False)
treatment_pred_y_mean = treatment_pred.posterior.y_mean.to_dataframe()

# calculate the absolute difference on the probability scale
draws_prob_abs = treatment_pred_y_mean - control_pred_y_mean
prob_abs_diff_post_mean = draws_prob_abs.mean()[0]
prob_abs_diff_ci = draws_prob_abs.y_mean.quantile([.025, .975]).to_dict()

print(f'the intervention had a mean impact of {prob_abs_diff_post_mean:.2%} with a credible region of between {prob_abs_diff_ci[0.025]:.2%} and {prob_abs_diff_ci[0.975]:.2%}')

which provides the following output: the intervention had a mean impact of 14.21% with a credible region of between 9.75% and 22.22%.

tomicapretto commented 2 years ago

Hi @timpiperseek, I've been having a look at this and I have a couple of comments.

  1. From the data generating process it looks like students are not nested within classrooms. If you have ~20 students per classroom and 20 classrooms, you expect ~400 observations, but we have ~8000. This is because you iterate over all students within the iteration over classrooms. Is that on purpose?
  2. prob_of_apply = mu_i[student_i] + zeta_j[classroom_j] + treatment * tau_j[classroom_j] is not actually a probability because it is not bounded between 0 and 1. You could pass this value through a logistic sigmoid function (see https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.expit.html), as in the sketch below.
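
For illustration, a minimal sketch (the values are hypothetical, just mirroring the terms in your formula):

from scipy.special import expit

# hypothetical values for the three terms of the linear predictor
linear_term = 0.4 + 0.3 + 1 * 0.25  # mu_i + zeta_j + treatment * tau_j

# expit maps the real line into (0, 1), so the result is a valid probability
prob_of_apply = expit(linear_term)
print(prob_of_apply)  # ~0.72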

I have other comments about the model and the priors but first I would like to check the data generating process.

timpiperseek commented 2 years ago

Thanks @tomicapretto

  1. You are correct that there was a mistake; I have now amended it.
  2. I am happy to take your suggestion here and pass it through the expit function.

I have updated the data generating function to be the following.


#%% generate simulated dataframe

import uuid 
import numpy as np
import pandas as pd
from scipy.special import expit

n_classrooms = 20 # number of simulated classrooms 
mean_students_per_classroom = 20 
prob_treat = 0.5 # probability that a classroom is assigned to the treatment 
treatment_effect_magnitude = 0.25 # the impact of the treatment, but this is currently unknown to me

#create a list of ids for each classrooms
classroom_ids = [str(uuid.uuid4()) for _ in range(n_classrooms)]

# number of students assigned to each classrooms
student_in_each_class = np.random.poisson(lam = mean_students_per_classroom, size=n_classrooms)

# total number of students
n_students = np.sum(student_in_each_class)

# create a list of student ids
student_ids = [str(uuid.uuid4()) for _ in range(n_students)]

# mu_i is the affinity of student i to pass the test 
mu_i = np.random.uniform(-0.4, 0.4, n_students) 

# zeta_j is the impact of the classroom on each individual student
zeta_j = np.random.uniform(-0.4, 0.4, n_classrooms) 

# assign classroom to a treatment or control
randomisation_unit = np.random.binomial(1, prob_treat, n_classrooms)

y = []
classroom_treatment = []
student_id_list = []
classroom_id_list = []
student_i = -1

for classroom_j, classroom_id in enumerate(classroom_ids):
    treatment = randomisation_unit[classroom_j]
    for student_ in range(student_in_each_class[classroom_j]):
        student_i += 1
        student_id = student_ids[student_i]

        # generate probability that the student will pass the test
        prob_of_apply = mu_i[student_i] + zeta_j[classroom_j] + treatment * treatment_effect_magnitude
        prob_of_apply = expit(prob_of_apply)

        outcome = 1 if prob_of_apply > 0.5 else 0

        # save result
        y.append(outcome)
        classroom_treatment.append('control' if treatment == 0 else 'treatment')
        classroom_id_list.append(classroom_id)
        student_id_list.append(student_id)

# generate dataframe
df = pd.DataFrame({
    'classroom': classroom_id_list,
    'treatment': classroom_treatment,
    'student_id': student_id_list,
    'y':y
})

tomicapretto commented 2 years ago

Have a look at this gist https://gist.github.com/tomicapretto/25f94bdf070497e0fbf129edd23434a5 and let me know if that helps.

I fixed the data generating process and the way the priors are specified. Have a look at the priors for the varying intercept; note it has a hyperprior that was missing in your original implementation. Also note that the generation of 'y' is random now.
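
For reference, a minimal sketch of those two changes (the gist is the authoritative version; the names here just mirror your earlier code):

import numpy as np
from scipy.special import expit
import bambi as bmb

# a varying intercept whose standard deviation gets its own hyperprior,
# instead of a plain Beta prior on the intercepts themselves
priors = {
    "1|classroom": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("HalfNormal", sigma=1))
}

# random generation of y: draw from a Bernoulli distribution with
# probability prob_of_apply instead of thresholding it at 0.5
prob_of_apply = expit(0.2)                      # some value of the linear predictor
outcome = np.random.binomial(1, prob_of_apply)  # the outcome is now random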