**fatihbozdag** opened 1 year ago
I found a similar plot here (https://bambinos.github.io/bambi/notebooks/hierarchical_binomial_bambi.html), comparing two models' estimates of the same variable (this is exactly what I want to do, but with a categorical response). However, the problem is that it is almost impossible to apply the same method to data with more than 14,000 observations!

```python
_, ax = plt.subplots(figsize=(8, 8))

# Add vertical line for the global probability of hitting
ax.axvline(x=(df["H"] / df["AB"]).mean(), ls="--", color="black", alpha=0.5)

# Create forest plot with ArviZ, only for the mean
az.plot_forest(
    [idata_non_hierarchical, idata_hierarchical],
    var_names="p(H, AB)_mean",
    combined=True,
    colors=["#666666", RED],
    linewidth=2.6,
    markersize=8,
    ax=ax,
)

# Create custom y-axis tick labels
ylabels = [f"H: {round(h)}, AB: {round(ab)}" for h, ab in zip(df["H"].values, df["AB"].values)]
ylabels = list(reversed(ylabels))

# Put the y-axis labels in the middle of the original tick mark locations
ax.set_yticklabels(ylabels, ha="right")

# Create legend
handles = [
    Patch(label="Non-hierarchical", facecolor="#666666"),
    Patch(label="Hierarchical", facecolor=RED),
    Line2D([0], [0], label="Mean probability", ls="--", color="black", alpha=0.5),
]
legend = ax.legend(handles=handles, loc=4, fontsize=14, frameon=True, framealpha=0.8)
```
@fatihbozdag thanks for opening the issue!
Could you share a reproducible example (with some data) so I can better help you? The data of course does not need to be real; it can be either simulated or anonymized.
Sure thing! Indeed, I've uploaded the pickled inference data here: https://figshare.com/articles/software/results_pickle/21789347. But here is some toy data with sample code.
```python
import numpy as np
import pandas as pd
import bambi as bmb

x = np.array(["Pred1"] * 40 + ["Pred2"] * 40 + ["Pred3"] * 70)
y = np.array(["A"] * 50 + ["B"] * 50 + ["C"] * 50)
r = np.array(["R1"] * 90 + ["R2"] * 60)
data = pd.DataFrame({"y": y, "x": x, "r": r})

model = bmb.Model("y ~ 0 + x + (1|r)", data, family="categorical", auto_scale=True)
model.build()
model
```
```
Formula: y ~ 0 + x + (1|r)
Family name: Categorical
Link: softmax
Observations: 150
Priors:
  Common-level effects
    x ~ Normal(mu: [0. 0. 0.], sigma: [5.6533 5.6533 5.0111])
  Group-level effects
    1|r ~ Normal(mu: 0, sigma: HalfNormal(sigma: 4.0329))
```
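As a side note on the `softmax` link above: there is one linear predictor per non-reference response level ("B" and "C"), while the reference level ("A", absent from `y_dim`) effectively has its predictor fixed at 0, which is why it carries no coefficients. A minimal numpy sketch of that mapping (the predictor values are made up for illustration):

```python
import numpy as np

def categorical_probs(eta_nonref):
    """Map linear predictors for the non-reference levels to probabilities.

    eta_nonref has shape (..., k-1): predictors for the non-reference
    levels. The reference level's predictor is fixed at 0.
    """
    # Prepend the implicit 0 for the reference level
    eta = np.concatenate(
        [np.zeros(eta_nonref.shape[:-1] + (1,)), eta_nonref], axis=-1
    )
    eta = eta - eta.max(axis=-1, keepdims=True)  # numerical stability
    exp_eta = np.exp(eta)
    return exp_eta / exp_eta.sum(axis=-1, keepdims=True)

# Hypothetical predictors for levels "B" and "C" (reference "A" is implicit)
probs = categorical_probs(np.array([0.5, -1.0]))
# probs[0] = P(A), probs[1] = P(B), probs[2] = P(C); they sum to 1
```

This is why the fitted posterior only has two columns in `y_dim` even though the response has three levels.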
```python
with model.backend.model:
    idata = pm.sampling_jax.sample_numpyro_nuts(
        draws=250,
        tune=50,
        target_accept=0.99,
        postprocessing_backend="cpu",
        idata_kwargs={"log_likelihood": True},
        chain_method="parallel",
    )
    posterior_predictive = pm.sample_posterior_predictive(trace=idata, extend_inferencedata=True)
```
```python
idata
```

```
Inference data with groups:
    > posterior
    > posterior_predictive
    > log_likelihood
    > sample_stats
    > observed_data
```
```python
idata.posterior
```

```
<xarray.Dataset>
Dimensions:        (chain: 4, draw: 250, x_dim: 3, y_dim: 2, r__factor_dim: 2)
Coordinates:
  * chain          (chain) int64 0 1 2 3
  * draw           (draw) int64 0 1 2 3 4 5 6 7 ... 243 244 245 246 247 248 249
  * x_dim          (x_dim) <U5 'Pred1' 'Pred2' 'Pred3'
  * y_dim          (y_dim) <U1 'B' 'C'
  * r__factor_dim  (r__factor_dim) <U2 'R1' 'R2'
Data variables:
    x              (chain, draw, x_dim, y_dim) float64 -7.286 -2.195 ... 7.844
    1|r_offset     (chain, draw, r__factor_dim, y_dim) float64 -0.1529 ... -0...
    1|r_sigma      (chain, draw) float64 10.41 4.648 5.216 ... 5.3 2.921 8.414
    1|r            (chain, draw, r__factor_dim, y_dim) float64 -1.591 ... -1.161
Attributes:
    created_at:     2023-02-04T12:42:15.801626
    arviz_version:  0.14.0
```
```python
model.predict(idata)  # Bambi's predict adds the posterior mean "y_mean" in place
idata.posterior
```

```
<xarray.Dataset>
Dimensions:        (chain: 4, draw: 250, x_dim: 3, y_dim: 2, r__factor_dim: 2,
                    y_mean_dim: 3, y_obs: 150)
Coordinates:
  * chain          (chain) int64 0 1 2 3
  * draw           (draw) int64 0 1 2 3 4 5 6 7 ... 243 244 245 246 247 248 249
  * x_dim          (x_dim) <U5 'Pred1' 'Pred2' 'Pred3'
  * y_dim          (y_dim) <U1 'B' 'C'
  * r__factor_dim  (r__factor_dim) <U2 'R1' 'R2'
  * y_mean_dim     (y_mean_dim) <U1 'A' 'B' 'C'
  * y_obs          (y_obs) int64 0 1 2 3 4 5 6 7 ... 143 144 145 146 147 148 149
Data variables:
    x              (chain, draw, x_dim, y_dim) float64 -7.286 -2.195 ... 7.844
    1|r_offset     (chain, draw, r__factor_dim, y_dim) float64 -0.1529 ... -0...
    1|r_sigma      (chain, draw) float64 10.41 4.648 5.216 ... 5.3 2.921 8.414
    1|r            (chain, draw, r__factor_dim, y_dim) float64 -1.591 ... -1.161
    y_mean         (chain, draw, y_obs, y_mean_dim) float64 0.9968 ... 0.7472
Attributes:
    created_at:     2023-02-04T12:42:15.801626
    arviz_version:  0.14.0
```
I'm not sure I understand exactly what you want to do
Are you trying to reproduce this figure? https://user-images.githubusercontent.com/59308858/216736802-9b1f1f97-80f9-4696-976d-427f8bad4c54.png Although that one has two categories and your response contains three.
"plot ppc (probabilities) of categorical response per each variable included in the random effects"
You could create a plot where the y-axis shows the probability of the outcome, the horizontal axis shows the levels of a grouping variable used in the "random effects" part, and multiple estimates with credible intervals are displayed (one per response level). Is that what you want? Just take into account that you'll need to pick a value for any other predictor your model may have.
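The ingredients of such a plot can be sketched with plain numpy, using simulated draws in place of the real `y_mean` posterior (the group labels mirror the toy data above, but the draw count and probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

levels = ["A", "B", "C"]
groups = np.array(["R1"] * 90 + ["R2"] * 60)  # one group label per observation

# Simulated stand-in for the flattened posterior of y_mean:
# shape (n_draws, n_obs, n_levels), each row summing to 1
raw = rng.gamma(2.0, size=(1000, 150, 3))
y_mean = raw / raw.sum(axis=-1, keepdims=True)

# For each group, average y_mean over its observations per draw, then
# summarize each response level with a posterior mean and a 94% interval
for g in np.unique(groups):
    per_draw = y_mean[:, groups == g, :].mean(axis=1)  # (n_draws, n_levels)
    means = per_draw.mean(axis=0)
    lower, upper = np.percentile(per_draw, [3, 97], axis=0)
    for k, level in enumerate(levels):
        print(f"{g} {level}: {means[k]:.3f} [{lower[k]:.3f}, {upper[k]:.3f}]")
```

From there, `ax.errorbar` with the group levels on the x-axis and one slightly offset series per response level gives the kind of figure described above.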
Yes, that's exactly what I want to do: plotting the probabilities of A, B, and C given R1 and R2.

P.S. Sorry, I cannot focus on the topic and write more details, as a massive earthquake struck my hometown, Adana/Turkey. Can you please give me a few days? Currently, I am working as a freelance translator to guide rescue teams.
Hi again. I am back to ordinary life (sort of). How can I produce such a plot? Yes, this is a categorical regression with a three-level response.

Maybe I should add more info. Indeed, the problem is that I am unable to include the response reference level in the plot; in other words, ArviZ's `az.plot_forest` works, but without the reference level.
An update on another unsuccessful attempt: after `model.predict(idata)`, values including the reference level are added to the inference data under the `y_mean_dim` dimension. However, ArviZ's `coords` does not accept the new dimension as a valid one.

```python
coords = {"y_dim": ["B", "C"]}  # works, but excludes the reference level (assuming "A" is the reference)
az.summary(idata, var_names="x", coords=coords)

coords = {"y_mean_dim": ["B", "C"]}  # does not work, even though idata.posterior.y_mean_dim includes the reference level
az.summary(idata, var_names="x", coords=coords)
```

The second call says:

```
KeyError: 'Coords should follow mapping format {coord_name:[dim1, dim2]}. Check that coords structure is correct and dimensions are valid. "\'y_mean_dim\' is not a valid dimension or coordinate"'
```
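One way around this error (a sketch, assuming the posterior structure shown earlier): `y_mean` is an ordinary xarray variable, so the reference level can be selected with xarray's own `.sel` instead of going through `az.summary`'s `coords`. Here a simulated dataset stands in for the real `idata.posterior`:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Simulated stand-in for idata.posterior after model.predict(idata):
# dirichlet draws give rows that sum to 1 across y_mean_dim
posterior = xr.Dataset(
    {
        "y_mean": (
            ("chain", "draw", "y_obs", "y_mean_dim"),
            rng.dirichlet(np.ones(3), size=(4, 250, 150)),
        )
    },
    coords={"y_mean_dim": ["A", "B", "C"]},
)

# Select the reference level directly; no ArviZ coords needed
p_ref = posterior["y_mean"].sel(y_mean_dim="A")
# p_ref has dims (chain, draw, y_obs) and holds P(y = "A") per observation
```

With the real `idata`, pairing the `y_mean_dim` coord with `var_names="y_mean"` (the variable that actually carries that dimension, rather than `x`) may also avoid the `KeyError`, though I have not verified this against a specific ArviZ version.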
@fatihbozdag I'm sorry you were affected by the earthquake. I hope you're doing better now!
What I don't understand from this visualization https://user-images.githubusercontent.com/59308858/216736802-9b1f1f97-80f9-4696-976d-427f8bad4c54.png is what the line represents. I guess the line is a probability going from 0 to 1, and then the closer it's to zero, the closer it's to DO, and the closer it's to one, the closer it's to PD. Is that correct?
But in this case, you have three response levels, so you can't get a plot like that one.
Also, you have other predictors on top of the group-level effect you're interested in (`r` in the toy example). How did you handle these other predictors in the visualization above? Did you set them to a fixed value, or did you marginalize over their possible range of values?
Thank you, @tomicapretto,
Well, I'm doing much better, but the whole country is struggling and is likely to do so for a while. As I write this post, we've just had another one (5.1). I guess we have to get used to it. Anyway.
Your interpretation is correct. Other predictors are kept at the response level since all predictors are categorical variables.
To clarify,

```python
model.predict(idata, kind="pps", include_group_specific=True)

ax = az.plot_ppc(idata)
ax.set_xticks([0.5, 1.5, 2.5])
ax.set_xticklabels(model.response.term.levels)
```

provides such a plot. So what if I would like to get such a plot per random effect?
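One possible route, sketched with simulated posterior predictive draws in place of the real `idata.posterior_predictive` (coding the response as level indices is an assumption about the stored format): subset the observation axis by group before tallying category proportions.

```python
import numpy as np

rng = np.random.default_rng(0)

levels = ["A", "B", "C"]
groups = np.array(["R1"] * 90 + ["R2"] * 60)  # group label per observation

# Simulated stand-in for posterior predictive draws of the response,
# coded as level indices: shape (n_draws, n_obs)
pps = rng.integers(0, 3, size=(1000, 150))

# Per group, the proportion of each response level within every draw
for g in np.unique(groups):
    draws = pps[:, groups == g]  # (n_draws, n_obs_in_group)
    props = np.stack(
        [(draws == k).mean(axis=1) for k in range(3)], axis=1
    )  # (n_draws, n_levels)
    means = props.mean(axis=0)
    lower, upper = np.percentile(props, [3, 97], axis=0)
    for k, level in enumerate(levels):
        print(f"{g} {level}: {means[k]:.3f} [{lower[k]:.3f}, {upper[k]:.3f}]")
```

Plotting these per-group proportions side by side (one interval per response level, one x-position or panel per group) gives a group-level analogue of the `az.plot_ppc` summary.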
Greetings all,
I've been looking for a way to plot ppc (probabilities) of categorical response per each variable included in the random effects. Indeed, I asked the question on the ArviZ GitHub (https://github.com/arviz-devs/arviz/issues/2188); however, I haven't found a solution yet.
So basically, following the code below, I want to plot the probabilities of each item in the categorical response variable (3 levels) per random effect (2 levels). I did this before with `rstanarm` and `easystats` in R for a binomial distribution, yet got stuck here in Python.

This is from my own study.

![image_2023-02-04_040254680](https://user-images.githubusercontent.com/59308858/216736802-9b1f1f97-80f9-4696-976d-427f8bad4c54.png)