arviz-devs / arviz

Exploratory analysis of Bayesian models with Python
https://python.arviz.org
Apache License 2.0
1.6k stars 401 forks source link

Example using multidimensional coordinates to label a dimension for a parameter? #1628

Closed jburos closed 3 years ago

jburos commented 3 years ago

Short Description

I'm trying to fit a time-series model which my observed data are in a ragged-array or denormalized format (one row per subject-time), and I would like to label certain dimensions with additional information: not only an observation_id, but also subject_id and time. This would apply to predicted quantities, some parameters (depending on use case), and the observed data.

I am considering using xarray's multi-dimensional coordinates to add non-dimension coordinates to the posterior draws, but I haven't been able to put the pieces together just yet. Curious if there is an example of how to use these for inference data?

Aside: this is an example of this use case, one I expect is most common, but I am also thinking of other use cases for multidimensional coordinates. As an example, when fitting hierarchical models, it would be nice to label lower-level parameters with the higher-level group of which they are members.

Code Example or link

This turned out to be a LOT easier than I thought it would be. Including here in case it helps others.

Building on the from CmdStanPy example in the docs:

from cmdstanpy import CmdStanModel
import numpy as np
import arviz as az

schools_code = """
data {
    int<lower=0> J;
    real y[J];
    real<lower=0> sigma[J];
}

parameters {
    real mu;
    real<lower=0> tau;
    real theta_tilde[J];
}

transformed parameters {
    real theta[J];
    for (j in 1:J)
        theta[j] = mu + tau * theta_tilde[j];
}

model {
    mu ~ normal(0, 5);
    tau ~ cauchy(0, 5);
    theta_tilde ~ normal(0, 1);
    y ~ normal(theta, sigma);
}

generated quantities {
    vector[J] log_lik;
    vector[J] y_hat;
    for (j in 1:J) {
        log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
        y_hat[j] = normal_rng(theta[j], sigma[j]);
    }
}
"""

with open("./eight_school.stan", "w") as f:
    print(schools_code, file=f)

stan_file = "./eight_school.stan"
stan_model = CmdStanModel(stan_file=stan_file)
stan_model.compile()

eight_school_data = {
    "J": 8,
    "y": np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]),
    "sigma": np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]),
}

stan_fit = stan_model.sample(data=eight_school_data)

cmdstanpy_data = az.from_cmdstanpy(
    posterior=stan_fit,
    posterior_predictive="y_hat",
    observed_data={"y": eight_school_data["y"]},
    log_likelihood="log_lik",
    coords={"school": np.arange(eight_school_data["J"])},
    dims={
        "theta": ["school"],
        "y": ["school"],
        "log_lik": ["school"],
        "y_hat": ["school"],
        "theta_tilde": ["school"],
    },
)
cmdstanpy_data
# add name of school to az.InferenceData object (i know, here I'm using college names instead of high school ...)
school_name = ['Brown University','Columbia University','Cornell University','Dartmouth College',
         'Harvard University','University of Pennsylvania','Princeton University','Yale University']
_ = cmdstanpy_data.assign_coords({'school_name': ('school', school_name)}, inplace = True)

Now, our inference data has the original index plus the name of the school:

In [18]: cmdstanpy_data.posterior
Out[18]: 
<xarray.Dataset>
Dimensions:      (chain: 4, draw: 1000, school: 8)
Coordinates:
  * chain        (chain) int64 0 1 2 3
  * draw         (draw) int64 0 1 2 3 4 5 6 7 ... 993 994 995 996 997 998 999
  * school       (school) int64 0 1 2 3 4 5 6 7
    school_name  (school) <U26 'Brown University' ... 'Yale University'
Data variables:
    mu           (chain, draw) float64 4.479 6.726 7.398 ... 6.347 3.978 4.322
    tau          (chain, draw) float64 3.315 4.993 1.385 ... 0.5634 6.937 4.53
    theta_tilde  (chain, draw, school) float64 1.264 1.593 ... 1.437 0.7122
    theta        (chain, draw, school) float64 8.668 9.76 6.454 ... 10.83 7.548
Attributes:
    created_at:                 2021-03-22T18:06:42.546473
    arviz_version:              0.11.2
    inference_library:          cmdstanpy
    inference_library_version:  0.9.68

For example:

In [28]: cmdstanpy_data.posterior.theta.median(dim = ['chain', 'draw']).to_dataframe()
Out[28]: 
                       school_name     theta
school                                      
0                 Brown University  5.640080
1              Columbia University  4.909755
2               Cornell University  4.192675
3                Dartmouth College  4.606585
4               Harvard University  3.744870
5       University of Pennsylvania  4.168245
6             Princeton University  5.800960
7                  Yale University  4.675625

Relevant documentation or public examples

Please provide documentation, public examples, or any additional information which may be relevant to your question

OriolAbril commented 3 years ago

disclaimer: I started writing before seeing the edit, will try to come back later and be more specific. I don't think I need to add anything, you already got everything :smile: I'll just leave this here as proof that the same works for multidimensional coordinates.

I think there are no examples for now, and I think we don't even have example inferencedata objects available that have more more than 3 dims (counting chain and draw), which would be great to show this kind of things in the documentation. Do you have (and can share) or know of some public examples that we could use for that?

Getting multidimensional coordinates after having created the inferencedata should be possible and work basically like in pure xarray, getting multidimensional coordinates added directly at inferencedata creation is currently not possible (I think, but have not tried). Following this section in xarray docs, I believe the following should work. Assuming in inferencedata with posterior like:

<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
  * x               (x) int64 0 1...
  * y               (y) int64 0 1...
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Data variables:
    temperature     (chain, draw, x, y, time) float64 11.04 23.57 20.77 ... 6.301 9.61 15.91

then doing:

idata.posterior.coords["lat"] = (("x", "y"), lat)
idata.posterior.coords["lat"] = (("x", "y"), lat)
# or
idata.assign_coords({"lat": (("x", "y"), lat), "lon": (("x", "y"), lat)}, groups="posterior")
# would also work if done on idata.posterior instead of idata
# using idata is good for convenience itself and to add the same coords to multiple groups at the same time

should end up in:

<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * x               (x) int64 0 1...
  * y               (y) int64 0 1...
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Data variables:
    temperature     (chain, draw, x, y, time) float64 11.04 23.57 20.77 ... 6.301 9.61 15.91

I should also note that there is an ArviZ specific con to non indexing coordinates, which is that they are not available to labellers.

jburos commented 3 years ago

On this topic:

I should also note that there is an ArviZ specific con to non indexing coordinates, which is that they are not available to labellers.

Yes. Thank you .. I noticed this, but I haven't carried through all the implications of it just yet [i saw from issues that you are working on the doc for this]. I wanted to note that the xarray.swap_dims operation made it pretty easy to switch from one set of coordinates to another.

In my example above, this looked like:

cmdstanpy_data.posterior.theta.swap_dims(school = 'school_name')

I know of a few examples with multiple indices used in Stan development; many are written up in the Stan case studies .. for example the HMM model for basketball players.