arviz-devs / arviz

Exploratory analysis of Bayesian models with Python
https://python.arviz.org
Apache License 2.0
1.58k stars 394 forks source link

add test for from_cmdstanpy to distinguish between vectors of length 1 and scalars #1646

Open jburos opened 3 years ago

jburos commented 3 years ago

EDIT (@OriolAbril): See https://github.com/arviz-devs/arviz/issues/1646#issuecomment-1125464400 for an up to date description of the issue.


Describe the bug When fitting a model using cmdstanpy and reading it into an arviz.InferenceData object, I sometimes end up with a vector parameter with length 1.

When this happens, the io_cmdstanpy code incorrectly ignores the last dimension, causing an error when I provide coord/dims for that parameter.

To Reproduce I have created a reproducible example [here]():

# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'
# %%

import cmdstanpy
import arviz as az

# %% [markdown]
# # Review Stan model code

# %%
model_code = '''
// from https://arviz-devs.github.io/arviz/notebooks/InferenceDataCookbook.html#From-CmdStanPy
data {
    int<lower=0> J;
    real y[J];
    real<lower=0> sigma[J];
    int<lower = 0, upper=1> prior_only;
}
parameters {
    real mu;
    real<lower=0> tau;
    real theta_tilde[J];
}
transformed parameters {
    real theta[J];
    for (j in 1:J)
        theta[j] = mu + tau * theta_tilde[j];
}
model {
    mu ~ normal(0, 5);
    tau ~ cauchy(0, 5);
    theta_tilde ~ normal(0, 1);
    if (prior_only == 0) {
        y ~ normal(theta, sigma);
    }
}
generated quantities {
    vector[J] log_lik;
    vector[J] y_hat;
    for (j in 1:J) {
        log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
        y_hat[j] = normal_rng(theta[j], sigma[j]);
    }
}
'''

# %% [markdown]
# # Fitting the model using cmdstanpy

# %%
stan_file = "eight_schools.stan"
with open(stan_file, 'x') as f:
    f.write(model_code)
model = cmdstanpy.CmdStanModel(stan_file = stan_file)

# %%
stan_data8 = dict(
            J=8,
            y=[28, 8, -3, 7, -1, 1, 18, 12],
            sigma=[15, 10, 16, 11, 9, 11, 10, 18],
            prior_only=0,
        )
stan_data1 = dict(
            J=1,
            y=[28],
            sigma=[15],
            prior_only=0,
        )
coords8 = dict(school=["A", "B", "C", "D", "E", "F", "G", "H"])
coords1 = dict(school=["a"])
dims= dict(
        theta=["school"],
        y=["school"],
        log_lik=["school"],
        y_hat=["school"],
        theta_tilde=["school"],
    )
# %%
fit8 = model.sample(data=stan_data8)
fit1 = model.sample(data=stan_data1)

# %% [markdown]
# ## Construct arviz.InferenceData object

# %%
idata8 = az.from_cmdstanpy(fit8, coords = coords8, dims = dims,
    posterior_predictive = ["y_hat"],
    prior_predictive = ['y_hat'],
    log_likelihood = ["log_lik"])

## fails!
idata1 = az.from_cmdstanpy(fit1, coords = coords1, dims = dims,
    posterior_predictive = ["y_hat"],
    prior_predictive = ['y_hat'],
    log_likelihood = ["log_lik"])

The error looks like this:

/Users/jburos/.local/share/virtualenvs/workflow2-PgZLfFHB/src/arviz/arviz/data/base.py:129: UserWarning: In variable theta_tilde, there are more dims (1) given than exist (0). Passed array should have shape (chain,draw, *shape)
  warnings.warn(
Traceback (most recent call last):
  File "/Users/jburos/projects/workflow2/docs/test_arviz_cmdstanpy.py", line 93, in <module>
    idata1 = az.from_cmdstanpy(fit1, coords = coords1, dims = dims,
  File "/Users/jburos/.local/share/virtualenvs/workflow2-PgZLfFHB/src/arviz/arviz/data/io_cmdstanpy.py", line 763, in from_cmdstanpy
    return CmdStanPyConverter(
  File "/Users/jburos/.local/share/virtualenvs/workflow2-PgZLfFHB/src/arviz/arviz/data/io_cmdstanpy.py", line 408, in to_inference_data
    "posterior": self.posterior_to_xarray(),
  File "/Users/jburos/.local/share/virtualenvs/workflow2-PgZLfFHB/src/arviz/arviz/data/base.py", line 64, in wrapped
    return func(cls)
  File "/Users/jburos/.local/share/virtualenvs/workflow2-PgZLfFHB/src/arviz/arviz/data/io_cmdstanpy.py", line 111, in posterior_to_xarray
    dict_to_dataset(
  File "/Users/jburos/.local/share/virtualenvs/workflow2-PgZLfFHB/src/arviz/arviz/data/base.py", line 302, in dict_to_dataset
    data_vars[key] = numpy_to_data_array(
  File "/Users/jburos/.local/share/virtualenvs/workflow2-PgZLfFHB/src/arviz/arviz/data/base.py", line 249, in numpy_to_data_array
    return xr.DataArray(ary, coords=coords, dims=dims)
  File "/Users/jburos/.local/share/virtualenvs/workflow2-PgZLfFHB/lib/python3.8/site-packages/xarray/core/dataarray.py", line 409, in __init__
    coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
  File "/Users/jburos/.local/share/virtualenvs/workflow2-PgZLfFHB/lib/python3.8/site-packages/xarray/core/dataarray.py", line 126, in _infer_coords_and_dims
    raise ValueError(
ValueError: different number of dimensions on data and dims: 2 vs 3
deleting tmpfiles dir: /var/folders/18/st_852gs77d86c3srp0jgrnw0000gn/T/tmpfmrn2a4u
done

I think this is happening in this line: https://github.com/arviz-devs/arviz/blob/main/arviz/data/io_cmdstanpy.py#L594-L595

Expected behavior The from_cmdstanpy method should ideally know (from the dims) that this is a 1d vector or array, and label it as such.

Additional context

arviz version: 0.11.2 (github hash: 23e14fb645431014e73ba35f940b613850d59f30) cmdstanpy version: 0.9.68 python version: 3.8.7 (default, Feb 3 2021, 06:31:03) \n[Clang 12.0.0 (clang-1200.0.32.29)]

jburos commented 3 years ago

FWIW, this example works fine if these lines are removed: https://github.com/arviz-devs/arviz/blob/main/arviz/data/io_cmdstanpy.py#L594-L595

I don't know if it will break anything else, but I can make a quick PR ..

ahartikainen commented 3 years ago

There was probably some reason for the logic.

code should check if the variable is 'scalar' or nD object.

@mitzimorris @OriolAbril

OriolAbril commented 3 years ago

I think we added that to distinguish between multidimensional arrays and vectors to fix the previous issue where everything was flattened. We need some tests to check all possible cases, scalar, 1d array of lenght 1, 1d array of length n and nd array to make sure removing this doesn't break anything.

OriolAbril commented 3 years ago

And we should probably test on length 0 arrays for sampling wrappers and reloo

bmorris3 commented 2 years ago

And we should probably test on length 0 arrays for sampling wrappers and reloo

This is a problem I'm running into with reloo since my khats is scalar. Is there a simple fix?

jburos commented 2 years ago

@bmorris3 This hasn't been integrated into the main branch due to lack of testing, but I made this change which resolved the problem for me: https://github.com/arviz-devs/arviz/compare/main...jburos:fix-cmdstanpy-length1?expand=1

I'd be curious if this fixes your issue with reloo.

bmorris3 commented 2 years ago

Thanks @jburos, I've made some tweaks that get the reloo function running when the khats is scalar, but I may have come across a bigger problem: my (Gaussian process) model produces a single log_prob rather than element-wise ones. Not sure if this is a showstopper for this arviz implementation of LOO or not.

OriolAbril commented 2 years ago

my (Gaussian process) model produces a single log_prob rather than element-wise ones

I don't think a scalar khat makes any sense, loo and waic require the pointwise log likelihood values so if you have the aggregated log likelihood term you can't really use them (even if technically they do take an array of log probabilities so ArviZ's function will run without problem one way or another). Moreover, running reloo on a single observation model instead of looping over the khats and refitting for the bad ones only should be exactly the same as refitting the model by itself, or to refitting the model with no data at all? :thinking:

IIRC, https://arviz-devs.github.io/arviz/user_guide/pystan_refitting.html uses what I mean by these length 0 arrays (when fitting the whole model, the _ex data variables are arrays of length 0). I'd recommend trying to run that example with cmdstanpy instead of pystan to see if the length 0 arrays are supported (not being tested doesn't necessarily mean not working) so that we can disentangle issues due to the scalar khat from issues with the converter and with reloo itself.

This hasn't been integrated into the main branch due to lack of testing, but I made this change which resolved the problem for me: https://github.com/arviz-devs/arviz/compare/main...jburos:fix-cmdstanpy-length1?expand=1

I wanted to give a quick look at the changes and I seem to have opened a PR from your fork @jburos :sweat_smile: . If you want and have time we'll be happy to review and help with testing but I think you should be the one to open the PR. Otherwise I think it's fine to keep the comment here in case someone else wants to take over, add the test and submit the PR themelves.

jburos commented 2 years ago

Noting here, in response to the current PR, that this bug persists with the latest version of cmdstanpy (1.0.1), and cmdstan-2.29.2.

Error output:

In [6]: idata1 = az.from_cmdstanpy(fit1, coords = coords1, dims = dims,
   ...:     posterior_predictive = ["y_hat"],
   ...:     prior_predictive = ['y_hat'],
   ...:     log_likelihood = ["log_lik"])
/Users/jburos/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/base.py:130: UserWarning: In variable theta_tilde, there are more dims (1) given than exist (0). Passed array should have shape (chain,draw, *shape)
  warnings.warn(
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 idata1 = az.from_cmdstanpy(fit1, coords = coords1, dims = dims,
      2     posterior_predictive = ["y_hat"],
      3     prior_predictive = ['y_hat'],
      4     log_likelihood = ["log_lik"])

File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/io_cmdstanpy.py:834, in from_cmdstanpy(posterior, posterior_predictive, predictions, prior, prior_predictive, observed_data, constant_data, predictions_constant_data, log_likelihood, index_origin, coords, dims, save_warmup, dtypes)
    768 def from_cmdstanpy(
    769     posterior=None,
    770     *,
   (...)
    783     dtypes=None,
    784 ):
    785     """Convert CmdStanPy data into an InferenceData object.
    786
    787     For a usage example read the
   (...)
    832     InferenceData object
    833     """
--> 834     return CmdStanPyConverter(
    835         posterior=posterior,
    836         posterior_predictive=posterior_predictive,
    837         predictions=predictions,
    838         prior=prior,
    839         prior_predictive=prior_predictive,
    840         observed_data=observed_data,
    841         constant_data=constant_data,
    842         predictions_constant_data=predictions_constant_data,
    843         log_likelihood=log_likelihood,
    844         index_origin=index_origin,
    845         coords=coords,
    846         dims=dims,
    847         save_warmup=save_warmup,
    848         dtypes=dtypes,
    849     ).to_inference_data()

File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/io_cmdstanpy.py:456, in CmdStanPyConverter.to_inference_data(self)
    446 def to_inference_data(self):
    447     """Convert all available data to an InferenceData object.
    448
    449     Note that if groups can not be created (i.e., there is no `output`, so
    450     the `posterior` and `sample_stats` can not be extracted), then the InferenceData
    451     will not have those groups.
    452     """
    453     return InferenceData(
    454         save_warmup=self.save_warmup,
    455         **{
--> 456             "posterior": self.posterior_to_xarray(),
    457             "sample_stats": self.sample_stats_to_xarray(),
    458             "posterior_predictive": self.posterior_predictive_to_xarray(),
    459             "predictions": self.predictions_to_xarray(),
    460             "prior": self.prior_to_xarray(),
    461             "sample_stats_prior": self.sample_stats_prior_to_xarray(),
    462             "prior_predictive": self.prior_predictive_to_xarray(),
    463             "observed_data": self.observed_data_to_xarray(),
    464             "constant_data": self.constant_data_to_xarray(),
    465             "predictions_constant_data": self.predictions_constant_data_to_xarray(),
    466             "log_likelihood": self.log_likelihood_to_xarray(),
    467         },
    468     )

File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/base.py:65, in requires.__call__.<locals>.wrapped(cls)
     63     if all((getattr(cls, prop_i) is None for prop_i in prop)):
     64         return None
---> 65 return func(cls)

File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/io_cmdstanpy.py:139, in CmdStanPyConverter.posterior_to_xarray(self)
    135 dims = deepcopy(self.dims) if self.dims is not None else {}
    136 coords = deepcopy(self.coords) if self.coords is not None else {}
    138 return (
--> 139     dict_to_dataset(
    140         data,
    141         library=self.cmdstanpy,
    142         coords=coords,
    143         dims=dims,
    144         index_origin=self.index_origin,
    145     ),
    146     dict_to_dataset(
    147         data_warmup,
    148         library=self.cmdstanpy,
    149         coords=coords,
    150         dims=dims,
    151         index_origin=self.index_origin,
    152     ),
    153 )

File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/base.py:307, in dict_to_dataset(data, attrs, library, coords, dims, default_dims, index_origin, skip_event_dims)
    305 data_vars = {}
    306 for key, values in data.items():
--> 307     data_vars[key] = numpy_to_data_array(
    308         values,
    309         var_name=key,
    310         coords=coords,
    311         dims=dims.get(key),
    312         default_dims=default_dims,
    313         index_origin=index_origin,
    314         skip_event_dims=skip_event_dims,
    315     )
    316 return xr.Dataset(data_vars=data_vars, attrs=make_attrs(attrs=attrs, library=library))

File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/arviz/data/base.py:254, in numpy_to_data_array(ary, var_name, coords, dims, default_dims, index_origin, skip_event_dims)
    252 # filter coords based on the dims
    253 coords = {key: xr.IndexVariable((key,), data=np.asarray(coords[key])) for key in dims}
--> 254 return xr.DataArray(ary, coords=coords, dims=dims)

File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/xarray/core/dataarray.py:402, in DataArray.__init__(self, data, coords, dims, name, attrs, indexes, fastpath)
    400 data = _check_data_shape(data, coords, dims)
    401 data = as_compatible_data(data)
--> 402 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
    403 variable = Variable(dims, data, attrs, fastpath=True)
    404 indexes = dict(
    405     _extract_indexes_from_coords(coords)
    406 )  # needed for to_dataset

File ~/.local/share/virtualenvs/test_arviz-tjV-RA7Q/lib/python3.8/site-packages/xarray/core/dataarray.py:121, in _infer_coords_and_dims(shape, coords, dims)
    119     dims = tuple(dims)
    120 elif len(dims) != len(shape):
--> 121     raise ValueError(
    122         "different number of dimensions on data "
    123         f"and dims: {len(shape)} vs {len(dims)}"
    124     )
    125 else:
    126     for d in dims:

ValueError: different number of dimensions on data and dims: 2 vs 3

In [7]: cmdstanpy.__version__
Out[7]: '1.0.1'
OriolAbril commented 2 years ago

The issue should be now fixed (thanks to #2023) but we should still add a test for this situation.