`az.from_emcee` fails to handle structured array of blobs

jwhhh commented 2 years ago

Describe the bug

When using an emcee.EnsembleSampler with custom dtype defined (see here), EnsembleSampler.get_blobs() returns a numpy structured array that has shape (nsteps, nwalkers). The shape information doesn't reflect the number of blobs, so EmceeConverter.blobs_to_dict() treats it as having only one blobs even when there are more.

To Reproduce

Running the following script:

# mostly copied from: https://emcee.readthedocs.io/en/stable/user/blobs/

import emcee
import numpy as np
import arviz as az

def log_prior(params):
    return -0.5 * np.sum(params**2)

def log_like(params):
    return -0.5 * np.sum((params / 0.1)**2)

def log_prob(params):
    lp = log_prior(params)
    if not np.isfinite(lp):
        return -np.inf, -np.inf, -np.inf
    ll = log_like(params)
    if not np.isfinite(ll):
        return lp, -np.inf, -np.inf
    return lp + ll, lp, np.mean(params)

coords = np.random.randn(32, 3)
nwalkers, ndim = coords.shape

# Here are the important lines
dtype = [("log_prior", float), ("mean", float)]
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob,
                                blobs_dtype=dtype)

sampler.run_mcmc(coords, 100)

blobs = sampler.get_blobs()

idata = az.from_emcee(
    sampler,
    blob_names=["log_prior", "mean"],
    blob_groups=["log_prior", "mean"],
)

yields the following error:

Traceback (most recent call last):
  File "/path/to/test.py", line 34, in <module>
    idata = az.from_emcee(
  File "/path/to/home/opt/miniforge3/envs/cofi_dev/lib/python3.10/site-packages/arviz/data/io_emcee.py", line 320, in from_emcee
    ).to_inference_data()
  File "/path/to/home/opt/miniforge3/envs/cofi_dev/lib/python3.10/site-packages/arviz/data/io_emcee.py", line 251, in to_inference_data
    blobs_dict = self.blobs_to_dict()
  File "/path/to/home/opt/miniforge3/envs/cofi_dev/lib/python3.10/site-packages/arviz/data/io_emcee.py", line 205, in blobs_to_dict
    raise ValueError(
ValueError: Incorrect number of blob names. Expected 1, found 2

Expected behavior

The blobs should be 2 dimensions. So it shouldn't expect only 1.

Additional context

MacOS v11.5.1 python version: 3.10.4 arviz version: v0.12.1, conda-forge emcee version: v3.1.2, conda-forge numpy version: v1.22.3, conda-forge

OriolAbril commented 2 years ago

Does it work if you don't use the blobs_dtype argument and leave everything else unchanged?

Do you have any ideas on how to improve this? I am not very familiar with custom dtypes. What do you expect should happen in that case? What does xarray do when given a numpy array with custom dtypes?

jwhhh commented 2 years ago

Thanks for replying. It works perfectly well if I don't use blobs_dtype.

I played around it a bit just now, and believe that we can add some checking around this line. Something like:

if len(blobs.dtype) == 0:
    # continue as usual
    nblobs, nwalkers, ndraws, *_ = blobs.shape
else:
    nwalkers, ndraws, *_ = blobs.shape
    nblobs = len(blobs.dtype)

I'm expecting it to detect the dimension, and the names set by az.from_emcee(sampler, blob_names=[...], blob_groups=[...] to overwrite the dtype names set from emcee.EnsembleSampler(blobs_dtype=...). However I'm not sure about the types on the xarray side. It might be complicated if we have to deal with types here. What I was expecting is only that it can recognise that there can be more than 1 blobs attribute in this case.

OriolAbril commented 2 years ago

I have played a little bit with xarray and custom dtypes:

>>> a = np.array([(-10, .2), (-15, .4)], dtype=[("log_prior", float), ("mean", float)])

>>> a
array([(-10., 0.2), (-15., 0.4)],
      dtype=[('log_prior', '<f8'), ('mean', '<f8')])

>>> a.shape
(2,)

>>> xr.DataArray(a)
<xarray.DataArray (dim_0: 2)>
array([(-10., 0.2), (-15., 0.4)],
      dtype=[('log_prior', '<f8'), ('mean', '<f8')])
Dimensions without coordinates: dim_0

xarray closely follows numpy. The array a has shape (2,) in both cases, so it would be a complete change in behaviour for ArviZ to take that as a (2, 2) array. According to both numpy and xarray there is no dimension to be detected. Moreover, doing this "expansion" of custom dtype to a dimension would also force us to copy the data into the new (here 2x2) array with float dtype, something which will not always be possible. Numpy arrays have the same dtype on all positions, but with custom dtypes one can use [("log_prior", float), ("sum", int)] (and other combinations), here the int can be promoted to float but we are changing important info of the data and it might not always be possible to do this either.

I don't think we should change how this works with regards to dtype. If you think it could help we could add a note to https://python.arviz.org/en/latest/getting_started/ConversionGuideEmcee.html explaining the behaviour with blobs using custom dtypes.

arviz-devs / arviz

`az.from_emcee` fails to handle structured array of blobs #2036