arviz-devs / arviz-stats

Statistical computation and diagnostics for ArviZ.
https://arviz-stats.readthedocs.io
Apache License 2.0

Default behaviour of dimensions that are conceptually common yet independent #3

Open OriolAbril opened 5 months ago

OriolAbril commented 5 months ago

With xarray inputs, some dimensions/coordinates represent the same concept, so it would be natural to give them the same name, but they are actually independent between variables. They therefore need different names within a Dataset so that xarray does not break when their values differ. How should these behave?

Examples of such dimensions/coordinates

One such example is the mode dimension in hdi. If we have a dataset with 3 variables it is perfectly possible for the 1st to have 4 modes, the 2nd to have only 1 and the 3rd to have 2 modes.
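A minimal sketch of why a shared name fails, using placeholder arrays rather than real hdi output. The `bound` dimension and the array contents are assumptions made for illustration only:

```python
import numpy as np
import xarray as xr

# Placeholder per-variable results: mu has 4 modes, theta has 2.
# "bound" is a hypothetical dimension holding the interval limits.
mu_modes = np.zeros((4, 2))
theta_modes = np.zeros((2, 2))

# Sharing the name "mode" fails because the two lengths conflict.
try:
    xr.Dataset(
        {
            "mu": (("mode", "bound"), mu_modes),
            "theta": (("mode", "bound"), theta_modes),
        }
    )
    shared_name_ok = True
except ValueError:
    shared_name_ok = False

# Variable-specific names let both results live in one Dataset.
ds = xr.Dataset(
    {
        "mu": (("mu_mode", "bound"), mu_modes),
        "theta": (("theta_mode", "bound"), theta_modes),
    }
)
```

Here `shared_name_ok` ends up `False`, while the second Dataset is built without complaint.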

For coordinates, this happens with most plotting-related functions. In general the input is the number of points (or an equivalent), but some returned quantities vary between variables. For example, we return the bandwidth used by the kde as a coordinate, since it can differ between variables and also between coordinate values, so simple attrs can't be used. With a variable mu of shape (chain, draw) we get a scalar mu_bw coordinate; with a variable theta of shape (chain, draw, hierarchy, group) we get a theta_bw coordinate with shape (hierarchy, group).
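To make the shapes concrete, here is a rough sketch built from random placeholder values, not from an actual kde call. The `kde_dim` dimension name, the grid size and the bandwidth values are all assumptions:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
n_points = 5  # hypothetical number of kde grid points

# mu only has sample dims (chain, draw), so its bandwidth is a scalar.
mu_kde = xr.DataArray(
    rng.random(n_points), dims=["kde_dim"], name="mu"
).assign_coords(mu_bw=0.3)

# theta also has (hierarchy, group), so there is one bandwidth per cell.
theta_kde = xr.DataArray(
    rng.random((2, 3, n_points)),
    dims=["hierarchy", "group", "kde_dim"],
    name="theta",
).assign_coords(theta_bw=(("hierarchy", "group"), rng.random((2, 3))))

# Distinct coordinate names allow both results to share one Dataset.
ds = xr.merge([mu_kde, theta_kde])
```

If both coordinates were simply called `bw`, the scalar and the (hierarchy, group) array would compete for the same name in the merged Dataset.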

Behaviour options that come to mind

The two main options that come to mind are:

  1. (currently implemented) always prepend these dimensions/coordinates with the variable name they refer to. Datasets require named variables so that makes the function always safe to use on them.
  2. Add an argument to toggle this behaviour on/off. In which case we'd have to choose if prepending should be the default or not.

Prepending needs to happen at the DataArray level, so the only info we have available is its name. For DataArrays themselves, the name is optional, but any DataArray stored within a Dataset must be named. The current implementation prepends the variable name whenever the DataArray is named.
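The name-based rule can be sketched as follows. `result_dim_name` is a hypothetical helper written for this illustration, not a function from arviz-stats:

```python
import xarray as xr


def result_dim_name(da: xr.DataArray, base: str) -> str:
    """Hypothetical helper: prefix `base` with the DataArray name, if it has one."""
    return base if da.name is None else f"{da.name}_{base}"


unnamed = xr.DataArray([0.1, 0.2, 0.3])  # standalone, no name
named = unnamed.rename("mu")  # same data, now named like a Dataset variable

unnamed_dim = result_dim_name(unnamed, "mode")
named_dim = result_dim_name(named, "mode")
```

With this rule, only unnamed DataArrays get the bare dimension name.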

That means that idata.posterior.mu.azstats.hdi() will get a mu_mode dimension name instead of simply mode. In this case (or in single-variable datasets) it is not necessary to prepend the variable name, hence the option of an argument to toggle the behaviour. The argument could be off by default and toggled on by the dataset accessor, for example, or toggled on only for datasets with more than one variable.
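One way the toggle could look, as a sketch only. `fake_hdi`, `fake_hdi_dataset` and the `prepend` argument are hypothetical stand-ins that return zeros, not the real azstats accessor API:

```python
import numpy as np
import xarray as xr


def fake_hdi(da, n_modes, prepend=False):
    """Stand-in for a multimodal hdi: zeros along a (possibly prefixed) mode dim."""
    use_prefix = prepend and da.name is not None
    dim = f"{da.name}_mode" if use_prefix else "mode"
    return xr.DataArray(np.zeros(n_modes), dims=[dim], name=da.name)


def fake_hdi_dataset(ds, n_modes):
    """Dataset-level wrapper: prefix only when there is more than one variable."""
    prepend = len(ds.data_vars) > 1
    return xr.Dataset(
        {
            name: fake_hdi(da, n_modes[name], prepend=prepend)
            for name, da in ds.data_vars.items()
        }
    )


posterior = xr.Dataset(
    {"mu": ("draw", np.zeros(10)), "theta": ("draw", np.ones(10))}
)

single = fake_hdi(posterior["mu"], 4)  # dims: ("mode",)
multi = fake_hdi_dataset(posterior, {"mu": 4, "theta": 2})  # mu_mode, theta_mode
one_var = fake_hdi_dataset(posterior[["mu"]], {"mu": 4})  # plain "mode"
```

The trade-off is that the output dimension name then depends on how the function was called, which is the consistency concern raised in this issue.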

I personally think in the long run it is probably best to hardcode the prepending behaviour so things are consistent. But having it always on for datasets and always off for dataarrays would also be consistent.