arviz-devs / arviz-base

Base ArviZ features and converters
https://arviz-base.readthedocs.io/
Apache License 2.0

Support for nested models/deeper hierarchy #17

Open OriolAbril opened 2 months ago

OriolAbril commented 2 months ago

As of now, arviz-base is built on top of the DataTree structure, which supports an arbitrary number of levels in its hierarchy, but we are still following exactly the one-level-deep structure we had with InferenceData. The same goes for the stats and plots functionality.

Do we want to support nested models, and if so, how? There is already some "support" for nested pytrees in current ArviZ: the dict is flattened and tuples are used as variable names, which preserves the information about the original hierarchy. It is also kind of possible, IIRC, to use nested models from PyMC and have the output be InferenceData; instead of nested elements, the variable names take the form sub_group::var_name. Is this limited version of support enough? Or do we want to use the full flexibility of DataTree?
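To make the two flattening conventions mentioned above concrete, here is a minimal sketch in plain Python. The helper names `flatten_nested` and `join_names` and the example variable names are hypothetical, chosen only for illustration; this is not the code current ArviZ or PyMC actually run.

```python
def flatten_nested(samples, prefix=()):
    """Flatten a nested dict into {tuple_path: value}.

    The tuple keys keep the full path, so the original hierarchy
    can be recovered from the flat result.
    """
    flat = {}
    for name, value in samples.items():
        path = prefix + (name,)
        if isinstance(value, dict):
            # Recurse into sub-groups, extending the path.
            flat.update(flatten_nested(value, prefix=path))
        else:
            flat[path] = value
    return flat


def join_names(flat, sep="::"):
    """Turn tuple paths into 'sub_group::var_name' style strings."""
    return {sep.join(path): value for path, value in flat.items()}


# Hypothetical nested posterior (values stand in for arrays of draws).
nested = {
    "mu": [0.1, 0.2],
    "school": {"theta": [1.0, 1.5], "prior": {"tau": [0.5, 0.7]}},
}
flat = flatten_nested(nested)
print(list(flat))
# [('mu',), ('school', 'theta'), ('school', 'prior', 'tau')]
print(list(join_names(flat)))
# ['mu', 'school::theta', 'school::prior::tau']
```

Either form keeps a single level of variables, so existing selection by variable name keeps working, at the cost of encoding the hierarchy in the names.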

If we decide we want some support for that, we need models! We need either models or functions that can generate mock data with the hierarchies we'd like to support. With those, we'd have to define how we want plotting to work (plot all leaves? plot only the exact current group?) so we can work on making that possible.
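As a rough illustration of the two plotting policies in question, the sketch below builds a small mock hierarchy and lists which variables each policy would pick up. It uses nested dicts as a stand-in for a DataTree, and every name in it (`mock_nested_posterior`, `get_group`, `current_group_only`, `all_leaves`, the group and variable names) is hypothetical.

```python
import random


def mock_nested_posterior(n_chains=2, n_draws=5, seed=0):
    """Generate mock draws in a nested layout (dicts stand in for groups)."""
    rng = random.Random(seed)

    def draws():
        return [[rng.gauss(0, 1) for _ in range(n_draws)] for _ in range(n_chains)]

    return {
        "variables": {"mu": draws()},
        "children": {
            "school": {
                "variables": {"theta": draws()},
                "children": {
                    "classroom": {"variables": {"eta": draws()}, "children": {}},
                },
            },
        },
    }


def get_group(tree, path):
    """Walk down the hierarchy following a tuple of group names."""
    node = tree
    for name in path:
        node = node["children"][name]
    return node


def current_group_only(node):
    """Policy 1: only the variables stored directly in this group."""
    return list(node["variables"])


def all_leaves(node, prefix=""):
    """Policy 2: variables of this group and of every group below it."""
    names = [prefix + name for name in node["variables"]]
    for child_name, child in node["children"].items():
        names.extend(all_leaves(child, prefix + child_name + "/"))
    return names


tree = mock_nested_posterior()
school = get_group(tree, ("school",))
print(current_group_only(school))  # ['theta']
print(all_leaves(school))          # ['theta', 'classroom/eta']
print(all_leaves(tree))            # ['mu', 'school/theta', 'school/classroom/eta']
```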

My guess as of now is that this is probably more work than the benefit it will provide. A flattening similar to what current ArviZ does might be enough. We don't really get complaints asking for better support from PyMC users who nest models and get the :: variables. We also have support for flattening pytrees thanks to @ColCarroll, but even this is quite recent. I have also asked around on the "arviz-net" a couple of times to be sent/shown hierarchical structures for the posterior variables and ways people would like ArviZ to take advantage of them (doing things not possible with var_names and filter_vars, or at least not easy), and there is never much of an answer, hence my prior that this isn't generally a priority.

amaloney commented 2 months ago

Follow-up questions.

My personal opinion would be to create a tracking issue and fill it with ideas, even if they remain in the backlog for an extended period of time.

OriolAbril commented 2 months ago

> If we implement the full functionality of DataTree what kind of maintenance burden are we looking at for the InferenceData object?

InferenceData as a Python class is gone in arviz-base. We have a method in arviz to convert InferenceData to DataTree, and the files generated via InferenceData.to_netcdf/zarr can be read into either data structure without problems. Dropping InferenceData and using DataTree instead has the advantage of less maintenance burden and more features, which is always nice.

This issue is about how to handle new features. There are some we can immediately take advantage of, such as DataTree.to_zarr, which is way more flexible and powerful than InferenceData.to_zarr. Others, such as the possibility of nested groups, are there, but I am not sure they are worth using in ArviZ.

I see this a bit like when working on array functions, defining which shapes to support. The input type will be array, but not all arrays are valid inputs.

A somewhat related example: in arviz.rhat we check that the input has at least 2 dimensions and that the two dimensions being reduced (chain and draw) have at least length 2 for chain and 4 for draw. If the input is a DataArray, we don't limit the input to strictly 2d; any number of extra dimensions is supported and the computation is batched over them. However, if it is an array, then we limit it to strictly 2d, because the API is designed for xarray objects and there is no way to indicate positionally which axes to operate on.

In arviz-stats we have split the computation into array and DataArray levels, with the DataArray functions calling the array ones. So the array function has chain_axis and draw_axis arguments and does support n-d input now.
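To illustrate the "which shapes do we support" idea, here is a sketch of an array-level function that takes chain_axis and draw_axis, validates the input, and batches over any extra dimensions. It assumes NumPy and uses a deliberately simplified, classic Gelman-Rubin style ratio; the function name `basic_rhat` is hypothetical and this is not the rank-normalized computation arviz-stats actually performs.

```python
import numpy as np


def basic_rhat(ary, chain_axis=0, draw_axis=1):
    """Simplified R-hat on an n-d array, batched over all other axes."""
    ary = np.asarray(ary, dtype=float)
    if ary.ndim < 2:
        raise ValueError("input needs at least 2 dimensions (chain and draw)")
    # Move the reduced axes to the end; all leading axes become batch axes.
    ary = np.moveaxis(ary, (chain_axis, draw_axis), (-2, -1))
    n_chains, n_draws = ary.shape[-2:]
    if n_chains < 2 or n_draws < 4:
        raise ValueError("need at least 2 chains and 4 draws")

    chain_means = ary.mean(axis=-1)
    # Between-chain and within-chain variance estimates.
    between = n_draws * chain_means.var(axis=-1, ddof=1)
    within = ary.var(axis=-1, ddof=1).mean(axis=-1)
    var_est = (n_draws - 1) / n_draws * within + between / n_draws
    return np.sqrt(var_est / within)


rng = np.random.default_rng(0)
samples = rng.normal(size=(4, 100, 3))  # (chain, draw, extra batch dim)
rhat_default = basic_rhat(samples)      # one value per batch element

# Same data with the batch dimension first: only the axis arguments change.
moved = np.moveaxis(samples, 2, 0)      # (extra batch dim, chain, draw)
rhat_moved = basic_rhat(moved, chain_axis=1, draw_axis=2)
print(rhat_default.shape, np.allclose(rhat_default, rhat_moved))
```

A DataArray-level wrapper could then look up the positions of the chain and draw dimensions by name and pass them down, which is why the positional arguments only need to exist at the array level.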

amaloney commented 2 months ago

I see, and ty for the clarification.

> Dropping InferenceData and using DataTree instead has the advantage of less maintenance burden and more features, which is always nice.

I agree with this 100%. I do not see InferenceData objects anywhere in arviz_base, and it looks like we output to a DataTree object all the time.

> This issue is about how to handle new features...

We should definitely do something along the lines discussed in #18. Then we can add new features through the accessor object.

amaloney commented 2 months ago

If someone needs to access a nested object from the DataTree in order to get to an arviz accessor, then I think that is okay for a user to do, and will not be too much of a burden.