Open OriolAbril opened 1 month ago
Follow-up questions.
If we implement the full functionality of `DataTree`, what kind of maintenance burden are we looking at for the `InferenceData` object?
My personal opinion would be to create a tracking issue and fill it with ideas, even if they remain in the backlog for an extended period of time.
If we implement the full functionality of DataTree what kind of maintenance burden are we looking at for the InferenceData object?
InferenceData as a Python class is gone in arviz-base. We have a method in arviz to convert InferenceData to DataTree, and the files generated via `InferenceData.to_netcdf/zarr` can be read into either data structure without problem. Dropping `InferenceData` and using `DataTree` instead has the advantage of less maintenance burden and more features, which is always nice.
This issue is about how to handle new features. There are some we can take advantage of immediately, such as DataTree.to_zarr, which is much more flexible and powerful than InferenceData.to_zarr. Others, such as the possibility of nested groups, are there, but I am not sure they are worth using in ArviZ.
I see this as a bit like defining which shapes to support when working on array functions. The input type will be array, but not all arrays are valid inputs.
A somewhat related example: in arviz.rhat we check that the input has at least 2 dimensions and that the two dimensions being reduced (chain and draw) have length at least 2 for chain and 4 for draw. If the input is a DataArray, we don't limit it to strictly 2D: any number of extra dimensions is supported and the computation is batched over them. However, if it is an array, we limit it to strictly 2D, because the API is designed for xarray objects and there is no way to indicate positionally which axes to operate on.
In arviz-stats we have split the computation into array and DataArray levels, with the DataArray functions calling the array ones. So the array function has `chain_axis` and `draw_axis` arguments and now supports n-d input.
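To make the shape rules above concrete, here is a minimal sketch of that kind of validation on a plain shape tuple. The helper name `check_rhat_shape` and its defaults are hypothetical, chosen only for illustration; this is not the code in arviz or arviz-stats, just the rule as described (at least 2 dimensions, at least 2 chains, at least 4 draws, with the axes given positionally).

```python
def _normalize_axis(axis, ndim):
    """Map a possibly negative axis to its non-negative position."""
    if not -ndim <= axis < ndim:
        raise ValueError(f"axis {axis} is out of bounds for {ndim} dimensions")
    return axis % ndim


def check_rhat_shape(shape, chain_axis=0, draw_axis=1, min_chains=2, min_draws=4):
    """Return True if an array with this shape is an acceptable input.

    Hypothetical helper: extra (batch) dimensions are allowed as long as
    the chain and draw axes are identified and are long enough.
    """
    ndim = len(shape)
    if ndim < 2:
        return False
    chain_axis = _normalize_axis(chain_axis, ndim)
    draw_axis = _normalize_axis(draw_axis, ndim)
    if chain_axis == draw_axis:
        raise ValueError("chain_axis and draw_axis must be different axes")
    return shape[chain_axis] >= min_chains and shape[draw_axis] >= min_draws


print(check_rhat_shape((4, 1000)))                               # plain 2D input
print(check_rhat_shape((8, 4, 1000), chain_axis=1, draw_axis=-1))  # batched input
print(check_rhat_shape((1, 1000)))                               # a single chain
```

Passing the axes explicitly is what lets the array-level function accept n-d input; the DataArray level can look the positions up from the dimension names and forward them.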
I see, and ty for the clarification.
Dropping `InferenceData` and using `DataTree` instead has the advantage of less maintenance burden and more features, which is always nice.
I agree with this 100%. I do not see `InferenceData` objects anywhere in `arviz_base`, and it looks like we output to a `DataTree` object all the time.
This issue is about how to handle new features...
We should definitely do something as discussed in #18. Then we can add new features through the accessor object.
If someone needs to access a nested object from the `DataTree` in order to get to an arviz accessor, then I think that is okay for a user to do, and will not be too much of a burden.
As of now, arviz-base is built on top of the `DataTree` structure, which supports an arbitrary number of levels in its hierarchy, but we are still following 100% the 1-level-depth structure we had with `InferenceData`. The same goes for stats and plots functionality.
Do we want to support nested models, and if so, how? There is already "support" for nested pytrees in current ArviZ, by flattening the dict and using tuples as variable names, which preserves the information about the original hierarchy. It is also kind of possible to use nested models from PyMC, IIRC, and have the output be InferenceData; instead of nested elements, the variable name uses `sub_group::var_name`. Is this limited version of support enough? Or do we want to use the full flexibility of `DataTree`?
If we decide we want some support for that, we need models! We need either models or functions that can generate mock data with the hierarchies we'd like to support. With that, we'd have to define how we want plotting to work (plot all leaves? plot only the exact current group?) so we can work on making that possible.
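As a rough illustration of the two plotting options raised here, the sketch below uses a plain nested dict as a stand-in for a hierarchical group (an assumption made to keep the example self-contained; it is not a `DataTree`). The function names `current_group_vars` and `all_leaf_vars` and the mock data are hypothetical.

```python
def current_group_vars(group):
    """Names of variables stored directly in this group, without descending."""
    return [name for name, value in group.items() if not isinstance(value, dict)]


def all_leaf_vars(group, prefix=()):
    """Tuple paths of every variable in this group and all of its subgroups."""
    paths = []
    for name, value in group.items():
        if isinstance(value, dict):
            # A dict value is treated as a subgroup: recurse and extend the path.
            paths.extend(all_leaf_vars(value, prefix + (name,)))
        else:
            paths.append(prefix + (name,))
    return paths


# Mock hierarchy: one top-level variable and one nested submodel.
posterior = {
    "mu": [0.1, 0.2],
    "regression": {"slope": [1.0, 1.1], "intercept": [0.0, 0.1]},
}

print(current_group_vars(posterior))
print(all_leaf_vars(posterior))
```

"Plot only the exact current group" would act on the first list, while "plot all leaves" would act on the second, where the tuple paths keep the information about where each variable came from.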
My guess as of now is that this is probably more work than the benefit it will provide. A flattening similar to what current ArviZ does might be enough. We don't really get complaints asking for better support from PyMC users who nest models and get the `::` variables. There is support for flattening pytrees thanks to @ColCarroll, but even this is quite recent. I have also asked around on the "arviz-net" a couple of times to be sent/shown hierarchical structures for the posterior variables and ways people would like ArviZ to take advantage of them (doing things not possible with `var_names` and `filter_vars`, or at least not easy), and there is never much of an answer, hence my prior that this isn't generally a priority.