@cscherrer you've thought about this for SampleChains. Correct me if I'm wrong, but Soss/Tilde indeed represent draws as parameter vectors for sampling with HMC, but after applying transformations to constrained samples, draws are represented as NamedTuples of arrays. Since SampleChains is geared towards sampling using iterators, it chooses a memory layout that for a given parameter places all draws of that parameter from a single chain within a single array. For univariate parameters, this would motivate the memory layout (draw, param_dims..., chain), so this doesn't prefer either of the 2 dimension orderings discussed here. Do I have that right?
No dimensions should be "special"; that is the main reason for named dimensions. ArviZ Python just uses this order rule for from_xyz (one needed to be defined so everyone is on the same page).
As in Python xarray, slicing/reordering dimensions (in a view) should be doable if needed.
Memory layout is a good reason to use a specific order.
The underlying data structure for a TupleVector (which SampleChains uses under the hood) is a (possibly nested) NamedTuple, where each leaf is an ArrayOfArrays (from the package of similar name). This is then given an interface as a Vector of NamedTuples, similar to the approach used by StructArrays.
As for why I took this approach... Say you have a model that draws from an LKJ prior. That might look something like:
```julia
julia> t = as(LKJCholesky(4))
CorrCholesky(4)

julia> TV.dimension(t)
6

julia> x = randn(6)
6-element Vector{Float64}:
 -0.634943
  1.18258
 -0.742168
  0.13168
  0.0724018
 -0.162739

julia> y = TV.transform(t, x)
Cholesky{Float64, UpperTriangular{Float64, Matrix{Float64}}}
U factor:
4×4 UpperTriangular{Float64, Matrix{Float64}}:
 1.0  -0.307219   0.530824   0.0657448
  ⋅    0.951639  -0.300805   0.0361068
  ⋅     ⋅          0.792302  -0.0809617
  ⋅     ⋅           ⋅         0.993891
```
For evaluating the log-density, x by itself is not enough. Our model likely requires BLAS operations on y. So if we only used x, we'd constantly be allocating for y anyway.
I'd much rather have x used only for temporary storage during sampling, and represent results as they'll actually be used.
I think in our next iteration this will be something like (for a fixed number of samples)
Oh, and about iterators... We don't only use ArraysOfArrays.jl, but also ElasticArrays.jl. This makes it so each leaf allows a push! operation, even if the underlying data is a non-vector array. push! is then propagated up to the TupleVector level, so we can extend a row at a time if we want.
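A minimal sketch of that mechanism (hypothetical sizes; nestedview and flatview are from ArraysOfArrays.jl):

```julia
using ArraysOfArrays, ElasticArrays

# Back a "vector of 4×4 matrices" with an ElasticArray whose last dimension can grow.
leaf = nestedview(ElasticArray{Float64}(undef, 4, 4, 0), 2)

# Each push! appends one 4×4 matrix by growing the flat storage in place.
push!(leaf, randn(4, 4))
push!(leaf, randn(4, 4))

length(leaf)          # 2
size(flatview(leaf))  # (4, 4, 2): one contiguous block of memory
```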
@oschulz has also thought a lot about this, so he might have more insights
@cscherrer that makes sense, yes. So IIUC, in the LKJCholesky case, you would store the draws as a vector of Cholesky objects. Each of these wraps an array, so in terms of storage it's not so different from a vector of addresses to where arrays are stored, and these arrays I guess can be anywhere in memory. To me this makes a lot of sense for flexibly working with Julia types.

This certainly could fit into InferenceData. There's nothing preventing a (draw, chain) matrix of Cholesky objects (or any arbitrary object), but we would not know how to write generic functions for analysis of such an array. For InferenceData, we're concerned primarily with steps in Bayesian workflow: plotting, statistics, diagnostics, and (de)serialization (for deposition, communication, and reproducibility). For most of these, we need access to marginals, and once you have a way of getting marginals, everything fits into a multidimensional array structure with numeric eltypes, which is perfect for serialization to data structures that can be loaded by other languages with no knowledge of the Julian origins of the data. Since we can attach arbitrary metadata to a parameter, we may at some point add to the schema a standard approach for PPLs to map an array of marginals to the Julia representation the PPL prefers, but I'm not certain how useful this would be unless that PPL was also using InferenceData as a native storage type.
> So IIUC, in the LKJCholesky case, you would store the draws as a vector of Cholesky objects. Each of these wraps an array, so in terms of storage it's not so different from a vector of addresses to where arrays are stored, and these arrays I guess can be anywhere in memory.
Currently using ArraysOfArrays, a vector of Array{T,N}s is stored as a single Array{T,N+1}. This makes one big allocation instead of lots of small ones. In some cases it might be better to just store an array of pointers. Ideally, this should be easy to switch between, but we're not there yet.
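A rough sketch of that storage scheme (hypothetical sizes):

```julia
using ArraysOfArrays

flat = randn(4, 4, 1000)      # one allocation for 1000 draws of a 4×4 matrix
draws = nestedview(flat, 2)   # behaves like a Vector of 4×4 matrices

draws[1]                      # a view into `flat`, no copy
flatview(draws) === flat      # true: the flat parent array is still accessible
```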
Can you say more about your (draw, chain) matrix? I think this would enforce that each chain has the same length. Is this a requirement? If you're sampling in parallel, I'd think it can make more sense to stop all samplers at the same time than to wait for the slowest one to catch up.
> Can you say more about your (draw, chain) matrix? I think this would enforce that each chain has the same length. Is this a requirement?
No I don't think so in terms of storage. To make the storage ragged, you would need to pad the shorter arrays with missings or NaNs. And the serialization formats we use (NetCDF and Zarr) both support serializing these this way. However, currently I don't think any of our stats/diagnostic functions know how to handle these. I think plotting would still work, but there could be edge cases.
> To make the storage ragged, you would need to pad the shorter arrays

Not necessarily: ArraysOfArrays.jl provides VectorOfArrays, which stores a vector of vectors/arrays of different size in a single flat vector underneath and supports appending.
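For illustration (hypothetical chain lengths), a ragged collection like that might look like:

```julia
using ArraysOfArrays

chains = VectorOfVectors{Float64}()  # elements may have different lengths
push!(chains, randn(1000))           # chain 1: 1000 draws
push!(chains, randn(750))            # chain 2: stopped earlier

length.(chains)   # [1000, 750]
flatview(chains)  # all 1750 values stored in a single flat vector
```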
> > To make the storage ragged, you would need to pad the shorter arrays
>
> Not necessarily: ArraysOfArrays.jl provides VectorOfArrays, which stores a vector of vectors/arrays of different size in a single flat vector underneath and supports appending.
Here I'm talking about our InferenceData. While you can have arbitrary eltypes and use arbitrary array types in InferenceData, to write generic functions for Bayesian workflow, we need access to marginals, which means that samples need to be flattened into a multidimensional array (generally with numeric eltypes). In that context, to support chains with different numbers of draws, you would need to pad the draws from the short chains (or at least represent the array that way).
> to write generic functions for Bayesian workflow, we need access to marginals, which means that samples need to be flattened into a multidimensional array
That is essentially what I built ValueShapes for (switching between structured and flat-array representation). I'm not sure if it can do everything you need already, but it could be modified/extended.
ValueShapes can handle vectors of samples (which get flattened into vectors of vectors, stored as a single flat matrix via ArraysOfArrays).
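A rough sketch of that idea, with hypothetical parameter names:

```julia
using ValueShapes

# Describe a structured parameter space: a scalar `a` and a 2×2 matrix `b`.
shape = NamedTupleShape(a = ScalarShape{Real}(), b = ArrayShape{Real}(2, 2))

totalndof(shape)  # 5: the length of the flat representation
x = randn(5)      # one flat draw
y = shape(x)      # structured (NamedTuple-like) view over x, with fields a and b
```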
> @cscherrer that makes sense, yes. So IIUC, in the LKJCholesky case, you would store the draws as a vector of Cholesky objects. Each of these wraps an array, so in terms of storage it's not so different from a vector of addresses to where arrays are stored, and these arrays I guess can be anywhere in memory. To me this makes a lot of sense for flexibly working with Julia types.
>
> This certainly could fit into InferenceData. There's nothing preventing a (draw, chain) matrix of Cholesky objects (or any arbitrary object), but we would not know how to write generic functions for analysis of such an array.
I think it sounds like a good idea to store results as a (draw, chain) matrix containing NamedTuples, using StructArrays. If necessary, we can provide a function that flattens everything out. One advantage to this is it allows for distributions that return unusual random variables. For instance, I've had cases (studying redistricting in the US House) where samples are taken from a distribution over maps, which can't easily be flattened out into an array.
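For illustration (hypothetical parameter names and sizes), that might look like:

```julia
using StructArrays

# A (draw, chain) matrix of NamedTuples, one per sample.
samples = [(a = randn(3), b = randn(2, 2)) for _ in 1:1000, _ in 1:4]
sa = StructArray(samples)

size(sa)  # (1000, 4)
sa.a      # 1000×4 matrix whose elements are the `a` draws
sa[1, 2]  # the first draw of chain 2, as a NamedTuple
```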
> That is essentially what I built ValueShapes for (switching between structured and flat-array representation). I'm not sure if it can do everything you need already, but it could be modified/extended.
I'll need to look closely at ValueShapes to be sure I understand what it can do.
> I think it sounds like a good idea to store results as a (draw, chain) matrix containing NamedTuples, using StructArrays. If necessary, we can provide a function that flattens everything out. One advantage to this is it allows for distributions that return unusual random variables. For instance, I've had cases (studying redistricting in the US House) where samples are taken from a distribution over maps, which can't easily be flattened out into an array.
If we're gearing towards InferenceData being feasible as a default storage type for Julia PPLs (which I think we should be), then I think we should consider cases like this carefully. Storing types like this is already supported, but we want to be sure that the resulting InferenceData can still be used in generic downstream analysis functions with minimal developer or user grief. I can think of several ways we could support this, each with their own trade-offs.

Since this discussion is orthogonal to this issue, I'll open a new issue to discuss this.
> I think it sounds like a good idea to store results as a (draw, chain) matrix containing NamedTuples, using StructArrays. If necessary, we can provide a function that flattens everything out. One advantage to this is it allows for distributions that return unusual random variables. For instance, I've had cases (studying redistricting in the US House) where samples are taken from a distribution over maps, which can't easily be flattened out into an array.
Yes, this is exactly my concern -- "regular array of floating point values of the same type" is a very strong constraint, and it's very easy to come up with examples where it doesn't make sense.
> If we're gearing towards InferenceData being feasible as a default storage type for Julia PPLs (which I think we should be), then I think we should consider cases like this carefully. Storing types like this is already supported, but we want to be sure that the resulting InferenceData can still be used in generic downstream analysis functions with minimal developer or user grief. I can think of several ways we could support this, each with their own trade-offs.
Thanks @sethaxen. I hope my earlier comment has been useful and not too much of a distraction. Looking at it again, the scope is a lot narrower than I had in mind as I was writing my response.
As for compatibility with downstream tools, it seems similar to the situation with models and samplers. We can write lots of different models, but a given sampler might constrain that space. HMC wants fixed-dimension continuous parameter spaces, etc. It's a reasonable constraint, but it's also reasonable to allow more general models than HMC can handle -- in those cases you just need a different inference method.
EDIT: We can move this comment to the new issue, once that exists.
Closed by #12
Currently by default we interpret the dimensions of arrays as (chain, draw, param_dims...). I propose a breaking change that uses the new default ordering (param_dims..., draw, chain), for several reasons.

TL/DR: The main arguments for this are motivated by efficient memory usage in Julia. The main arguments against this are consistency with Python and current Julia code, but when using deserialized data, this choice actually makes us more consistent with how Python ArviZ does things.
To demonstrate the reasons for doing this, let's generate "data" in a sensible default structure we might expect from a PPL. When sampling with MCMC, each chain is usually sampled separately (whether in parallel or in serial), so the draws are separate in memory. For a given chain, the draws are taken in sequence, so by default for a single draw, the values of the different parameters are stored adjacent to each other in memory. So suppose we have 2 parameters a and b with sizes (3,) and (2, 2). Our samples might then take the form sketched below.

Now we simply concatenate all of these arrays in the sensible default order, i.e. we concatenate all parameter arrays for a given draw, then all draws from the same chain, then all chains.
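A rough sketch of those samples and the concatenation (hypothetical values; 2 chains of 4 draws each):

```julia
ndraws, nchains = 4, 2

# One draw holds a's 3 values followed by b's 4 values, adjacent in memory.
draw() = vcat(randn(3), vec(randn(2, 2)))
chains = [[draw() for _ in 1:ndraws] for _ in 1:nchains]

# Concatenate parameters within a draw, then draws within a chain, then chains.
flat = reduce(vcat, (reduce(vcat, chain) for chain in chains))
length(flat)  # 7 * 4 * 2 = 56
```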
Reason 1: The proposed ordering makes the most sense for memory layout

We've constructed a vector in which parameters that were close in memory during sampling (i.e. in the same array) stay at least as close in memory. But if we just reshape this array, we get the ordering proposed in this issue:
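Continuing the hypothetical sketch above:

```julia
# flat, ndraws, nchains as above; each draw has 7 flattened parameter values
x = reshape(flat, 7, ndraws, nchains)  # (param, draw, chain); shares memory with flat
```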
Reason 2: If all sampled parameters are returned in this format, we can use just reshapes and views to represent all parameter arrays with the proposed dimension ordering
The above array was constructed with everything flattened into a single array, but with no extra allocations we can construct reshaped views of the individual parameter arrays in the proposed shape. Note that a stays flattened while b gets an extra dimension.
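Continuing the same hypothetical sketch (a occupies rows 1:3 of the parameter dimension, b rows 4:7):

```julia
a = view(x, 1:3, :, :)                                  # (3, draw, chain): stays flattened
b = reshape(view(x, 4:7, :, :), 2, 2, ndraws, nchains)  # (2, 2, draw, chain): regains its shape
```

No data is copied in either case.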
Reason 3: We can construct common reshapes with no allocations

The 2 most common reshaping operations we need in ArviZ are 1) flattening array-valued parameters to a vector and 2) combining chain and draw into a single sample dimension. The proposed ordering allows us to do this with reshape and no extra allocations. Because DimensionalData works hard to avoid copying data unless necessary, this ordering should let these common operations be allocation-free, which is especially useful for large datasets.
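Again continuing the hypothetical sketch:

```julia
# 1) flatten b's parameter dimensions to a vector-valued parameter
b_vec = reshape(b, 4, ndraws, nchains)       # (param, draw, chain)

# 2) combine draw and chain into a single sample dimension
b_samp = reshape(b, 2, 2, ndraws * nchains)  # (2, 2, sample)
```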
Reason 4: This ordering is more consistent with how NCDatasets.jl and Zarr.jl both read serialized arrays
Julia uses column-major (i.e. Fortran) ordering, while Numpy uses row-major (i.e. C) ordering. These are opposite, so when an array is serialized into a NetCDF or Zarr format with C ordering, NCDatasets.jl and Zarr.jl deserialize it with the order of all dimensions reversed. Here's an example using InferenceObjectsNetCDF.jl:
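A rough stand-in for that example, assuming a NetCDF file previously written by Python ArviZ (e.g. the centered-eight example) is available locally:

```julia
using InferenceObjectsNetCDF

idata = from_netcdf("centered_eight.nc")  # hypothetical local file written by Python ArviZ
idata.posterior[:theta]                   # Python wrote (chain, draw, school); read back as (school, draw, chain)
```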
Here we can see that theta is deserialized in Julia with the dimensions reversed compared to the order Python serialized it in (i.e. Python wrote it as (chain, draw, school), and it is read out as (school, draw, chain)). The reverse is also true: if in Julia we assume arrays have dimensions in the proposed order, then when we serialize them and they are later deserialized by a Python user, they will come out in the default order used by Python ArviZ. To deliver these deserialized arrays with the same dimension order as Python would require us to use PermutedDimsArray for all arrays, which we certainly could do.

There's a catch here. This means that if Python wrote a parameter x as (chain, draw, xdim0, xdim1), Julia would read it back as (xdim1, xdim0, draw, chain). This is a little nonintuitive, but I think it's fine.
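For illustration, presenting a deserialized array in the Python order without copying might look like:

```julia
# suppose a parameter was deserialized by NCDatasets.jl/Zarr.jl as (school, draw, chain)
theta = randn(8, 500, 4)

# view it in the Python ArviZ order (chain, draw, school) without copying the data
theta_py = PermutedDimsArray(theta, (3, 2, 1))
size(theta_py)  # (4, 500, 8)
```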
Reason 5: This is the dimension ordering planned for MCMCDiagnosticTools

We plan to port our diagnostic functions to MCMCDiagnosticTools, and in that package we seem to be converging on the ordering proposed here as the ordering to require (see https://github.com/TuringLang/MCMCDiagnosticTools.jl/issues/5). When the proposed default ordering is used, these functions could be called directly on our DimensionalArrays with reshaping only, so there would be no copying of data. (cc @ParadaCarleton and @devmotion in case their viewpoints have changed)

Drawbacks
There are a few drawbacks to this change, but in my opinion they are outweighed by the advantages:

- The sample dimensions now come last instead of first, so they can no longer be accessed with fixed integer indices. The last dimension can still be accessed with end, but this cannot be set to an index variable, and one can't use size(arr, end). I don't think this is a major issue because users should probably always be interacting with the dimensions through the DimensionalData interface and not assuming a specific ordering of dimensions.
- Users who currently pass arrays in the old ordering to constructors like from_namedtuple would need to update code. This is mitigated by marking the change as breaking, but further, in most cases an error will be raised if the user assumes the old ordering, since the sample dimensions will be misaligned.
- To be maximally consistent with data deserialized from Python, we could instead use the ordering (reverse(param_dims)..., draw, chain), but this would just make it so Julia always uses nonintuitive parameter dimensions, and this lacks the nice memory layout properties noted above, so I don't think this makes sense. This point is an argument for using PermutedDimsArray every time we (de)serialize data.

Most importantly, as with Python ArviZ, all code that consumes InferenceData or Dataset should work the same regardless of where chain and draw appear in the dimension ordering, so this is more about which default ordering makes the most sense in the end.