arviz-devs / InferenceObjects.jl

Storage for results of Bayesian inference
https://julia.arviz.org/InferenceObjects
MIT License

Changing the default dimension order #8

Closed · sethaxen closed this 2 years ago

sethaxen commented 2 years ago

Currently by default we interpret the dimensions of arrays as (chain, draw, param_dims...). I propose a breaking change that uses the new default ordering (param_dims..., draw, chain) for several reasons.

TL;DR: The main arguments for this change are motivated by efficient memory usage in Julia. The main arguments against it are consistency with Python ArviZ and with current Julia code; but when working with deserialized data, this choice actually makes us more consistent with how Python ArviZ does things.

To demonstrate the reasons for doing this, let's generate "data" in a sensible default structure we might expect from a PPL. When sampling with MCMC, each chain is usually sampled separately (whether in parallel or serially), so the draws of different chains are separate in memory. Within a chain, the draws are taken in sequence, and within a single draw, the values of the different parameters are by default stored adjacent to each other in memory. So suppose we have 2 parameters a and b with sizes (3,) and (2, 2). Our samples might then take the form:

julia> nchains = 4;

julia> ndraws = 10;

julia> nparams = 7;

julia> samples = map(1:nchains) do chain
           map(1:ndraws) do draw
               a = [("a[$i]", draw, chain) for i in 1:3]
               b = [("b[$i,$j]", draw, chain) for i in 1:2, j in 1:2]
               return vcat(vec(a), vec(b))
           end
       end;

julia> samples[1][1]  # let's look at the parameters for a single chain and draw
7-element Vector{Tuple{String, Int64, Int64}}:
 ("a[1]", 1, 1)
 ("a[2]", 1, 1)
 ("a[3]", 1, 1)
 ("b[1,1]", 1, 1)
 ("b[2,1]", 1, 1)
 ("b[1,2]", 1, 1)
 ("b[2,2]", 1, 1)

Now we simply concatenate all of these arrays in the sensible default order: all parameter arrays for a given draw, then all draws from the same chain, then all chains.

julia> samples_cat = mapreduce(x -> reduce(vcat, x), vcat, samples)
280-element Vector{Tuple{String, Int64, Int64}}:
 ("a[1]", 1, 1)
 ("a[2]", 1, 1)
 ("a[3]", 1, 1)
 ("b[1,1]", 1, 1)
 ("b[2,1]", 1, 1)
 ("b[1,2]", 1, 1)
 ("b[2,2]", 1, 1)
 ("a[1]", 2, 1)
 ("a[2]", 2, 1)
 ("a[3]", 2, 1)
 ⋮
 ("b[1,2]", 9, 4)
 ("b[2,2]", 9, 4)
 ("a[1]", 10, 4)
 ("a[2]", 10, 4)
 ("a[3]", 10, 4)
 ("b[1,1]", 10, 4)
 ("b[2,1]", 10, 4)
 ("b[1,2]", 10, 4)
 ("b[2,2]", 10, 4)

Reason 1: the proposed ordering makes the most sense for memory layout

We've constructed a vector whose ordering keeps parameters that were close in memory during sampling (i.e. in the same array) at least as close in memory. If we now just reshape this vector, we get the ordering proposed in this issue:

julia> samples_reshape = reshape(samples_cat, :, ndraws, nchains)
7×10×4 Array{Tuple{String, Int64, Int64}, 3}:
[:, :, 1] =
 ("a[1]", 1, 1)    ("a[1]", 2, 1)    ("a[1]", 3, 1)    …  ("a[1]", 9, 1)    ("a[1]", 10, 1)
 ("a[2]", 1, 1)    ("a[2]", 2, 1)    ("a[2]", 3, 1)       ("a[2]", 9, 1)    ("a[2]", 10, 1)
 ("a[3]", 1, 1)    ("a[3]", 2, 1)    ("a[3]", 3, 1)       ("a[3]", 9, 1)    ("a[3]", 10, 1)
 ("b[1,1]", 1, 1)  ("b[1,1]", 2, 1)  ("b[1,1]", 3, 1)     ("b[1,1]", 9, 1)  ("b[1,1]", 10, 1)
 ("b[2,1]", 1, 1)  ("b[2,1]", 2, 1)  ("b[2,1]", 3, 1)     ("b[2,1]", 9, 1)  ("b[2,1]", 10, 1)
 ("b[1,2]", 1, 1)  ("b[1,2]", 2, 1)  ("b[1,2]", 3, 1)  …  ("b[1,2]", 9, 1)  ("b[1,2]", 10, 1)
 ("b[2,2]", 1, 1)  ("b[2,2]", 2, 1)  ("b[2,2]", 3, 1)     ("b[2,2]", 9, 1)  ("b[2,2]", 10, 1)

[:, :, 2] =
 ("a[1]", 1, 2)    ("a[1]", 2, 2)    ("a[1]", 3, 2)    …  ("a[1]", 9, 2)    ("a[1]", 10, 2)
 ("a[2]", 1, 2)    ("a[2]", 2, 2)    ("a[2]", 3, 2)       ("a[2]", 9, 2)    ("a[2]", 10, 2)
 ("a[3]", 1, 2)    ("a[3]", 2, 2)    ("a[3]", 3, 2)       ("a[3]", 9, 2)    ("a[3]", 10, 2)
 ("b[1,1]", 1, 2)  ("b[1,1]", 2, 2)  ("b[1,1]", 3, 2)     ("b[1,1]", 9, 2)  ("b[1,1]", 10, 2)
 ("b[2,1]", 1, 2)  ("b[2,1]", 2, 2)  ("b[2,1]", 3, 2)     ("b[2,1]", 9, 2)  ("b[2,1]", 10, 2)
 ("b[1,2]", 1, 2)  ("b[1,2]", 2, 2)  ("b[1,2]", 3, 2)  …  ("b[1,2]", 9, 2)  ("b[1,2]", 10, 2)
 ("b[2,2]", 1, 2)  ("b[2,2]", 2, 2)  ("b[2,2]", 3, 2)     ("b[2,2]", 9, 2)  ("b[2,2]", 10, 2)

[:, :, 3] =
 ("a[1]", 1, 3)    ("a[1]", 2, 3)    ("a[1]", 3, 3)    …  ("a[1]", 9, 3)    ("a[1]", 10, 3)
 ("a[2]", 1, 3)    ("a[2]", 2, 3)    ("a[2]", 3, 3)       ("a[2]", 9, 3)    ("a[2]", 10, 3)
 ("a[3]", 1, 3)    ("a[3]", 2, 3)    ("a[3]", 3, 3)       ("a[3]", 9, 3)    ("a[3]", 10, 3)
 ("b[1,1]", 1, 3)  ("b[1,1]", 2, 3)  ("b[1,1]", 3, 3)     ("b[1,1]", 9, 3)  ("b[1,1]", 10, 3)
 ("b[2,1]", 1, 3)  ("b[2,1]", 2, 3)  ("b[2,1]", 3, 3)     ("b[2,1]", 9, 3)  ("b[2,1]", 10, 3)
 ("b[1,2]", 1, 3)  ("b[1,2]", 2, 3)  ("b[1,2]", 3, 3)  …  ("b[1,2]", 9, 3)  ("b[1,2]", 10, 3)
 ("b[2,2]", 1, 3)  ("b[2,2]", 2, 3)  ("b[2,2]", 3, 3)     ("b[2,2]", 9, 3)  ("b[2,2]", 10, 3)

[:, :, 4] =
 ("a[1]", 1, 4)    ("a[1]", 2, 4)    ("a[1]", 3, 4)    …  ("a[1]", 9, 4)    ("a[1]", 10, 4)
 ("a[2]", 1, 4)    ("a[2]", 2, 4)    ("a[2]", 3, 4)       ("a[2]", 9, 4)    ("a[2]", 10, 4)
 ("a[3]", 1, 4)    ("a[3]", 2, 4)    ("a[3]", 3, 4)       ("a[3]", 9, 4)    ("a[3]", 10, 4)
 ("b[1,1]", 1, 4)  ("b[1,1]", 2, 4)  ("b[1,1]", 3, 4)     ("b[1,1]", 9, 4)  ("b[1,1]", 10, 4)
 ("b[2,1]", 1, 4)  ("b[2,1]", 2, 4)  ("b[2,1]", 3, 4)     ("b[2,1]", 9, 4)  ("b[2,1]", 10, 4)
 ("b[1,2]", 1, 4)  ("b[1,2]", 2, 4)  ("b[1,2]", 3, 4)  …  ("b[1,2]", 9, 4)  ("b[1,2]", 10, 4)
 ("b[2,2]", 1, 4)  ("b[2,2]", 2, 4)  ("b[2,2]", 3, 4)     ("b[2,2]", 9, 4)  ("b[2,2]", 10, 4)

Reason 2: If all sampled parameters are returned in this format, we can use just reshapes and views to represent all parameter arrays with the proposed dimension ordering

The above array has all parameters flattened together, but with no extra allocations we can construct reshaped views of the individual parameter arrays in the proposed shape. Note that a keeps its single flattened dimension, while b's flattened parameter dimension is restored to 2×2.

julia> a = @views samples_reshape[1:3, :, :]
3×10×4 view(::Array{Tuple{String, Int64, Int64}, 3}, 1:3, :, :) with eltype Tuple{String, Int64, Int64}:
[:, :, 1] =
 ("a[1]", 1, 1)  ("a[1]", 2, 1)  ("a[1]", 3, 1)  …  ("a[1]", 8, 1)  ("a[1]", 9, 1)  ("a[1]", 10, 1)
 ("a[2]", 1, 1)  ("a[2]", 2, 1)  ("a[2]", 3, 1)     ("a[2]", 8, 1)  ("a[2]", 9, 1)  ("a[2]", 10, 1)
 ("a[3]", 1, 1)  ("a[3]", 2, 1)  ("a[3]", 3, 1)     ("a[3]", 8, 1)  ("a[3]", 9, 1)  ("a[3]", 10, 1)

[:, :, 2] =
 ("a[1]", 1, 2)  ("a[1]", 2, 2)  ("a[1]", 3, 2)  …  ("a[1]", 8, 2)  ("a[1]", 9, 2)  ("a[1]", 10, 2)
 ("a[2]", 1, 2)  ("a[2]", 2, 2)  ("a[2]", 3, 2)     ("a[2]", 8, 2)  ("a[2]", 9, 2)  ("a[2]", 10, 2)
 ("a[3]", 1, 2)  ("a[3]", 2, 2)  ("a[3]", 3, 2)     ("a[3]", 8, 2)  ("a[3]", 9, 2)  ("a[3]", 10, 2)

[:, :, 3] =
 ("a[1]", 1, 3)  ("a[1]", 2, 3)  ("a[1]", 3, 3)  …  ("a[1]", 8, 3)  ("a[1]", 9, 3)  ("a[1]", 10, 3)
 ("a[2]", 1, 3)  ("a[2]", 2, 3)  ("a[2]", 3, 3)     ("a[2]", 8, 3)  ("a[2]", 9, 3)  ("a[2]", 10, 3)
 ("a[3]", 1, 3)  ("a[3]", 2, 3)  ("a[3]", 3, 3)     ("a[3]", 8, 3)  ("a[3]", 9, 3)  ("a[3]", 10, 3)

[:, :, 4] =
 ("a[1]", 1, 4)  ("a[1]", 2, 4)  ("a[1]", 3, 4)  …  ("a[1]", 8, 4)  ("a[1]", 9, 4)  ("a[1]", 10, 4)
 ("a[2]", 1, 4)  ("a[2]", 2, 4)  ("a[2]", 3, 4)     ("a[2]", 8, 4)  ("a[2]", 9, 4)  ("a[2]", 10, 4)
 ("a[3]", 1, 4)  ("a[3]", 2, 4)  ("a[3]", 3, 4)     ("a[3]", 8, 4)  ("a[3]", 9, 4)  ("a[3]", 10, 4)

julia> b = @views reshape(samples_reshape[4:7, :, :], 2, 2, ndraws, nchains)
2×2×10×4 reshape(view(::Array{Tuple{String, Int64, Int64}, 3}, 4:7, :, :), 2, 2, 10, 4) with eltype Tuple{String, Int64, Int64}:
[:, :, 1, 1] =
 ("b[1,1]", 1, 1)  ("b[1,2]", 1, 1)
 ("b[2,1]", 1, 1)  ("b[2,2]", 1, 1)

[:, :, 2, 1] =
 ("b[1,1]", 2, 1)  ("b[1,2]", 2, 1)
 ("b[2,1]", 2, 1)  ("b[2,2]", 2, 1)

[:, :, 3, 1] =
 ("b[1,1]", 3, 1)  ("b[1,2]", 3, 1)
 ("b[2,1]", 3, 1)  ("b[2,2]", 3, 1)

[:, :, 4, 1] =
 ("b[1,1]", 4, 1)  ("b[1,2]", 4, 1)
 ("b[2,1]", 4, 1)  ("b[2,2]", 4, 1)

[:, :, 5, 1] =
 ("b[1,1]", 5, 1)  ("b[1,2]", 5, 1)
 ("b[2,1]", 5, 1)  ("b[2,2]", 5, 1)

...

[:, :, 10, 4] =
 ("b[1,1]", 10, 4)  ("b[1,2]", 10, 4)
 ("b[2,1]", 10, 4)  ("b[2,2]", 10, 4)

Reason 3: We can construct common reshapes with no allocations

The 2 most common reshaping operations we need in ArviZ are 1) flattening array-valued parameters to a vector and 2) combining chain and draw into a single sample dimension. The proposed ordering lets us do both with reshape and no extra allocations.

julia> reshape(a, :, ndraws * nchains)
3×40 reshape(view(::Array{Tuple{String, Int64, Int64}, 3}, 1:3, :, :), 3, 40) with eltype Tuple{String, Int64, Int64}:
 ("a[1]", 1, 1)  ("a[1]", 2, 1)  ("a[1]", 3, 1)  …  ("a[1]", 8, 4)  ("a[1]", 9, 4)  ("a[1]", 10, 4)
 ("a[2]", 1, 1)  ("a[2]", 2, 1)  ("a[2]", 3, 1)     ("a[2]", 8, 4)  ("a[2]", 9, 4)  ("a[2]", 10, 4)
 ("a[3]", 1, 1)  ("a[3]", 2, 1)  ("a[3]", 3, 1)     ("a[3]", 8, 4)  ("a[3]", 9, 4)  ("a[3]", 10, 4)

julia> reshape(b, :, ndraws * nchains)
4×40 reshape(view(::Array{Tuple{String, Int64, Int64}, 3}, 4:7, :, :), 4, 40) with eltype Tuple{String, Int64, Int64}:
 ("b[1,1]", 1, 1)  ("b[1,1]", 2, 1)  ("b[1,1]", 3, 1)  …  ("b[1,1]", 9, 4)  ("b[1,1]", 10, 4)
 ("b[2,1]", 1, 1)  ("b[2,1]", 2, 1)  ("b[2,1]", 3, 1)     ("b[2,1]", 9, 4)  ("b[2,1]", 10, 4)
 ("b[1,2]", 1, 1)  ("b[1,2]", 2, 1)  ("b[1,2]", 3, 1)     ("b[1,2]", 9, 4)  ("b[1,2]", 10, 4)
 ("b[2,2]", 1, 1)  ("b[2,2]", 2, 1)  ("b[2,2]", 3, 1)     ("b[2,2]", 9, 4)  ("b[2,2]", 10, 4)

julia> reshape(samples_reshape, :, ndraws * nchains)
7×40 Matrix{Tuple{String, Int64, Int64}}:
 ("a[1]", 1, 1)    ("a[1]", 2, 1)    ("a[1]", 3, 1)    …  ("a[1]", 9, 4)    ("a[1]", 10, 4)
 ("a[2]", 1, 1)    ("a[2]", 2, 1)    ("a[2]", 3, 1)       ("a[2]", 9, 4)    ("a[2]", 10, 4)
 ("a[3]", 1, 1)    ("a[3]", 2, 1)    ("a[3]", 3, 1)       ("a[3]", 9, 4)    ("a[3]", 10, 4)
 ("b[1,1]", 1, 1)  ("b[1,1]", 2, 1)  ("b[1,1]", 3, 1)     ("b[1,1]", 9, 4)  ("b[1,1]", 10, 4)
 ("b[2,1]", 1, 1)  ("b[2,1]", 2, 1)  ("b[2,1]", 3, 1)     ("b[2,1]", 9, 4)  ("b[2,1]", 10, 4)
 ("b[1,2]", 1, 1)  ("b[1,2]", 2, 1)  ("b[1,2]", 3, 1)  …  ("b[1,2]", 9, 4)  ("b[1,2]", 10, 4)
 ("b[2,2]", 1, 1)  ("b[2,2]", 2, 1)  ("b[2,2]", 3, 1)     ("b[2,2]", 9, 4)  ("b[2,2]", 10, 4)

Because DimensionalData works hard to avoid copying data unless necessary, this ordering should let these common operations be allocation-free, which is especially useful for large datasets.
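
For illustration, here's a minimal sketch wrapping a parameter like a in a DimensionalData DimArray with the proposed ordering (the dimension name :a_dim_0 and the lookup values are illustrative, not part of the schema):

julia> using DimensionalData

julia> a_da = DimArray(rand(3, 10, 4), (Dim{:a_dim_0}(1:3), Dim{:draw}(1:10), Dim{:chain}(1:4)));

julia> size(reshape(parent(a_da), 3, :))  # combine draw and chain without copying the data
(3, 40)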

Reason 4: This ordering is more consistent with how NCDatasets.jl and Zarr.jl both read serialized arrays

Julia uses column-major (i.e. Fortran) ordering, while NumPy uses row-major (i.e. C) ordering. These are opposites, so when an array is serialized into a NetCDF or Zarr format with C ordering, NCDatasets.jl and Zarr.jl deserialize it with the order of all dimensions reversed. Here's an example using InferenceObjectsNetCDF.jl:

julia> using InferenceObjectsNetCDF, HTTP, NCDatasets

julia> resp = HTTP.get("https://github.com/arviz-devs/arviz_example_data/blob/main/data/centered_eight.nc?raw=true");

julia> ds = NCDataset("centered_eight", "r"; memory = resp.body);

julia> idata = from_netcdf(ds)
InferenceData with groups:
  > posterior
  > posterior_predictive
  > sample_stats
  > prior
  > observed_data

julia> idata.posterior
Dataset with dimensions: 
  Dim{:draw} Sampled Int64[0, 1, …, 498, 499] ForwardOrdered Irregular Points,
  Dim{:chain} Sampled Int64[0, 1, 2, 3] ForwardOrdered Irregular Points,
  Dim{:school} Categorical String[Choate, Deerfield, …, St. Paul's, Mt. Hermon] Unordered
and 3 layers:
  :mu    Union{Missing, Float64} dims: Dim{:draw}, Dim{:chain} (500×4)
  :theta Union{Missing, Float64} dims: Dim{:school}, Dim{:draw}, Dim{:chain} (8×500×4)
  :tau   Union{Missing, Float64} dims: Dim{:draw}, Dim{:chain} (500×4)

with metadata OrderedCollections.OrderedDict{Symbol, Any} with 3 entries:
  :created_at                => "2019-06-21T17:36:34.398087"
  :inference_library         => "pymc3"
  :inference_library_version => "3.7"

julia> close(ds)

Here we can see that theta is deserialized in Julia with the dimensions reversed compared to the order Python serialized it in (i.e. Python wrote it as (chain, draw, school), and it is read out as (school, draw, chain)). The reverse also holds: if in Julia we assume arrays have dimensions in the proposed order, then when we serialize them and a Python user deserializes them, they will come out in the default order used by Python ArviZ. Delivering deserialized arrays with the same dimension order as Python would require us to use PermutedDimsArray for all arrays, which we certainly could do.
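
For example, here's a minimal sketch of that PermutedDimsArray option using only Base Julia (the sizes match the centered_eight theta above):

julia> theta = rand(8, 500, 4);  # (school, draw, chain), as deserialized

julia> theta_py = PermutedDimsArray(theta, (3, 2, 1));  # lazy (chain, draw, school) view, no copy

julia> size(theta_py)
(4, 500, 8)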

There's a catch here. This means that if Python wrote a parameter x as (chain, draw, xdim0, xdim1), Julia would read it back as (xdim1, xdim0, draw, chain). This is a little nonintuitive, but I think it's fine.

Reason 5: This is the dimension ordering planned for MCMCDiagnosticTools

We plan to port our diagnostic functions to MCMCDiagnosticTools, and in that package we seem to be converging on the ordering proposed here as the required ordering (see https://github.com/TuringLang/MCMCDiagnosticTools.jl/issues/5). If the proposed default ordering is used, these functions could be called directly on our DimensionalData arrays with reshaping only, so there would be no copying of data. (cc @ParadaCarleton and @devmotion in case their viewpoints have changed)

Drawbacks

There are a few drawbacks to this change, but in my opinion they are outweighed by the advantages:

  1. Having the "special" dimensions first is arguably nicer. In Python one can always use the index -1 to mean the last dimension; in Julia one would use end, but end cannot be stored in an index variable, and one cannot write size(arr, end) (a short illustration follows this list). I don't think this is a major issue, because users should interact with the dimensions through the DimensionalData interface rather than assuming a specific ordering of dimensions.
  2. It's not consistent with Python ArviZ. I think this is fine. This change is motivated by memory layouts in the two languages, and we don't want to pretend Julia is Python.
  3. It's different from what we currently do. Anyone using from_namedtuple would need to update code. This is mitigated by marking the change as breaking, but further, in most cases an error will be raised if the user assumes the old ordering, since the sample dimensions will be misaligned.
  4. It might be nonintuitive when working with example datasets that have matrix-valued parameters, since the order of the 2 parameter dimensions will be swapped. This is currently not an issue, since all example datasets have only a single parameter dimension. It is more an issue of serialization and the different memory layouts of the 2 languages than anything else. I guess this could motivate the alternative (reverse(param_dims)..., draw, chain), but that would just make Julia always use nonintuitive parameter dimensions, and it lacks the nice memory layout properties noted above, so I don't think it makes sense. This point is an argument for using PermutedDimsArray every time we (de)serialize data.
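
To illustrate the indexing point in drawback 1, a quick sketch in plain Base Julia:

julia> arr = rand(7, 10, 4);

julia> arr[:, :, end];  # `end` works inside indexing brackets

julia> size(arr, ndims(arr))  # but `size(arr, end)` is a syntax error; use ndims instead
4

julia> # i = end  # also a syntax error: `end` cannot be stored in a variable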

Most importantly, as with Python ArviZ, all code that consumes InferenceData or Dataset should work the same regardless of where chain and draw appear in the dimension ordering, so this is more about which default ordering makes the most sense in the end.

sethaxen commented 2 years ago

@cscherrer you've thought about this for SampleChains. Correct me if I'm wrong, but Soss/Tilde indeed represent draws as parameter vectors for sampling with HMC, but after applying transformations to constrained samples, draws are represented as NamedTuples of arrays. Since SampleChains is geared towards sampling using iterators, it chooses a memory layout that for a given parameter places all draws of that parameter from a single chain within a single array. For univariate parameters, this would motivate the memory layout (draw, param_dims..., chain), so this doesn't prefer either of the 2 dimension orderings discussed here. Do I have that right?

ahartikainen commented 2 years ago

No dimensions should be "special"; that is the main reason for named dimensions. Python ArviZ just uses this order rule for from_xyz (one ordering needed to be defined so everyone is on the same page).

As in Python's xarray, slicing/reordering dimensions (in a view) should be doable if needed.

Memory layouts are good reasons to use specific order.

cscherrer commented 2 years ago

The underlying data structure for a TupleVector (which SampleChains uses under the hood) is a (possibly nested) NamedTuple, where each leaf is an ArrayOfArrays (from the package of similar name). This is then given an interface as a Vector of NamedTuples, similar to the approach used by StructArrays.

As for why I took this approach... Say you have a model that draws from an LKJ prior. That might look something like

julia> t = as(LKJCholesky(4))
CorrCholesky(4)

julia> TV.dimension(t)
6

julia> x = randn(6)
6-element Vector{Float64}:
 -0.634943
  1.18258
 -0.742168
  0.13168
  0.0724018
 -0.162739

julia> y = TV.transform(t, x)
Cholesky{Float64, UpperTriangular{Float64, Matrix{Float64}}}
U factor:
4×4 UpperTriangular{Float64, Matrix{Float64}}:
 1.0  -0.307219   0.530824   0.0657448
  ⋅    0.951639  -0.300805   0.0361068
  ⋅     ⋅         0.792302  -0.0809617
  ⋅     ⋅          ⋅         0.993891

For evaluating the log-density, x by itself is not enough. Our model likely requires BLAS operations on y. So if we only used x, we'd constantly be allocating for y anyway.

I'd much rather have x used only for temporary storage during sampling, and represent results as they'll actually be used.

I think in our next iteration this will be something like (for a fixed number of samples):

  1. Start with the vector the sampler uses internally
  2. Apply the transform, to see what each row of the result will look like
  3. Build a TupleVector around this, after which memory allocation should be (at least mostly) complete
  4. At each sampler iteration, the vector changes in place, and writes to the current row of the sample result

Oh, and about iterators... We use not only ArraysOfArrays.jl but also ElasticArrays.jl. This lets each leaf support a push! operation, even if the underlying data is a non-vector array. push! is then propagated up to the TupleVector level, so we can extend a row at a time if we want.
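
A minimal standalone sketch of that ElasticArrays.jl pattern (not SampleChains internals):

julia> using ElasticArrays

julia> E = ElasticArray{Float64}(undef, 2, 2, 0);  # 2×2 draws, none stored yet

julia> append!(E, randn(2, 2));  # grow the trailing dimension in place by one draw

julia> size(E)
(2, 2, 1)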

cscherrer commented 2 years ago

@oschulz has also thought a lot about this, so he might have more insights

sethaxen commented 2 years ago

@cscherrer that makes sense, yes. So IIUC, in the LKJCholesky case, you would store the draws as a vector of Cholesky objects. Each of these wraps an array, so in terms of storage it's not so different from a vector of addresses to where arrays are stored, and these arrays I guess can be anywhere in memory. To me this makes a lot of sense for flexibly working with Julia types.

This certainly could fit into InferenceData. There's nothing preventing a (draw, chain) matrix of Cholesky objects (or any arbitrary object), but we would not know how to write generic functions for analysis of such an array. For InferenceData, we're concerned primarily with steps in Bayesian workflow: plotting, statistics, diagnostics, and (de)serialization (for deposition, communication, and reproducibility). For most of these, we need access to marginals, and once you have a way of getting marginals, everything fits into a multidimensional array structure with numeric eltypes, which is perfect for serialization to data structures that can be loaded by other languages with no knowledge of the Julian origins of the data. Since we can attach arbitrary metadata to a parameter, we may at some point add to the schema a standard approach for PPLs to map an array of marginals to the Julia representation the PPL prefers, but I'm not certain how useful this would be unless that PPL was also using InferenceData as a native storage type.

cscherrer commented 2 years ago

So IIUC, in the LKJCholesky case, you would store the draws as a vector of Cholesky objects. Each of these wraps an array, so in terms of storage it's not so different from a vector of addresses to where arrays are stored, and these arrays I guess can be anywhere in memory.

Currently using ArraysOfArrays, a vector of Array{T,N}s is stored as a single Array{T,N+1}. This makes one big allocation instead of lots of small ones. In some cases it might be better to just store an array of pointers. Ideally, this should be easy to switch between, but we're not there yet.
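
For example, with ArraysOfArrays.jl's documented nestedview:

julia> using ArraysOfArrays

julia> flat = randn(2, 2, 10);  # 10 draws of a 2×2 parameter in one contiguous Array{Float64,3}

julia> v = nestedview(flat, 2);  # view as a 10-element vector of 2×2 matrices, no copy

julia> size(v[1])
(2, 2)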

Can you say more about your (draw, chain) matrix? I think this would enforce that each chain has the same length. Is this a requirement? If you're sampling in parallel, I'd think it can make more sense to stop all samplers at the same time than to wait for the slowest one to catch up.

sethaxen commented 2 years ago

Can you say more about your (draw, chain) matrix? I think this would enforce that each chain has the same length. Is this a requirement?

No, I don't think so, in terms of storage. To make the storage ragged, you would need to pad the shorter arrays with missings or NaNs, and the serialization formats we use (NetCDF and Zarr) both support serializing them this way. However, currently I don't think any of our stats/diagnostic functions know how to handle these. I think plotting would still work, but there could be edge cases.
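
A sketch of what that padding could look like (the chain lengths here are hypothetical):

julia> chains = [randn(3, 100), randn(3, 80)];  # 2 chains of unequal length for a length-3 parameter

julia> nmax = maximum(size(c, 2) for c in chains);

julia> padded = Array{Union{Missing,Float64}}(missing, 3, nmax, length(chains));

julia> for (k, c) in enumerate(chains)
           padded[:, 1:size(c, 2), k] = c
       end

julia> count(ismissing, padded)  # the shorter chain contributes 20 missing draws × 3 parameters
60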

oschulz commented 2 years ago

To make the storage ragged, you would need to pad the shorter arrays

Not necessarily: ArraysOfArrays.jl provides VectorOfArrays, which stores a vector of vectors/arrays of different sizes in a single flat vector underneath and supports appending.
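
For example, using the documented ArraysOfArrays.jl API:

julia> using ArraysOfArrays

julia> v = VectorOfVectors{Float64}();

julia> push!(v, randn(100)); push!(v, randn(80));  # chains of different lengths

julia> length(flatview(v))  # one flat backing vector, appended in place
180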

sethaxen commented 2 years ago

To make the storage ragged, you would need to pad the shorter arrays

Not necessarily: ArraysOfArrays.jl provides VectorOfArrays, which stores a vector of vectors/arrays of different sizes in a single flat vector underneath and supports appending.

Here I'm talking about our InferenceData. While you can have arbitrary eltypes and use arbitrary array types in InferenceData, to write generic functions for Bayesian workflow, we need access to marginals, which means that samples need to be flattened into a multidimensional array (generally with numeric eltypes). In that context, to support chains with different numbers of draws, you would need to pad the draws from the short chains (or at least represent the array that way).

oschulz commented 2 years ago

to write generic functions for Bayesian workflow, we need access to marginals, which means that samples need to be flattened into a multidimensional array

That is essentially what I built ValueShapes for (switching between structured and flat-array representation). I'm not sure if it can do everything you need already, but it could be modified/extended.

ValueShapes can handle vectors of samples (which get flattened into vectors of vectors, stored as a single flat matrix via ArraysOfArrays).
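
A rough sketch of the idea, based on ValueShapes' documented API (details may have drifted since):

julia> using ValueShapes

julia> shape = NamedTupleShape(a = ArrayShape{Real}(3), b = ArrayShape{Real}(2, 2));

julia> totalndof(shape)
7

julia> shape(collect(1.0:7.0));  # unflatten one draw: (a = [1.0, 2.0, 3.0], b = [4.0 6.0; 5.0 7.0])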

ParadaCarleton commented 2 years ago

@cscherrer that makes sense, yes. So IIUC, in the LKJCholesky case, you would store the draws as a vector of Cholesky objects. Each of these wraps an array, so in terms of storage it's not so different from a vector of addresses to where arrays are stored, and these arrays I guess can be anywhere in memory. To me this makes a lot of sense for flexibly working with Julia types.

This certainly could fit into InferenceData. There's nothing preventing a (draw, chain) matrix of Cholesky objects (or any arbitrary object), but we would not know how to write generic functions for analysis of such an array.

I think it sounds like a good idea to store results as a (draw, chain) matrix containing NamedTuples, using StructArrays. If necessary, we can provide a function that flattens everything out. One advantage of this is that it allows for distributions that return unusual random variables. For instance, I've had cases (studying redistricting in the US House) where samples are taken from a distribution over maps, which can't easily be flattened into an array.
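
A minimal sketch of that suggestion (the field names mu and tau are illustrative):

julia> using StructArrays

julia> draws = StructArray((mu = randn(500, 4), tau = abs.(randn(500, 4))));  # a (draw, chain) matrix of NamedTuples

julia> draws[1, 1];  # one draw from one chain, as a NamedTuple

julia> draws.mu;  # the underlying 500×4 component array, no copy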

sethaxen commented 2 years ago

That is essentially what I built ValueShapes for (switching between structured and flat-array representation). I'm not sure if it can do everything you need already, but it could be modified/extended.

I'll need to look closely at ValueShapes to be sure I understand what it can do.

I think it sounds like a good idea to store results as a (draw, chain) matrix containing NamedTuples, using StructArrays. If necessary, we can provide a function that flattens everything out. One advantage of this is that it allows for distributions that return unusual random variables. For instance, I've had cases (studying redistricting in the US House) where samples are taken from a distribution over maps, which can't easily be flattened into an array.

If we're gearing towards InferenceData being feasible as a default storage type for Julia PPLs (which I think we should be), then I think we should consider cases like this carefully. Storing types like this is already supported, but we want to be sure that the resulting InferenceData can still be used in generic downstream analysis functions with minimal developer or user grief. I can think of several ways we could support this, each with their own trade-offs.

Since this discussion is orthogonal to this issue, I'll open a new issue to discuss this.

cscherrer commented 2 years ago

I think it sounds like a good idea to store results as a (draw, chain) matrix containing NamedTuples, using StructArrays. If necessary, we can provide a function that flattens everything out. One advantage of this is that it allows for distributions that return unusual random variables. For instance, I've had cases (studying redistricting in the US House) where samples are taken from a distribution over maps, which can't easily be flattened into an array.

Yes, this is exactly my concern -- "regular array of floating point values of the same type" is a very strong constraint, and it's very easy to come up with examples where it doesn't make sense.

If we're gearing towards InferenceData being feasible as a default storage type for Julia PPLs (which I think we should be), then I think we should consider cases like this carefully. Storing types like this is already supported, but we want to be sure that the resulting InferenceData can still be used in generic downstream analysis functions with minimal developer or user grief. I can think of several ways we could support this, each with their own trade-offs.

Thanks @sethaxen. I hope my earlier comment has been useful and not too much of a distraction. Looking at it again, the scope is a lot narrower than I had in mind as I was writing my response.

As for compatibility with downstream tools, it seems similar to the situation with models and samplers. We can write lots of different models, but a given sampler might constrain that space. HMC wants fixed-dimension continuous parameter spaces, etc. It's a reasonable constraint, but it's also reasonable to allow more general models than HMC can handle -- in those cases you just need a different inference method.

EDIT: We can move this comment to the new issue, once that exists.

sethaxen commented 2 years ago

Closed by #12