sethaxen opened this issue 3 years ago

If we implement the `Random.Sampler` interface, then we can draw IID samples from the distributions: https://docs.julialang.org/en/v1/stdlib/Random/#Generating-random-values-of-custom-types. Distributions tends to make scalar and array distributions return `Array`s, but should we instead return a vector of draws, e.g. a vector of arrays?
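For reference, a minimal sketch of the hook described in the linked docs, using a made-up scalar distribution (`FoldedNormal` and its sampling rule are purely illustrative):

```julia
using Random

struct FoldedNormal end  # hypothetical toy distribution, for illustration only

Base.eltype(::Type{FoldedNormal}) = Float64
Random.rand(rng::AbstractRNG, ::Random.SamplerTrivial{FoldedNormal}) = abs(randn(rng))

rand(FoldedNormal())      # a single draw
rand(FoldedNormal(), 10)  # 10 IID draws, returned as a Vector{Float64}
```

Defining the single `SamplerTrivial` method is enough for `rand(d, dims...)` to work, which is what makes the return-type question above worth settling.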
Yes, for multivariate distributions we should default to a vector of vectors (respectively svectors) of draws, but with the option for `rand!` to write into arrays or reinterpreted arrays instead of a vector of (s)vectors.
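The two layouts can even share memory; a sketch using `StaticArrays` and `reinterpret(reshape, ...)`:

```julia
using StaticArrays

A = randn(3, 10)  # contiguous 3×10 matrix, one draw per column

# Zero-copy view of the same memory as a 10-element vector of SVector{3} draws
v = reinterpret(reshape, SVector{3,Float64}, A)
v[1]  # the first draw, backed by the same memory as A[:, 1]
```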
Offhand I see a few considerations worth... considering.
For scalar distributions, there are big advantages in MeasureTheory over Distributions for log-density computations. But for random sampling, this is less clear. For now, a lot of this goes through `distproxy` and just reuses code from Distributions. Would this require rebuilding or copying all of that?
As for the multivariate case, I think we can have the contiguous memory benefits of the Distributions approach without the weird semantics. In TupleVectors, you can do
```julia
julia> x = TupleVectors.chainvec(randn(3), 10)
10-element ArraysOfArrays.ArrayOfSimilarArrays{Float64, 1, 1, 2, ElasticArrays.ElasticMatrix{Float64, Vector{Float64}}}:
 [0.2543955996319798, 0.5657592060457689, -1.4845945938531642]
 [0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0]
 [0.0, 0.0, 0.0]

julia> x.data
3×10 ElasticArrays.ElasticMatrix{Float64, Vector{Float64}}:
  0.254396  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
  0.565759  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 -1.48459   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
```
So in terms of the memory layout it's the same as Distributions, but we can still do e.g.
```julia
julia> x[3]
3-element view(::ElasticArrays.ElasticMatrix{Float64, Vector{Float64}}, :, 3) with eltype Float64:
 0.0
 0.0
 0.0
```
This also allows for preallocation, meaning we need low-level functionality for `rand!`.
@mschauer by svector do you mean `StaticArrays.SVector`? I think we should allow for these, but default to `SizedVector` when the size is known statically. As I understand it, `SVector`s are internally represented as tuples, so you get the compile-time benefits of this but also the cost. This isn't settled yet, so of course we can discuss more.
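A quick sketch of the distinction (both types are from `StaticArrays`):

```julia
using StaticArrays

sv = SVector(1.0, 2.0, 3.0)    # tuple-backed and immutable: great for the compiler,
                               # but converting a heap-allocated Vector into one has a cost

v  = [1.0, 2.0, 3.0]
sz = SizedVector{3}(v)         # array-backed storage plus static size information
```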
> For scalar distributions, there are big advantages in MeasureTheory over Distributions for log-density computations. But for random sampling, this is less clear. For now, a lot of this goes through `distproxy` and just reuses code from Distributions. Would this require rebuilding or copying all of that?
Probably not. Distributions.jl implements samplers for at least some distributions that can benefit from it, so for those distributions at least, we could just call their samplers. See https://github.com/JuliaStats/Distributions.jl/tree/master/src/samplers
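A hedged sketch of that delegation, reusing the `distproxy` mapping mentioned above (`rand_iid` is a hypothetical helper, not MeasureTheory's API):

```julia
using Distributions, Random

# Delegate to Distributions.jl: `distproxy(μ)` is assumed to map a measure to
# its Distributions.jl counterpart, whose optimized sampler we can then reuse.
function rand_iid(rng::AbstractRNG, μ, n::Integer)
    d = distproxy(μ)
    spl = Distributions.sampler(d)   # picks an efficient sampler where one exists
    return [rand(rng, spl) for _ in 1:n]
end
```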
> As for the multivariate case, I think we can have the contiguous memory benefits of the Distributions approach without the weird semantics. In TupleVectors, you can do [...]
How can we do something like this without making TupleVectors a dependency?
> This also allows for preallocation, meaning we need low-level functionality for `rand!`.
Great! We were discussing having that in #129 anyways.
The challenge that we have relative to Distributions is the potential for combinatorial explosion. For example, we might have 3 parameterizations of a distribution, each having its own most efficient way to draw samples. Now if each of those needs 3 `rand!` methods to support drawing one draw, drawing a vector of draws, or populating an array whose rows are draws, this quickly becomes unwieldy. So it would be good to have a robust set of default methods that call each other, so that ideally one only needs to implement a single method (but could implement more for greater performance).
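A minimal sketch of that layering, with a hypothetical measure type and one draw per column (the natural layout in Julia; all names are illustrative):

```julia
using Random

# Hypothetical multivariate measure, for illustration only
struct MvShifted
    shift::Vector{Float64}
end
Base.length(d::MvShifted) = length(d.shift)

# The single method a measure has to implement: fill `x` with one draw.
function Random.rand!(rng::AbstractRNG, d::MvShifted, x::AbstractVector)
    randn!(rng, x)
    x .+= d.shift
    return x
end

# Default: a single draw allocates a buffer and delegates to the method above.
Random.rand(rng::AbstractRNG, d::MvShifted) =
    rand!(rng, d, Vector{Float64}(undef, length(d)))

# Default: fill a matrix in place, one draw per column, via non-allocating views.
function Random.rand!(rng::AbstractRNG, d::MvShifted, X::AbstractMatrix)
    foreach(x -> rand!(rng, d, x), eachcol(X))
    return X
end
```

With defaults like these, a new parameterization implements only the single-draw `rand!` and inherits the rest, but can still override any layer for performance.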
> Probably not. Distributions.jl implements samplers for at least some distributions that can benefit from it, so for those distributions at least, we could just call their samplers. See https://github.com/JuliaStats/Distributions.jl/tree/master/src/samplers
:+1:
> How can we do something like this without making TupleVectors a dependency?
The `chainvec` code is here: https://github.com/cscherrer/TupleVectors.jl/blob/main/src/chainvec.jl

It's no longer specific to chains, so the method should be renamed anyway. `TupleVectors` is needed if we want vectors of tuples. But if we just want arrays of arrays, we'd only need `ElasticArrays` and `ArraysOfArrays`.
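For instance, something like this reproduces the array-of-arrays behavior shown earlier using only those two packages (a sketch, not the actual `chainvec` implementation; `drawsvec` is a made-up name):

```julia
using ElasticArrays, ArraysOfArrays

function drawsvec(x::AbstractVector, n::Integer)
    data = ElasticArray{eltype(x)}(undef, length(x), n)  # contiguous, growable along the last dim
    fill!(data, zero(eltype(x)))
    data[:, 1] .= x             # store the first draw; remaining columns get filled later
    return nestedview(data)     # vector-of-vectors view over the flat backing matrix
end
```

That matches the `x.data` layout shown above: one flat elastic matrix with one draw per column.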
> The challenge that we have relative to Distributions is the potential for combinatorial explosion. For example, we might have 3 parameterizations of a distribution, each having its own most efficient way to draw samples. Now if each of those needs 3 `rand!` methods to support drawing one draw, drawing a vector of draws, or populating an array whose rows are draws, this quickly becomes unwieldy. So it would be good to have a robust set of default methods that call each other, so that ideally one only needs to implement a single method (but could implement more for greater performance).
Right, there might be one parameterization that's most efficient for random sampling, and we could have the `Sampler` convert to that. Also, affine transformations will cover lots of cases, just as you only need `randn` without the mean and std, or `rand` without the upper and lower bounds. More generally, anything defined using transformations can be sampled by applying the transformation to each sample of the untransformed version.
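A tiny sketch of that idea (all names here are made up for illustration):

```julia
using Random

struct StdNormal end                 # base measure: only this one needs a sampler
sample(rng::AbstractRNG, ::StdNormal) = randn(rng)

struct Affine{M}                     # pushforward x -> μ + σ * x of a parent measure
    μ::Float64
    σ::Float64
    parent::M
end
sample(rng::AbstractRNG, d::Affine) = d.μ + d.σ * sample(rng, d.parent)

# A Normal(2, 3) sampler falls out of the transformation; no dedicated code needed.
sample(Random.default_rng(), Affine(2.0, 3.0, StdNormal()))
```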
I really dislike the current state of this: sampling should just provide an iterator, and collecting the samples or writing them into a container should be a problem of the caller, not the callee. But we have no nice idiom for iterating over vectors and writing them into a matrix or a vector of svectors efficiently, even though that has nothing to do with sampling, only with collection.
I think the core of the problem here is that, even though sampling and collection are separate concerns, there are no common idioms for getting both at once.
The idea of iterating over vectors and having the caller aggregate them sounds good at first, but unless it is done carefully, each `Vector` created would allocate.
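For example (a sketch of the allocation concern, not a proposed API):

```julia
using Random

rng = Random.default_rng()

# Iterator of draws: every element is a freshly allocated Vector.
draws = (randn(rng, 3) for _ in 1:10)
collect(draws)                    # 10 separate small allocations

# Preallocated flat storage: one allocation up front, filled in place.
X = Matrix{Float64}(undef, 3, 10)
randn!(rng, X)
```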
The ideal solution to this would probably be a purely functional interface for representing array computations. LazyArrays seems close to this, but we'll need nondeterminism with data sharing, and I'm not sure whether LazyArrays can handle that. We'll also need to be able to hand-tune things in some cases, e.g. with LoopVectorization, and I'm not sure about doing that with LazyArrays. Anyway, there's the usual risk of unknowns, and also that it could become a bigger project than MeasureTheory itself.
So to be less ambitious, but to have easier control over what we're doing, the best approach I can see is to do all array sampling in terms of `rand!`. The caller (which might be `rand`) creates an array and passes it to `rand!`, which fills it. Along the way it might require more calls to `rand!` using views of the original array.
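Combined with the array-of-arrays layout from above, that recursion could look something like this (a sketch; the inner call is the single-draw `rand!` a measure implements):

```julia
using Random, ArraysOfArrays

# Fill a vector-of-views container in place: each inner `x` is a view into the
# flat backing matrix, so the caller allocates exactly once up front.
function Random.rand!(rng::AbstractRNG, d, xs::VectorOfSimilarVectors)
    foreach(x -> Random.rand!(rng, d, x), xs)
    return xs
end
```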
All of this gets weird if you have, say, a vector of named tuples, and one of the entries in the named tuple is an array. By default those "inner" arrays would all be separately allocated. TupleVectors helps with this using the array-of-arrays representation. So if you have a TupleVector where one element is `(a = 1, b = [2, 3], c = [4 5; 6 7])`, internally it will just be a vector (for the `a` values across the vector), a matrix (for the `b` values), and a 3-array (for the `c` values). Then you can again allocate just once and fill with `rand!`.
@mschauer does this address your concerns? It's not clear to me what "I really dislike the current state of this" refers to, since I think Distributions etc. has the same problem you describe.
Yeah, that was general frustration, even more directed at Base than at Distributions.jl.