Open ElOceanografo opened 1 week ago
Thanks for raising this. I agree that the return value of mode estimation could and should be improved. ComponentArrays would be a good candidate. There are some other elements of the return value that we would also like to change, mentioned in #2232.
Putting this on our todo list, though unsure when we'll get to it.
If there's interest in the meantime, I'd be willing to make a PR for this feature. It doesn't look like it requires many code changes and would be a huge increase in usability, at least for me...
The issue with ComponentArrays.jl is that it effectively takes a `NamedTuple` approach. This has the benefit that it's very fast to do something like `x_opt.x` (plus you can infer the type it returns, so you can do more interesting things than just returning a `Vector{Float64}` without ruining type inference), but it's really not the ideal thing to do when you have tons of variables. As an example, if I wrote the following model:
```julia
@model function demo()
    x = Vector(undef, 100)
    for i in eachindex(x)
        x[i] ~ Normal()
    end
end
```
If you "naively" convert the variables here as they occur in the model, `x[1]`, `x[2]`, etc., you'll end up with a "namedtuple" representation with 100 entries, which is really not great. NamedArrays.jl does however NOT do this; instead it uses a less performant but more "flexible" approach on the indexing side, since it just uses an array of strings.
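To make the concern concrete, here is a minimal sketch (plain Julia, not Turing.jl internals) of what a naive per-element `NamedTuple` representation would look like:

```julia
# Hypothetical sketch: one NamedTuple field per scalar variable.
# All 100 field names end up encoded in the *type* of `nt`.
nt = NamedTuple(Symbol("x[$i]") => 0.0 for i in 1:100)

length(nt)            # 100 entries
nt[Symbol("x[1]")]    # field access is fast and type-inferable...
# ...but every method compiled against typeof(nt) must specialize on a
# type carrying all 100 names, which scales poorly with model size.
# NamedArrays.jl instead keeps the names as runtime data (an array of
# strings/symbols), trading inferability for compile-time scalability.
```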
In short, they each have respective pros and cons, and we should probably use a "customized" solution for this.

An alternative route, @mhauru, is to maybe just use `VarNamedVector` for this.
I'm not sure I quite understand--the whole point would be to get that array `x` out as an array, not as 100 individual variables named `x[1]`, `x[2]`, etc.
Are you saying that in a model like this,
```julia
@model function demo()
    a ~ Normal()
    b ~ Normal()
    ...
    z ~ Normal()
    a1 ~ Normal()
    ...
    z1 ~ Normal()
    ...
    a100 ~ Normal()
    ...
    z100 ~ Normal()
end
```
where there are many, many individually named variables (2,600 in this case), the performance of a `NamedTuple` with entries for all of them will suffer?
Ultimately, it doesn't actually matter whether this hypothetical return structure is a `ComponentArray` or something else, as long as it doesn't require the user to do the kind of annoying and error-prone manual extraction I showed in my first example.
> where there are many, many individually named variables (2,600 in this case) the performance of a `NamedTuple` with entries for all of them will suffer?
Yep.
But a more likely scenario is the model I mentioned above, i.e.
```julia
@model function demo()
    x = Vector(undef, 100)
    for i in eachindex(x)
        x[i] ~ Normal()
    end
end
```
The problem is that `x[i]` is the only thing that is "seen" by Turing.jl. Turing.jl doesn't "care" about the fact that `x` is a `Vector`; for example, the following model would result in exactly the same variables `x[1], x[2], ...` in the chain:
```julia
@model function demo()
    x = Matrix(undef, 1, 100)
    for i in eachindex(x)
        x[i] ~ Normal()
    end
end
```
because `Matrix` also supports linear indexing.
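This equivalence is just standard Julia linear indexing, e.g.:

```julia
# A Vector and a 1×100 Matrix answer the same linear indices, which is
# all the `x[i] ~ Normal()` statements above ever use.
v = collect(1.0:100.0)
m = reshape(copy(v), 1, 100)

v[37] == m[37] == m[1, 37]   # m[37] is a linear index into the Matrix
```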
That is, Turing.jl only ever sees the individual `~` statements, never `x` as a whole.

Buuuut we can make this experience quite a bit less painful, which I very much agree with you @ElOceanografo is something we should do. And one way to do this is to use our custom structure `VarNamedVector`, which was recently introduced internally, to store the values instead. That should make it less awkward to interact with this thing; plus, cases such as `x ~ filldist(Normal(), 2, 2)` would result in a structure where you could do `opt[@varname(x)]` and get back a `Matrix` :)
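For concreteness, a hypothetical sketch of the interaction being proposed (none of this is current Turing.jl API; `opt` stands for the future VarNamedVector-backed result object):

```julia
using Turing

@model function demo_mat()
    x ~ filldist(Normal(), 2, 2)
end

# Hypothetical future usage, not implemented at the time of this thread:
# opt = <mode estimation of demo_mat() returning the proposed structure>
# opt[@varname(x)]   # would return the full 2×2 Matrix in one piece,
#                    # rather than scalars x[1,1], x[2,1], x[1,2], x[2,2]
```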
Thanks, that example clarifies the issue.

Even getting everything back as a flat vector would be an improvement (and I bet would cover most use cases; vectors are more common in stats models than higher-dimensional arrays). If `VarNamedVector` lets us get arrays back in their original shape, so much the better.
My only other comment is that this interface, whatever it ends up being, should ideally be consistent with the interface for MCMC chains. In fact, a `group` method for `ModeResult`s is pretty easy:
```julia
function Turing.group(opt::Turing.Optimisation.ModeResult, varname)
    basenames = first.(split.(string.(names(opt.values)[1]), "["))
    idx = findall(x -> x == string(varname), basenames)
    return opt.values[idx]
end
```
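The name-splitting logic inside `group` can be exercised on plain symbols, without a fitted model (a standalone sketch of the same mechanics):

```julia
# Mimic what names(opt.values)[1] yields for three x-entries plus one
# scalar variable, then split off the base names before the "[".
syms = [Symbol("x[1]"), Symbol("x[2]"), Symbol("x[3]"), :sigma]

basenames = first.(split.(string.(syms), "["))  # ["x", "x", "x", "sigma"]
idx = findall(==(string(:x)), basenames)        # indices of the x-entries
```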
I was also using this function in a script recently:

```julia
function named2component(v::Turing.NamedArrays.NamedVector)
    symbols = names(v)[1]                            # e.g. [Symbol("x[1]"), Symbol("x[2]"), ...]
    strings = first.(split.(string.(symbols), "["))  # base names: ["x", "x", ...]
    uniquestrings = unique(strings)
    nt = NamedTuple([Symbol(s) => vec(v[strings .== s]) for s in uniquestrings])
    return ComponentArray(nt)
end
```
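Hedged usage sketch, assuming `opt` is a `ModeResult` and ComponentArrays.jl is loaded (names are illustrative only):

```julia
# Illustrative, assuming `opt` came from a Turing optimization call:
# using ComponentArrays
# ca = named2component(opt.values)
# ca.x                    # all x[...] entries as one flat Vector
# reshape(ca.x, 20, 10)   # restore a known shape by hand if needed
```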
These will both flatten 2D or higher-dimensional arrays, of course. Still, might be useful to other people reading this...
When doing an MLE or MAP optimization, the resulting `ModeResult.values` is a `NamedArray`, from NamedArrays.jl. If the model contains any array-valued parameters, this can be a real pain to work with: each individual element is indexed with its own unique symbol, requiring the user to do some annoying data munging to get them all together (for instance, to extract a matrix called `x` that is supposed to have size 20 x 10).

Maybe there's a better way to do this that I don't know about, but wouldn't it be simpler to return the results as a `ComponentArray`, a data structure designed for just this use case? That would let you simply access the whole array at once. Is there any compelling reason to prefer a `NamedArray` here?
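The code snippets from this post did not survive in this copy; a hypothetical reconstruction of both the manual extraction and the proposed `ComponentArray` version (variable and field names assumed):

```julia
# Hypothetical sketch of the "annoying data munging" with a NamedArray:
# pulling a 20×10 matrix `x` out of per-element entries x[1,1], x[2,1], ...
# vals = result.values  # NamedArray from a ModeResult
# x = [vals[Symbol("x[$i,$j]")] for i in 1:20, j in 1:10]

# With a ComponentArray return value it would collapse to:
# x = result.values.x   # already a 20×10 Matrix
```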