TuringLang / Turing.jl

Bayesian inference with probabilistic programming.
https://turinglang.org
MIT License

`ComponentArray` instead of `NamedArray` for mode estimation results? #2363

Open ElOceanografo opened 1 week ago

ElOceanografo commented 1 week ago

When doing an MLE or MAP optimization, the resulting ModeResult.values is a NamedArray, from NamedArrays.jl. If the model contains any array-valued parameters, this can be a real pain to work with: each individual element is indexed with its own unique symbol, requiring the user to do some annoying data munging to get them all together. For instance, to extract a matrix called x that is supposed to have size 20 x 10:

...
opt = maximum_likelihood(mymodel)
# pick out every element whose name starts with "x[" and restore the original shape by hand
x_opt = reshape(opt.values[first.(split.(string.(names(opt.values)[1]), "[")) .== "x"], 20, 10)
...

Maybe there's a better way to do this that I don't know about, but wouldn't it be simpler to return the results as a ComponentArray, a data structure designed for just this use case? That would let you simply do:

x_opt = opt.values.x

Is there any compelling reason to prefer a NamedArray here?
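For context, here is a minimal standalone sketch of the ComponentArrays.jl behaviour I mean (the numbers are made up, no Turing involved):

using ComponentArrays

# A ComponentArray stores all values in one flat vector, but remembers the
# shape of each named block.
ca = ComponentArray(x = reshape(collect(1.0:200.0), 20, 10), σ = 0.5)

ca.x        # a 20×10 array, no manual index munging needed
ca.σ        # 0.5
length(ca)  # 201, it still behaves like a flat vector for optimizers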

mhauru commented 1 week ago

Thanks for raising this. I agree that the return value of mode estimation could and should be improved. ComponentArrays would be a good candidate. There are some other elements of the return value that we would also like to change, mentioned in #2232.

Putting this on our todo list, though unsure when we'll get to it.

ElOceanografo commented 6 days ago

If there's interest in the meantime, I'd be willing to make a PR for this feature. It doesn't look like it requires many code changes and would be a huge increase in usability, at least for me...

torfjelde commented 5 days ago

The issue with ComponentArrays.jl is that it effectively takes a NamedTuple approach. This has the benefit that it's very fast to do something like x_opt.x (+ you can infer the type it returns, so you can do more interesting things than just returning Vector{Float64} without ruining type inference), but it's really not the ideal thing to do when you have tons of variables. As an example, consider the following model:

@model function demo()
    x = Vector(undef, 100)
    for i in eachindex(x)
        x[i] ~ Normal()
    end
end

If you "naively" convert the variables here as they occur in the model, x[1], x[2], etc., you'll end up with a "namedtuple" representation with 100 entries, which is really not great. NamedArrays.jl does however NOT do this, but instead uses a less performant but more "flexible" on the index side since it just uses an array of strings.

In short, they each have respective pros and cons, and we should probably use a "customized" solution for this.

An alternative route, @mhauru, is to maybe just use VarNamedVector for this.

ElOceanografo commented 4 days ago

I'm not sure I quite understand--the whole point would be to get that array x out as an array, not as 100 individual variables named x[1], x[2], etc.

Are you saying that in a model like this,

@model function demo()
    a ~ Normal()
    b ~ Normal()
    ...
    z ~ Normal()
    a1 ~ Normal()
    ...
    z1 ~ Normal()
    ...
    a100 ~ Normal()
    ...
    z100 ~ Normal()
end

where there are many, many individually named variables (2,600 in this case), the performance of a NamedTuple with entries for all of them will suffer?

Ultimately, it doesn't actually matter whether this hypothetical return structure is a ComponentArray or something else, as long as it doesn't require the user to do the kind of annoying and error-prone manual extraction I showed in my first example.

torfjelde commented 3 days ago

where there are many, many individually named variables (2,600 in this case) the performance of a NamedTuple with entries for all of them will suffer?

Yep.

But a more likely scenario is the model I mentioned above, i.e.

@model function demo()
    x = Vector(undef, 100)
    for i in eachindex(x)
        x[i] ~ Normal()
    end
end

The problem is that x[i] is the only thing that is "seen" by Turing.jl. Turing.jl doesn't "care" about the fact that x is a Vector; for example, the following model would result in exactly the same variables x[1], x[2], ... in the chain:

@model function demo()
    x = Matrix(undef, 1, 100)
    for i in eachindex(x)
        x[i] ~ Normal()
    end
end

because Matrix also supports linear indexing.
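A quick way to see this with mode estimation (a sketch; demo() is either of the two models above):

# Either version of demo() above yields the same flat parameter names; the
# container's shape (Vector vs Matrix) is never recorded.
opt = maximum_likelihood(demo())
names(opt.values)[1]   # x[1], x[2], ..., x[100] in both cases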

That is,

  1. Turing.jl doesn't know the "original" shape of the variable; Turing.jl only sees the result of the expression on the left-hand side of the ~ statement.
  2. Using ComponentArrays.jl wouldn't fix this, as you would still have to figure out the reshaping yourself.

Buuuut we can make this experience quite a bit less painful, which I very much agree with you @ElOceanografo is something we should do. One way to do this is to use our custom structure VarNamedVector, which was recently introduced internally, to store the values instead. That should make it less awkward to interact with this thing, and cases such as x ~ filldist(Normal(), 2, 2) would result in a structure where you could do opt[@varname(x)] and get back a Matrix :)
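To sketch what that could look like (to be clear, this is the proposed behaviour, not current Turing.jl API, and demo_filldist is just an illustrative name):

@model function demo_filldist()
    x ~ filldist(Normal(), 2, 2)
end

opt = maximum_likelihood(demo_filldist())
opt[@varname(x)]   # would return a 2×2 Matrix under this proposal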

ElOceanografo commented 3 days ago

Thanks, that example clarifies the issue.

Even getting everything back as a flat vector would be an improvement (and I bet it would cover most use cases; vectors are more common in stats models than matrices or higher-dimensional arrays). If VarNamedVector lets us get arrays back in their original shape, so much the better.

My only other comment is that this interface, whatever it ends up being, should ideally be consistent with the interface for MCMC chains. In fact, a group method for ModeResults is pretty easy:

function Turing.group(opt::Turing.Optimisation.ModeResult, varname)
    # strip the "[...]" index part to recover each parameter's base name
    basenames = first.(split.(string.(names(opt.values)[1]), "["))
    # select every element belonging to `varname`, in storage order
    idx = findall(x -> x == string(varname), basenames)
    return opt.values[idx]
end
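Usage would then look something like this (assuming the 20 × 10 matrix x from my first post; the shape still has to be supplied by hand):

opt = maximum_likelihood(mymodel)
x_flat = Turing.group(opt, :x)     # all x entries as a flat NamedVector
x_opt = reshape(x_flat, 20, 10)    # the original shape must still be supplied manually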

I was also using this function in a script recently:

using ComponentArrays

function named2component(v::Turing.NamedArrays.NamedVector)
    symbols = names(v)[1]
    # base name of each entry, e.g. "x[3]" -> "x"
    strings = first.(split.(string.(symbols), "["))
    uniquestrings = unique(strings)
    # gather all entries sharing a base name into one flat vector per name
    nt = NamedTuple([Symbol(s) => vec(v[strings .== s]) for s in uniquestrings])
    return ComponentArray(nt)
end
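For example, with the same setup as in my first post:

opt = maximum_likelihood(mymodel)
ca = named2component(opt.values)
x_opt = reshape(ca.x, 20, 10)   # ca.x is flat, so the reshape is still manual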

These will both flatten 2D or higher-dimensional arrays, of course. Still, might be useful to other people reading this...