Closed: sethaxen closed this issue 1 year ago.
Hi Seth (@sethaxen), I'll add that. Initially maybe leave the current `:dimarray` version of DimensionalData in StanSample and add a new option `read_samples(model, :inferencedata)`.

Have been looking at PosteriorDB the last couple of days, in particular the current read functions, but also what it would take to have, say, a "Statistical Rethinking" and "Regression and Other Stories" database.

How stable is PosteriorDB's API? Are you planning to register it? Mainly asking because I work mostly in Pluto, where it is easier to use registered packages. Admittedly, that also means frequent updates.
> Hi Seth (@sethaxen), I'll add that. Initially maybe leave the current `:dimarray` version of DimensionalData in StanSample and add a new option `read_samples(model, :inferencedata)`.
Yes, I think it makes sense to leave `:dimarray` in. When you do add `:inferencedata`, I'd be happy to review the PR.
> Are you planning to register [PosteriorDB]? Mainly asking because I work mostly in Pluto where it is easier to use registered packages. Admittedly, that also means frequent updates.
PosteriorDB's registration went through this morning, so you should be able to use it now in Pluto. What do you mean about frequent updates?
> How stable is PosteriorDB's API?
I wouldn't call it stable right now, for a few reasons. If I had to guess, most future changes will be 1-to-1 renamings of functions and expansion of the API around loading model implementations (currently `implementation` and `implementation_names`). But user input would be useful in making these decisions, so I'd appreciate any suggestions you have. I'd like v0.2 to be a close-to-final API.
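For orientation, here is a minimal sketch of how the current PosteriorDB.jl API reads. The function names (`database`, `posterior`, `model`, `dataset`, `implementation`, `load`) follow the package as discussed here, but the exact signatures and the posterior name are assumptions to check against the package docs:

```julia
using PosteriorDB

pdb = PosteriorDB.database()  # the bundled posteriordb database
post = PosteriorDB.posterior(pdb, "eight_schools-eight_schools_centered")

mod = PosteriorDB.model(post)
PosteriorDB.implementation_names(mod)           # frameworks with an implementation
impl = PosteriorDB.implementation(mod, "stan")  # the Stan implementation
code = PosteriorDB.load(impl)                   # model code as a string

data = PosteriorDB.load(PosteriorDB.dataset(post))  # the observed data
```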
> Have been looking at PosteriorDB the last couple of days, in particular the current read functions, but also what it would take to have say a "Statistical Rethinking" and "Regression and Other Stories" database.
The great thing is that if you can construct a database of models that follows the posteriordb schema, then not only can PosteriorDB.jl work with it just fine, but so can the corresponding R and Python packages, so you potentially have a larger base of users and contributors. Then you could have a Julia package that just reexports PosteriorDB.jl but handles downloading of that database as an artifact (similar to what PosteriorDB does) and a convenience function for loading it. I'd imagine a number of those models are either in posteriordb already or might be welcome additions to it. In terms of conforming to the posteriordb spec, currently only the R posteriordb package has functionality for adding to the database. I may add API functions for doing this to PosteriorDB.jl as well, but this is currently low priority.
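Such a wrapper package could look roughly like this. This is entirely hypothetical: the module name, the `"rethinking_db"` artifact (which would need an `Artifacts.toml` entry), and whether `PosteriorDB.database` accepts a custom path are all assumptions to verify against the package docs:

```julia
# Hypothetical wrapper module; assumes PosteriorDB.database accepts a path
# to a directory following the posteriordb schema.
module RethinkingPosteriors

using Artifacts     # requires an Artifacts.toml declaring "rethinking_db"
using PosteriorDB

# Convenience loader for the custom database, downloaded lazily as an artifact.
database() = PosteriorDB.database(artifact"rethinking_db")

end
```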
Absolutely, before merging the inferencedata branch into StanSample master I'll list you as a reviewer.

On the frequent updates: I've just found that when using Pluto I merge into master asap, which is typically more often than I used to.
I agree, `read_samples(m::SampleModel, :inferencedata)` for a single group is trivial.

I wonder if InferenceObjects has a method to add multiple groups? E.g. from your link to the PyStan examples they use `az.from_cmdstan` in one of the examples:
```python
# Let's use .stan and .csv files created/saved by the CmdStanPy procedure
# glob string
posterior_glob = "sample_data/eight_school-*-[0-9].csv"
# list of paths
# posterior_list = [
#     "sample_data/eight_school-*-1.csv",
#     "sample_data/eight_school-*-2.csv",
#     "sample_data/eight_school-*-3.csv",
#     "sample_data/eight_school-*-4.csv",
# ]
obs_data_path = "./eight_school.data.R"

cmdstan_data = az.from_cmdstan(
    posterior=posterior_glob,
    posterior_predictive="y_hat",
    observed_data=obs_data_path,
    observed_data_var="y",
    log_likelihood="log_lik",
    coords={"school": np.arange(eight_school_data["J"])},
    dims={
        "theta": ["school"],
        "y": ["school"],
        "log_lik": ["school"],
        "y_hat": ["school"],
        "theta_tilde": ["school"],
    },
)
cmdstan_data
```
In the eight schools example I would have:

```julia
julia> keys(stan_nts)
(:mu, :theta_tilde, :y_hat, :theta, :tau, :log_lik)
```
and then could use NamedTupleTools to split this object, e.g.:

```julia
@assert success(rc)
stan_nts = read_samples(m_schools, :namedtuples)
keys(stan_nts)  # => (:mu, :theta_tilde, :y_hat, :theta, :tau, :log_lik)
post = NamedTupleTools.select(stan_nts, (:mu, :theta, :theta_tilde, :tau))
loglik = NamedTupleTools.select(stan_nts, (:log_lik,))
pred = NamedTupleTools.select(stan_nts, (:y_hat,))

# the InferenceObjects part
idata = convert_to_inference_data(post)
```
but then expect I would need an object update method like:

```julia
convert_to_inference_data!(idata, loglik; group=:log_likelihood)  # Group name taken from the InferenceObjects SCHEMA
```
Definitely would prefer to pass in a single NT and separately specify which variables form groups, and the dimensions within groups.

Update: Guess `from_namedtuple()` would allow this and return an InferenceData object? I'll play around with this.
Another thought I had is about the name of the method. Currently `read_samples(m, :namedtuples)` is lower level than what is represented in an InferenceObjects object. I wonder if `convert(InferenceObjects, m::SampleModel; ...)` wouldn't be a better name. The `...` would specify groups, dims, warmups, etc. with reasonably chosen defaults.
In the branch `inferencedata` of StanSample I have created a combined InferenceData object containing:

```
InferenceData with groups:
  > posterior
  > posterior_predictive
  > log_likelihood
  > sample_stats
```
Using `from_namedtuple()` works well.

The code is in `./test/test_inferencedata/test_inferencedata.jl`.

The next step is including warmups and then turning it into a proper function/method.
> I wonder if InferenceObjects has a method to add multiple groups? E.g. from your link to the PyStan examples they use `az.from_cmdstan` in one of the examples:
>
> Update: Guess `from_namedtuple()` would allow this and return an InferenceData object? I'll play around with this.
Yes, `from_namedtuple` does this, e.g.:

```julia
julia> idata = from_namedtuple(stan_nts; posterior_predictive=:y_hat, log_likelihood=:log_lik)
InferenceData with groups:
  > posterior
  > posterior_predictive
  > log_likelihood

julia> keys(idata.posterior)
(:mu, :theta_tilde, :theta, :tau)

julia> keys(idata.posterior_predictive)
(:y_hat,)

julia> keys(idata.log_likelihood)
(:log_lik,)
```
In this case, what you'd really want is to rename `:y_hat` and `:log_lik` to `:y` within their groups, which would usually be supported by the syntax of a special `from_XXX` method. But actually, I think we could expand the syntax to allow either of the following:

```julia
idata = from_namedtuple(stan_nts; posterior_predictive=:y_hat=>:y, log_likelihood=:log_lik=>:y)
idata = from_namedtuple(stan_nts; posterior_predictive=(y=:y_hat,), log_likelihood=(y=:log_lik,))
```
The former is inspired by DataFrames' selection syntax.
> Another thought I had is about the name of the method. Currently `read_samples(m, :namedtuples)` is lower level than what is represented in an InferenceObjects object. I wonder if `convert(InferenceObjects, m::SampleModel; ...)` wouldn't be a better name. The `...` would specify groups and dims, warmups, etc with reasonably chosen defaults.
Yes, actually, this is what we generally do for converting different types. So each type that is convertible to an `InferenceData` usually has a `from_XXX` method (e.g. `from_mcmcchains`, `from_namedtuple`) that has keyword arguments to fine-tune how the type is converted. Then `convert_to_inference_data` is overloaded for that type with suitable options for keyword arguments. Because one's group data might be in different formats (e.g. in Turing, observed data could be a `NamedTuple`, while posterior predictions are a `Dict` and the posterior is an `MCMCChains.Chains`), when passing to `convert_to_inference_data` one really can't control in fine detail how each group is converted, which is why `from_XXX` methods exist. There's a generic method `convert(InferenceData, obj::Any; kwargs...)` that forwards to `convert_to_inference_data`.
In terms of design, I would like to make `from_XXX` methods obsolete. Also, since conversion isn't really what's going on here, I'm considering renaming `convert_to_inference_data` to just `inference_data`. I would like devs to just be able to overload `inference_data` for their type, but I haven't landed on a design that doesn't have more problems than the current one.
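A minimal sketch of the overload pattern being considered. Note that `inference_data` is only a proposed rename and does not exist in any release; the body is a guess at how a package like StanSample might hook in:

```julia
# Hypothetical: `inference_data` is a *proposed* rename of
# `convert_to_inference_data`; nothing here is released API.
import InferenceObjects

function InferenceObjects.inference_data(m::SampleModel; kwargs...)
    # Read the draws in the package's preferred intermediate format,
    # then hand off to the generic conversion machinery.
    nts = read_samples(m, :namedtuples)
    return convert_to_inference_data(nts; kwargs...)
end
```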
> but then expect I would need an object update method like:
> `convert_to_inference_data!(idata, loglik; group=:log_likelihood) # Group name taken from the InferenceObjects SCHEMA`
Currently `InferenceData` and `Dataset` use `NamedTuple` storage, so they are immutable, but I'm working on a PR to give them `Dict` storage. Then one would use `Base.merge!` or `setproperty!` to add to an existing `InferenceData`.
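Once that PR lands, adding a group could look like the sketch below. This is hedged: `Dict` storage is not yet released, so neither call works on current versions (`namedtuple_to_dataset` is the existing InferenceObjects helper):

```julia
# Hypothetical, pending Dict-backed storage in InferenceObjects:
idata = from_namedtuple(post)

# either mutate a group via setproperty! ...
idata.log_likelihood = namedtuple_to_dataset(loglik)

# ... or merge another InferenceData in place
merge!(idata, from_namedtuple(; log_likelihood=loglik))
```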
> In the branch `inferencedata` of StanSample I have created a combined InferenceData object containing: […] The code is in `./test/test_inferencedata/test_inferencedata.jl`.
Great! I'll take a look in the next few days. If you open a draft PR, I can comment directly on the code.
> The next step is including warmups and then turn it into a proper function/method.
Currently `from_namedtuple` is still missing some of the warm-up groups, so I'll get those added to make this task easier.

EDIT: issue opened at https://github.com/arviz-devs/InferenceObjects.jl/issues/28
Thanks for your comments. The main parts shown below work. I added a few other options as a comment.
```julia
stan_nts = read_samples(m_schools, :namedtuples; include_internals=true)
keys(stan_nts) |> display
# (:treedepth__, :theta_tilde, :energy__, :y_hat, :divergent__, :accept_stat__,
#  :n_leapfrog__, :mu, :lp__, :stepsize__, :tau, :theta, :log_lik)

function select_nt_ranges(nt::NamedTuple, ranges=[1:1000, 1001:2000])
    dct = convert(Dict, nt)
    dct1 = Dict{Symbol, Any}()
    for key in keys(dct)
        dct1[key] = dct[key][ranges[1]]
    end
    nt1 = namedtuple(dct1)
    dct2 = Dict{Symbol, Any}()
    for key in keys(dct)
        dct2[key] = dct[key][ranges[2]]
    end
    nt2 = namedtuple(dct2)
    [nt1, nt2]
end

post_warmup, post = select_nt_ranges(stan_nts) # Use "default" ranges from SampleModel
#y_hat_warmup, y_hat = select_nt_ranges(NamedTupleTools.select(stan_nts, (:y_hat,)))
#log_lik_warmup, log_lik = select_nt_ranges(NamedTupleTools.select(stan_nts, (:log_lik,)))
#internals_warmup, internals_nts = select_nt_ranges(NamedTupleTools.select(stan_nts,
#    (:treedepth__, :energy__, :divergent__, :accept_stat__, :n_leapfrog__, :lp__, :stepsize__)))

idata = from_namedtuple(
    post;
    posterior_predictive = (:y_hat,),
    log_likelihood = (:log_lik,),
    sample_stats = (:lp__, :treedepth__, :stepsize__, :n_leapfrog__, :energy__, :divergent__, :accept_stat__),
    #warmup_posterior = post_warmup
)

# What would be ideal:
#=
idata = from_namedtuple(
    stan_nts;
    posterior = (keys=(:mu, :theta, :theta_tilde, :tau), range=1001:2000),
    posterior_predictive = (keys=(:y_hat => :y,), range=1001:2000),
    log_likelihood = (keys=(:log_lik => :y,), range=1001:2000),
    sample_stats = (
        keys=(:lp__, :treedepth__, :stepsize__, :n_leapfrog__, :energy__, :divergent__, :accept_stat__),
        range=1001:2000),
    #warmup_posterior = (keys=(:mu, :theta, :theta_tilde, :tau), range=1:2000),
    # etc.
)
=#
# With a Dict-based InferenceData object a similar result is possible with above `select_nt_ranges()` calls.

println()
idata |> display
println()
idata.posterior |> display
println()
idata.posterior_predictive |> display
println()
idata.log_likelihood |> display
println()
idata.sample_stats |> display
```
Hi Seth, I need to study DimensionalData further. The above ends up with the wrong dimensions in InferenceData and also drops most of log_lik. I'll dig a bit deeper the next couple of days. Simply calling `convert_to_inference_data(stan_nts)` comes close, but without proper groups.
@goedman looks great so far! Yeah, I see that the dimensions are wrong. This seems to be due to how `select_nt_ranges` is selecting, e.g.:

```julia
julia> stan_nts.theta_tilde |> size
(8, 2000, 4)

julia> post.theta_tilde |> size
(1000,)
```

So the function is discarding the chains dimension and the leading dimension(s). In InferenceData, for sample-based groups, a vector is interpreted as a single draw with many chains. This should probably change (we recently changed the default dimension ordering in InferenceObjects to be more Julian), hence why you're getting a single `draw` but 1000 `chain`s.
This can absolutely be worked around without DimensionalData; ideally no DimensionalData knowledge is needed to work with InferenceObjects; knowing it just unlocks more functionality. Since you know the draws are indexed starting at 1, you can do this to separate the posterior draws from the warmup:
```julia
julia> idata2 = let
           idata_warmup = idata[draw=1:1000]
           idata_postwarmup = idata[draw=1001:2000]
           idata_warmup_rename = InferenceData(NamedTuple(Symbol("warmup_$k") => idata_warmup[k] for k in keys(idata_warmup)))
           merge(idata_postwarmup, idata_warmup_rename)
       end
InferenceData with groups:
  > posterior
  > posterior_predictive
  > log_likelihood
  > sample_stats
  > warmup_posterior
  > warmup_posterior_predictive
  > warmup_sample_stats
  > warmup_log_likelihood
```
```julia
julia> idata2.posterior
Dataset with dimensions:
  Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
  Dim{:draw} Sampled{Int64} 1001:2000 ForwardOrdered Regular Points,
  Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
  Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
  :theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
  :mu          Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :tau         Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :theta       Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
  "created_at" => "2022-11-03T14:46:48.145"

julia> idata2.warmup_posterior
Dataset with dimensions:
  Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
  Dim{:draw} Sampled{Int64} 1:1000 ForwardOrdered Regular Points,
  Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
  Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
  :theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
  :mu          Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :tau          Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :theta       Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
  "created_at" => "2022-11-03T14:46:48.145"
```
I would suggest resetting the `draw` indices post-warmup to count from 1. This probably requires using the DimensionalData API, and I'll look up how to do that.
One more note is that the InferenceData schema does have some rules about what certain sampling statistics should be named. This list includes all of the usual Stan sample statistics, so to completely comply with the spec, these parameters should be renamed. ArviZ.jl already defines this map: https://github.com/arviz-devs/ArviZ.jl/blob/1f642377ec01b9f5ef4d6ebd164604f65edf79de/src/mcmcchains.jl#L9-L17, so you could do:
```julia
julia> stan_key_map = (
           n_leapfrog__=:n_steps,
           treedepth__=:tree_depth,
           energy__=:energy,
           lp__=:lp,
           stepsize__=:step_size,
           divergent__=:diverging,
           accept_stat__=:acceptance_rate,
       );

julia> sample_stats_rekey = InferenceObjects.Dataset((; (stan_key_map[k] => idata.sample_stats[k] for k in keys(idata.sample_stats))...));

julia> idata2 = merge(idata, InferenceData(; sample_stats=sample_stats_rekey))
InferenceData with groups:
  > posterior
  > posterior_predictive
  > log_likelihood
  > sample_stats

julia> idata2.sample_stats
Dataset with dimensions:
  Dim{:draw} Sampled{Int64} Base.OneTo(2000) ForwardOrdered Regular Points,
  Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points
and 7 layers:
  :lp              Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :tree_depth      Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :step_size       Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :n_steps         Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :energy          Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :diverging       Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :acceptance_rate Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
```
A few remaining things you could do would be:

- Add `observed_data` and `constant_data` groups. This would be in the data files, but you would need the user to specify which variables go where by name, e.g. in `arviz.from_cmdstan` users can provide `observed_data_var` or `constant_data_var`. Similar options could be available for log-likelihood, but the question is whether you want to support these keyword arguments in `read_samples` or not.
- Provide an option to include warm-up draws. `arviz` uses `save_warmup`, but `include_warmup` may be more appropriate here.
- Set `inference_library` to `"Stan"` (or `"StanSample"`, or whatever you think is most descriptive), and even set a version number `inference_library_version`. Perhaps there's other metadata that the user might find interesting.

> ```julia
> # What would be ideal:
> idata = from_namedtuple(
>     stan_nts;
>     posterior = (keys=(:mu, :theta, :theta_tilde, :tau), range=1001:2000),
>     posterior_predictive = (keys=(:y_hat => :y,), range=1001:2000),
>     log_likelihood = (keys=(:log_lik => :y,), range=1001:2000),
>     sample_stats = (
>         keys=(:lp__, :treedepth__, :stepsize__, :n_leapfrog__, :energy__, :divergent__, :accept_stat__),
>         range=1001:2000),
>     #warmup_posterior = (keys=(:mu, :theta, :theta_tilde, :tau), range=1:2000),
>     # etc.
> )
> ```
I'd prefer foregoing the specialized `keys` and `range` keywords, since these could collide if the user has variables named `keys` and `range`. But I do think we can provide a syntax accepting a tuple of `Symbol`s to select variables for those groups.

Alternatively, the syntax in https://github.com/StanJulia/StanSample.jl/issues/60#issuecomment-1297770614 would capture this as well as allow for renaming of variables:

```julia
idata = from_namedtuple(stan_nts; posterior_predictive=:y_hat=>:y, log_likelihood=:log_lik=>:y)
idata = from_namedtuple(stan_nts; posterior_predictive=(y=:y_hat,), log_likelihood=(y=:log_lik,))
```
Do you have a preference for either of these syntaxes?
> I would suggest resetting the `draw` indices post-warmup to count from 1. This probably requires using the DimensionalData API, and I'll look up how to do that.
Here's how to do this:
```julia
julia> using DimensionalData

julia> idata3 = InferenceData(map(NamedTuple(idata2)) do ds
           DimensionalData.set(ds; draw=axes(ds, :draw))
       end)
InferenceData with groups:
  > posterior
  > posterior_predictive
  > log_likelihood
  > sample_stats
  > warmup_posterior
  > warmup_posterior_predictive
  > warmup_sample_stats
  > warmup_log_likelihood

julia> idata3.posterior
Dataset with dimensions:
  Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
  Dim{:draw} Sampled{Int64} Base.OneTo(1000) ForwardOrdered Regular Points,
  Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
  Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
  :theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
  :mu          Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :tau         Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :theta       Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
  "created_at" => "2022-11-03T19:29:57.427"

julia> idata3.warmup_posterior
Dataset with dimensions:
  Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
  Dim{:draw} Sampled{Int64} Base.OneTo(1000) ForwardOrdered Regular Points,
  Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
  Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
  :theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
  :mu          Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :tau         Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :theta       Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
  "created_at" => "2022-11-03T19:29:57.427"
```
Oh, also, it might be good to allow users to specify `dims` and `coords`, but again, only if `read_samples` accepts keyword arguments.
@goedman I took some time tonight to rethink the conversion pipeline to `InferenceData`: https://github.com/arviz-devs/InferenceObjects.jl/issues/32.
With this pipeline, you would implement `inferencedata(::SampleModel; kwargs...)`, and then `read_samples(model::SampleModel, :inferencedata)` would dispatch to this method. You wouldn't need to worry about `dims`, `coords`, or subsetting variables for different groups. If the user needs those things, they would call `inferencedata` instead of `read_samples`, and they would provide the desired `dims`, `coords`, and subsetting.
Hi @sethaxen

Implemented your first 2 suggestions, and that works great! By default, if a model contains the warmup samples, `read_samples(m::SampleModel, :inferencedata)` will split the warmup samples from the posterior draws.

I'll use the renaming option to rename y_hat to y and log_lik to y as you suggested earlier, and also look into the posterior indices (to start from 1).

I can easily add kwargs to `read_samples()` if that is easier/more flexible for passing arguments to `inferencedata(...)`.
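One way that forwarding could look. This is a hedged sketch, not StanSample's actual internal dispatch, and the keyword name `posterior_predictive_symbol` in the call site mirrors the `inferencedata` draft discussed in this thread:

```julia
# Hypothetical: read_samples forwards unrecognized kwargs to inferencedata.
function read_samples(m::SampleModel, output_format::Symbol; kwargs...)
    output_format == :inferencedata && return inferencedata(m; kwargs...)
    # ... existing handling of :namedtuples, :dimarray, etc. ...
end

# Call site:
idata = read_samples(m, :inferencedata; posterior_predictive_symbol=:y_hat)
```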
Hi @sethaxen

Just pushed a very first attempt at `inferencedata()`. In the corresponding `test_inferencedata()` for now I derive the `include_warmup` setting from the SampleModel. But in principle I would tend to include what the SampleModel has produced, as the work is done already.

Are there any rules/recommendations on what should be stored in inferencedata? In addition to warmup values, many models might not have y_hat and/or log_lik generated in generated_quantities.

Currently `read_samples()` by default drops internals, but that is currently overridden in `inferencedata()`.

For y_hat and log_lik I think I can use merge as you showed above (and as used in the current version of `inferencedata()` for warmup) before separating out the warmup sections.
On your question above, I think I prefer:

```julia
idata = from_namedtuple(stan_nts; posterior_predictive=:y_hat=>:y, log_likelihood=:log_lik=>:y)
```

but for now I've used your setup for remapping keys.
> Just pushed a very first attempt at `inferencedata()`.
Great! Is that at #61? If so I'll take a closer look.
> Just pushed a very first attempt at `inferencedata()`. In the corresponding `test_inferencedata()` for now I derive the `include_warmup` settings from the SampleModel. But in principle I would tend to include what the SampleModel has produced as the work is done already.
That makes sense! The best argument I can think of for not loading warmup is that for models with many parameters, this might double the memory requirements of loading the draws: even though the work is done, until loading, the draws are stored on disk. Also, only in rare cases will users need to inspect the warm-up draws for diagnostic purposes.
> Are there any rules/recommendations on what should be stored in inferencedata? In addition to warmup values, many models might not have y_hat and/or log_lik generated in generated_quantities.
The InferenceData spec gives some instructions. The non-MCMC groups are `observed_data`, `constant_data`, and `predictions_constant_data`. In general, in Stan these 3 would comprise subsets of the data that the user would need to specify. The MCMC groups are `posterior`, `sample_stats`, and the downstream groups `posterior_predictive`, `log_likelihood`, and `predictions`. In Stan, these would all be either sampled parameters or generated quantities (which I think Stan lumps together in the outputs?), so the user would likewise need to specify which of the sampled parameters should go in which group. If the user provides no such parameter names, then everything would go into `posterior`.
There are also `prior` and `prior_predictive`, but Stan doesn't care if you're drawing from the prior or posterior, so I think for now it makes sense to assume everything is part of the posterior or these downstream groups; the user can then rename the `posterior` to `prior` if they know their model was drawing from the prior with MCMC. In the rewrite of the pipeline I'm working on, this should be easier.
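With current releases, that renaming can be done by rebuilding the object by hand. A hedged sketch (the keyword constructor and the schema group names `prior`, `prior_predictive`, and `sample_stats_prior` come from this thread and the InferenceData spec; the rest is assumed):

```julia
# Hedged sketch: relabel posterior groups as prior groups by rebuilding
# the InferenceData; assumes `idata` has these three posterior-side groups.
idata_prior = InferenceData(;
    prior = idata.posterior,
    prior_predictive = idata.posterior_predictive,
    sample_stats_prior = idata.sample_stats,
)
```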
On my flight back from Amsterdam to Colorado I made several more changes to Inferencedata.jl in StanSample/utils. All in #61 on the InferenceData branch. I'll go through the InferenceData spec today.
For now, I have removed PosteriorDB from the StanSample.jl inference data branch. I think it belongs on the level of the Stan.jl package and the StatisticalRethinking and RegressionAndOtherStories projects. And I would love to use it there.
Thanks for the feedback @sethaxen !
I pushed a couple of updates to the inferencedata branch that use your suggestions. They furthermore simplify the handling of the names used for the posterior_predictive and log_likelihood groups. This way I don't think we need to use remap on the inferencedata object.
The new inferencedata function looks like:
function inferencedata(m::SampleModel;
include_warmup = m.save_warmup,
log_likelihood_symbol::Union{Nothing, Symbol} = :log_lik,
posterior_predictive_symbol::Union{Nothing, Symbol} = :y_hat,
kwargs...)
# Read in the draws as a NamedTuple with
stan_nts = read_samples(m, :namedtuples; include_internals=true)
# Define the "proper" ArviZ names for the sample statistics group.
sample_stats_key_map = (
n_leapfrog__=:n_steps,
treedepth__=:tree_depth,
energy__=:energy,
lp__=:lp,
stepsize__=:step_size,
divergent__=:diverging,
accept_stat__=:acceptance_rate,
);
# If a log_likelihood_symbol is defined (!= nothing), remove it from the future posterior group
if !isnothing(log_likelihood_symbol)
sample_nts = NamedTuple{filter(∉([log_likelihood_symbol]), keys(stan_nts))}(stan_nts)
end
# If a posterior_predictive_symbol is defined (!= nothing), remove it from the future posterior group
if !isnothing(posterior_predictive_symbol)
sample_nts = NamedTuple{filter(∉([posterior_predictive_symbol]), keys(sample_nts))}(sample_nts)
end
# `sample_nts` now holds remaining parameters and the sample statistics
# Split in 2 separate NamedTuples: posterior_nts and sample_stats_nts
posterior_nts = NamedTuple{filter(∉(keys(sample_stats_key_map)), keys(sample_nts))}(sample_nts)
sample_stats_nts = NamedTuple{filter(∈(keys(sample_stats_key_map)), keys(sample_nts))}(sample_nts)
# Remap the names according to above sample_stats_key_map
sample_stats_nts_rekey =
NamedTuple{map(Base.Fix1(getproperty, sample_stats_key_map), keys(sample_stats_nts))}(
values(sample_stats_nts))
# Create initial inferencedata object with 2 groups
idata = from_namedtuple(sample_nts; sample_stats=sample_stats_nts_rekey)
# Merge log_likelihood and posterior_predictive groups into idata
if posterior_predictive_symbol in keys(stan_nts)
nt = (y = stan_nts[posterior_predictive_symbol],)
idata = merge(idata, from_namedtuple(nt; posterior_predictive = (:y,)))
end
if log_likelihood_symbol in keys(stan_nts)
nt = (y = stan_nts[log_likelihood_symbol],)
idata = merge(idata, from_namedtuple(nt; log_likelihood = (:y,)))
end
# Extract warmup values in separate groups
if include_warmup
idata = let
idata_warmup = idata[draw=1:1000]
idata_postwarmup = idata[draw=1001:2000]
idata_warmup_rename = InferenceData(NamedTuple(Symbol("warmup_$k") => idata_warmup[k] for k in
keys(idata_warmup)))
merge(idata_postwarmup, idata_warmup_rename)
end
end
# TO DO: update the indexing
# TO DO: add other groups (data, etc.)
return idata
end
Hi Seth,

Think I'm doing something wrong here. After creating the above InferenceData object, I'm trying to add the observed_data:

```julia
nt = (sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0], J = 8, y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
```
From the InferenceObjects code I see `convert`, `convert_to_inference_data` and `from_namedtuple`. I tried several of these functions, e.g.:

```julia
ds = namedtuple_to_dataset(data; kwargs...)
convert_to_inference_data(ds; group)
```

or

```julia
from_namedtuple(nt; observed_data=keys(nt))
```

All of these fail, often with a stack overflow error. But I haven't figured out how to do it correctly.
Hi @goedman, it seems you found an InferenceObjects bug. When https://github.com/arviz-devs/InferenceObjects.jl/pull/36 is finished, it should fix this.
Great. I think I did try replacing `..., J = 4, ...` by `..., J = [4], ...`, but I tried many things.

Is my line of thought correct? I.e. `merge(idata, from_namedtuple(nt; observed_data=keys(nt)))` will add the observed_data group (from nt) to an existing `idata` InferenceData object?

Edit: I think when I updated it to [4] it complained about a different length of the draw or chain dimension. I read somewhere that check is not done for observed_data and constant_data.
> Is my line of thought correct? I.e. `merge(idata, from_namedtuple(nt; observed_data=keys(nt)))` will add the observed_data group (from nt) to an existing `idata` InferenceData object?
It's better to pass the observed data directly, e.g. using `from_namedtuple(; observed_data=nt)`. The first argument of `from_namedtuple` corresponds to the posterior, so by passing `nt` in first, you're informing the function that it should expect `chain` and `draw` dimensions on all passed parameters, which is why it raises a dimension error. Passing keys is only safe for parameters derived from the prior or posterior.
> I read somewhere that check is not done for observed_data and constant_data.
That's right, there are 3 groups that are special-cased (also IIRC `predictions_constant_data`) to not assume `chain` and `draw` dimensions.
Hi Seth

> It's better to pass the observed data directly, e.g. using `from_namedtuple(; observed_data=nt)`. The first argument of `from_namedtuple` corresponds to the posterior, so by passing `nt` in first, you're informing the function that it should expect `chain` and `draw` dimensions on all passed parameters, which is why it raises a dimension error. Passing keys is only safe for parameters derived from the prior or posterior.
Aah, I'd missed that. With a workaround for issue #36:

```julia
nt = namedtuple(data)
# Until InferenceObjects issue #36 is merged
ntu = (sigma=nt.sigma, J=[4], y=nt.y)
idata = merge(idata, from_namedtuple(; observed_data = ntu))
```

works fine.
I’ll merge the branch inferencedata into master. I’m sure we’ll make more changes over the coming weeks but I would like to have the current version available in Pluto to see how that works out. Earlier a quick test showed DimensionalData displays well.
Rob
Hi Seth,

I created a slightly modified version of inferencedata (`inferencedata2()` for now). This corrects the indices for all draws to `1:model.num_samples`. I wasn't able to get this done at the DimensionalData level, but it is easily achieved by temporarily creating Dicts instead of NamedTuples:
function inferencedata2(m::SampleModel;
    include_warmup = m.save_warmup,
    log_likelihood_symbol::Union{Nothing, Symbol} = :log_lik,
    posterior_predictive_symbol::Union{Nothing, Symbol} = :y_hat,
    kwargs...)

    # Read in the draws as a NamedTuple with sample_stats included
    stan_nts = read_samples(m, :namedtuples; include_internals=true)

    # Convert to a Dict and split into draws and warmup Dicts.
    # When creating the new Dicts, update sample_stats names.
    dicts = convert(Dict, stan_nts)
    draw_dict = Dict{Symbol, Any}()
    warmup_dict = Dict{Symbol, Any}()
    if include_warmup
        for key in keys(dicts)
            if ndims(dicts[key]) == 1
                warmup_dict[arviz_names(key)] = dicts[key][1:m.num_warmups]
                draw_dict[arviz_names(key)] = dicts[key][(m.num_warmups+1):end]
            elseif ndims(dicts[key]) == 2
                warmup_dict[arviz_names(key)] = dicts[key][1:m.num_warmups, :]
                draw_dict[arviz_names(key)] = dicts[key][(m.num_warmups+1):end, :]
            elseif ndims(dicts[key]) == 3
                warmup_dict[arviz_names(key)] = dicts[key][:, 1:m.num_warmups, :]
                draw_dict[arviz_names(key)] = dicts[key][:, (m.num_warmups+1):end, :]
            end
        end
    end
    draw_nts = namedtuple(draw_dict)
    warmup_nts = namedtuple(warmup_dict)
    @assert keys(draw_nts) == keys(warmup_nts)

    # If a log_likelihood_symbol is defined, remove it from the future posterior groups
    if !isnothing(log_likelihood_symbol)
        sample_nts = NamedTuple{filter(∉([log_likelihood_symbol]), keys(draw_nts))}(draw_nts)
        warm_nts = NamedTuple{filter(∉([log_likelihood_symbol]), keys(warmup_nts))}(warmup_nts)
    else
        sample_nts = draw_nts
        warm_nts = warmup_nts
    end

    # If a posterior_predictive_symbol is defined, remove it from the future posterior group
    if !isnothing(posterior_predictive_symbol)
        sample_nts = NamedTuple{filter(∉([posterior_predictive_symbol]), keys(sample_nts))}(sample_nts)
        warm_nts = NamedTuple{filter(∉([posterior_predictive_symbol]), keys(warm_nts))}(warm_nts)
    end

    # `sample_nts` and `warm_nts` now hold the remaining parameters and the sample statistics.
    # ArviZ names for the sample statistics group:
    # remove them from the posterior groups and store them in the sample_stats groups.
    sample_stats_keys = (:n_steps, :tree_depth, :energy, :lp, :step_size, :diverging, :acceptance_rate)

    # Split both draws and warmup into 2 separate NamedTuples: posterior_nts and sample_stats_nts
    posterior_nts = NamedTuple{filter(∉(sample_stats_keys), keys(sample_nts))}(sample_nts)
    warmup_posterior_nts = NamedTuple{filter(∉(sample_stats_keys), keys(warm_nts))}(warm_nts)
    sample_stats_nts = NamedTuple{filter(∈(sample_stats_keys), keys(sample_nts))}(sample_nts)
    warmup_sample_stats_nts = NamedTuple{filter(∈(sample_stats_keys), keys(warm_nts))}(warm_nts)

    # Create initial InferenceData object with 2 groups (posterior and sample_stats)
    idata = from_namedtuple(posterior_nts; sample_stats=sample_stats_nts, kwargs...)

    # Merge both log_likelihood and posterior_predictive groups into idata if present.
    # Note that the log_likelihood and posterior_predictive NamedTuples are obtained from
    # draw_nts and warmup_nts directly and in the process renamed to :y.
    if !isnothing(posterior_predictive_symbol) && posterior_predictive_symbol in keys(stan_nts)
        nt = (y = draw_nts[posterior_predictive_symbol],)
        idata = merge(idata, from_namedtuple(nt; posterior_predictive = (:y,)))
    end
    if !isnothing(log_likelihood_symbol) && log_likelihood_symbol in keys(stan_nts)
        nt = (y = draw_nts[log_likelihood_symbol],)
        idata = merge(idata, from_namedtuple(nt; log_likelihood = (:y,)))
    end

    # Add warmup groups if so desired
    if include_warmup
        # Create initial warmup InferenceData object with 2 groups
        idata_warmup = from_namedtuple(
            warmup_posterior_nts;
            sample_stats=warmup_sample_stats_nts,
            kwargs...)

        # Merge both log_likelihood and posterior_predictive groups into idata_warmup if present
        if !isnothing(posterior_predictive_symbol) && posterior_predictive_symbol in keys(stan_nts)
            nt = (y = warmup_nts[posterior_predictive_symbol],)
            idata_warmup = merge(idata_warmup, from_namedtuple(; posterior_predictive = nt, kwargs...))
        end
        if !isnothing(log_likelihood_symbol) && log_likelihood_symbol in keys(stan_nts)
            nt = (y = warmup_nts[log_likelihood_symbol],)
            idata_warmup = merge(idata_warmup, from_namedtuple(; log_likelihood = nt, kwargs...))
        end
        idata_warmup_rename = InferenceData(NamedTuple(Symbol("warmup_$k") => idata_warmup[k] for k in keys(idata_warmup)))
        idata = merge(idata, idata_warmup_rename)
    end
    return idata
end
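The dimension-dependent warmup/draw split in the loop above can be illustrated in isolation. This is a sketch in plain Julia (no Stan run needed); split_warmup is a hypothetical helper name, and the axis conventions (iteration-first for 1-D/2-D arrays, (param, iteration, chain) for 3-D arrays) mirror the indexing used in the loop:

```julia
# Hypothetical helper mirroring the warmup/draw split above:
# the position of the iteration axis depends on the array's dimensionality.
function split_warmup(x::AbstractArray, num_warmups::Int)
    if ndims(x) == 1
        return x[1:num_warmups], x[(num_warmups+1):end]
    elseif ndims(x) == 2
        return x[1:num_warmups, :], x[(num_warmups+1):end, :]
    elseif ndims(x) == 3
        return x[:, 1:num_warmups, :], x[:, (num_warmups+1):end, :]
    else
        error("unsupported dimensionality: $(ndims(x))")
    end
end
```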
Currently I'm testing this version in Pluto notebooks.
Hi @goedman, sorry, I missed this comment. I'll take a look at the latest implementation tonight and open a PR with any potential improvements I see.
Hi Seth (@sethaxen), the current Dict-based version (inferencedata3() for now) is intended to be built on a Dict-based read_samples(). That way I can separate warmup and draws directly while reading in the samples.
Hi Seth (@sethaxen)
Sorry for the late reply but your updates and v0.2.6 seem to work great including scalar values in the data. Got distracted by https://github.com/roualdes/bridgestan/issues/62#issue-1488505948.
I don't think https://github.com/arviz-devs/InferenceObjects.jl/pull/40#issue-1467747122 has been merged yet.
Sorry for the late reply but your updates and v0.2.6 seem to work great including scalar values in the data.
No problem, and great!
I don't think arviz-devs/InferenceObjects.jl#40 (comment) has been merged yet.
It's now been merged. RE whether to use permutedims or PermutedDimsArray to adopt the new dimension ordering, I recommend permutedims (allocates a permuted copy), since downstream operations such as diagnostics should in general be faster if the actual memory layout corresponds to the new dimension ordering.
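For reference, the copy-vs-view distinction can be seen with base Julia alone (both names are standard Base):

```julia
A = reshape(collect(1:6), 2, 3)

B = permutedims(A, (2, 1))        # eager: allocates a copy with permuted memory layout
V = PermutedDimsArray(A, (2, 1))  # lazy: a view over A's original memory

# Both present the same values in the new dimension order...
@assert B == V && size(B) == (3, 2)

# ...but only permutedims changes the actual layout; the view still
# reflects mutations of (and traverses memory in the order of) A.
A[1, 1] = 99
@assert V[1, 1] == 99   # view sees the mutation
@assert B[1, 1] == 1    # copy does not
```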
Hi Seth (@sethaxen) Works fine, just merged in StanSample.jl v6.13.8, example notebook will be in Stan v9.10.5.
Hi Seth (@sethaxen) Works fine, just merged in StanSample.jl v6.13.8, example notebook will be in Stan v9.10.5.
Great! I'll open a PR in the next few days to support extracting observed_data
and constant_data
, and then I think we can close this issue!
InferenceObjects.jl implements the Dataset and InferenceData containers for outputs of Bayesian inference. These are the storage types used by ArviZ.jl (Python ArviZ also implements an InferenceData using the same spec, which is used by PyMC), and there's discussion about ultimately using InferenceData as an official container for sampling results in Turing (see https://github.com/TuringLang/MCMCChains.jl/issues/381). It would be convenient if StanSample supported this as an output format: inferencedata. From the current output formats, this would be fairly straightforward. Here's what these outputs look like: Dataset is a DimensionalData.AbstractDimStack, and its variables are DimArrays.
Unlike the other output formats, InferenceData can store sampling statistics (like divergences and tree depth), data, predictions, and warmup draws, so this information could also be unpacked from stan_model into the InferenceData object to be more easily used in downstream analyses (here's an example with PyStan and the Python implementation of InferenceData: https://python.arviz.org/en/latest/getting_started/CreatingInferenceData.html#from-pystan).
There are more options that might be convenient for users (e.g. specifying dimension names and coordinates) that wouldn't fit into the read_samples interface, so it would probably still be a good idea to have a method with more options live somewhere, e.g. here, in ArviZ itself, or in a glue package (e.g. StanInferenceObjects.jl).
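For concreteness, the convenient end state from StanSample's side might look like the sketch below. This is not a final API; the symbol name and the group access shown are assumptions based on the existing read_samples output formats:

```julia
# Hypothetical usage of the proposed :inferencedata output format,
# assuming stan_model is an already-sampled SampleModel:
idata = read_samples(stan_model, :inferencedata)

idata.posterior      # Dataset of parameter draws (DimArrays)
idata.sample_stats   # divergences, tree depth, step size, ...
```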