Closed: sethaxen closed this issue 1 year ago.
Hi Seth (@sethaxen), I'll add that. Initially maybe leave the current `:dimarray` version of DimensionalData in StanSample and add a new option `read_samples(model, :inferencedata)`.

Have been looking at PosteriorDB the last couple of days, in particular the current read functions, but also what it would take to have, say, a "Statistical Rethinking" and "Regression and Other Stories" database.

How stable is PosteriorDB's API? Are you planning to register it? Mainly asking because I work mostly in Pluto, where it is easier to use registered packages. Admittedly, that also means frequent updates.
> Hi Seth (@sethaxen), I'll add that. Initially maybe leave the current `:dimarray` version of DimensionalData in StanSample and add a new option `read_samples(model, :inferencedata)`.
Yes, I think it makes sense to leave `:dimarray` in. When you do add `:inferencedata`, I'd be happy to review the PR.
> Are you planning to register [PosteriorDB]? Mainly asking because I work mostly in Pluto where it is easier to use registered packages. Admittedly, that also means frequent updates.
PosteriorDB's registration went through this morning, so you should be able to use it now in Pluto. What do you mean about frequent updates?
> How stable is PosteriorDB's API?
I wouldn't call it stable right now, for a few reasons. If I had to guess, most future changes will be 1-to-1 renamings of functions and expansion of the API around loading model implementations (currently `implementation` and `implementation_names`). But user input would be useful in making these decisions, so I'd appreciate any suggestions you have. I'd like v0.2 to be a close-to-final API.
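For orientation, here is a minimal sketch of how the current PosteriorDB.jl API reads. The function names (`database`, `posterior`, `model`, `dataset`, `implementation`, `load`) follow the package as discussed here, but the exact signatures and the posterior name are assumptions to check against the package docs:

```julia
using PosteriorDB

pdb = PosteriorDB.database()  # the bundled posteriordb database
post = PosteriorDB.posterior(pdb, "eight_schools-eight_schools_centered")

mod = PosteriorDB.model(post)
PosteriorDB.implementation_names(mod)           # frameworks with an implementation
impl = PosteriorDB.implementation(mod, "stan")  # the Stan implementation
code = PosteriorDB.load(impl)                   # model code as a string

data = PosteriorDB.load(PosteriorDB.dataset(post))  # the observed data
```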
> Have been looking at PosteriorDB the last couple of days, in particular the current read functions, but also what it would take to have say a "Statistical Rethinking" and "Regression and Other Stories" database.
The great thing is that if you can construct a database of models that follows the posteriordb schema, then not only can PosteriorDB.jl work with it just fine, but so can the corresponding R and Python packages, so you potentially have a larger base of users and contributors. Then you could have a Julia package that just reexports PosteriorDB.jl but handles downloading of that database as an artifact (similar to what PosteriorDB does) and a convenience function for loading it. I'd imagine a number of those models are either in posteriordb already or might be welcome additions to it. In terms of conforming to the posteriordb spec, currently only the R posteriordb package has functionality for adding to the database. I may add API functions for doing this to PosteriorDB.jl as well, but this is currently low priority.
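Such a wrapper package could look roughly like this. This is entirely hypothetical: the module name, the `"rethinking_db"` artifact (which would need an `Artifacts.toml` entry), and whether `PosteriorDB.database` accepts a custom path are all assumptions to verify against the package docs:

```julia
# Hypothetical wrapper module; assumes PosteriorDB.database accepts a path
# to a directory following the posteriordb schema.
module RethinkingPosteriors

using Artifacts     # requires an Artifacts.toml declaring "rethinking_db"
using PosteriorDB

# Convenience loader for the custom database, downloaded lazily as an artifact.
database() = PosteriorDB.database(artifact"rethinking_db")

end
```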
Absolutely, before merging the inferencedata branch into StanSample master I'll list you as a reviewer.

On the frequent updates: I've just found that when using Pluto I merge into master asap, which is typically more often than I used to.
I agree, `read_samples(m::SampleModel, :inferencedata)` for a single group is trivial.

I wonder if InferenceObjects has a method to add multiple groups? E.g. from your link to the PyStan examples they use `az.from_cmdstan` in one of the examples:
```python
# Let's use .stan and .csv files created/saved by the CmdStanPy procedure
# glob string
posterior_glob = "sample_data/eight_school-*-[0-9].csv"
# list of paths
# posterior_list = [
#     "sample_data/eight_school-*-1.csv",
#     "sample_data/eight_school-*-2.csv",
#     "sample_data/eight_school-*-3.csv",
#     "sample_data/eight_school-*-4.csv",
# ]
obs_data_path = "./eight_school.data.R"

cmdstan_data = az.from_cmdstan(
    posterior=posterior_glob,
    posterior_predictive="y_hat",
    observed_data=obs_data_path,
    observed_data_var="y",
    log_likelihood="log_lik",
    coords={"school": np.arange(eight_school_data["J"])},
    dims={
        "theta": ["school"],
        "y": ["school"],
        "log_lik": ["school"],
        "y_hat": ["school"],
        "theta_tilde": ["school"],
    },
)
cmdstan_data
```
In the eight schools example I would have:

```julia
julia> keys(stan_nts)
(:mu, :theta_tilde, :y_hat, :theta, :tau, :log_lik)
```
and then could use NamedTupleTools to split this object, e.g.:

```julia
@assert success(rc)
stan_nts = read_samples(m_schools, :namedtuples)
keys(stan_nts)  # => (:mu, :theta_tilde, :y_hat, :theta, :tau, :log_lik)
post = NamedTupleTools.select(stan_nts, (:mu, :theta, :theta_tilde, :tau))
loglik = NamedTupleTools.select(stan_nts, (:log_lik,))
pred = NamedTupleTools.select(stan_nts, (:y_hat,))

# the InferenceObjects part
idata = convert_to_inference_data(post)
```
but then expect I would need an object update method like:

```julia
convert_to_inference_data!(idata, loglik; group=:log_likelihood)  # Group name taken from the InferenceObjects SCHEMA
```
Definitely would prefer to pass in a single NT and separately specify which variables form groups, and the dimensions within groups.

Update: Guess `from_namedtuple()` would allow this and return an InferenceData object? I'll play around with this.
Another thought I had is about the name of the method. Currently `read_samples(m, :namedtuples)` is lower level than what is represented in an InferenceObjects object. I wonder if `convert(InferenceObjects, m::SampleModel; ...)` wouldn't be a better name. The `...` would specify groups, dims, warmups, etc. with reasonably chosen defaults.
In the branch `inferencedata` of StanSample I have created a combined InferenceData object containing:

```
InferenceData with groups:
  > posterior
  > posterior_predictive
  > log_likelihood
  > sample_stats
```
Using `from_namedtuple()` works well.

The code is in `./test/test_inferencedata/test_inferencedata.jl`.

The next step is including warmups and then turning it into a proper function/method.
> I wonder if InferenceObjects has a method to add multiple groups? E.g. from your link to the PyStan examples they use `az.from_cmdstan` in one of the examples:
>
> Update: Guess `from_namedtuple()` would allow this and return an InferenceData object? I'll play around with this.
Yes, `from_namedtuple` does this, e.g.:

```julia
julia> idata = from_namedtuple(stan_nts; posterior_predictive=:y_hat, log_likelihood=:log_lik)
InferenceData with groups:
  > posterior
  > posterior_predictive
  > log_likelihood

julia> keys(idata.posterior)
(:mu, :theta_tilde, :theta, :tau)

julia> keys(idata.posterior_predictive)
(:y_hat,)

julia> keys(idata.log_likelihood)
(:log_lik,)
```
In this case, what you'd really want is to rename `:y_hat` and `:log_lik` to `:y` within their groups, which would usually be supported by the syntax of a special `from_XXX` method. But actually, I think we could expand the syntax to allow either of the following:

```julia
idata = from_namedtuple(stan_nts; posterior_predictive=:y_hat=>:y, log_likelihood=:log_lik=>:y)
idata = from_namedtuple(stan_nts; posterior_predictive=(y=:y_hat,), log_likelihood=(y=:log_lik,))
```
The former is inspired by DataFrames' selection syntax.
> Another thought I had is about the name of the method. Currently `read_samples(m, :namedtuples)` is lower level than what is represented in an InferenceObjects object. I wonder if `convert(InferenceObjects, m::SampleModel; ...)` wouldn't be a better name. The `...` would specify groups and dims, warmups, etc with reasonably chosen defaults.
Yes, actually, this is what we generally do for converting different types. So each type that is convertible to an `InferenceData` usually has a `from_XXX` method (e.g. `from_mcmcchains`, `from_namedtuple`) that has keyword arguments to fine-tune how the type is converted. Then `convert_to_inference_data` is overloaded for that type with suitable options for keyword arguments. Because one's group data might be in different formats (e.g. in Turing, observed data could be a `NamedTuple`, while posterior predictions are a `Dict` and the posterior is an `MCMCChains.Chains`), when passing to `convert_to_inference_data` one really can't control in fine detail how each group is converted, which is why `from_XXX` methods exist. There's a generic method `convert(InferenceData, obj::Any; kwargs...)` that forwards to `convert_to_inference_data`.
In terms of design, I would like to make `from_XXX` methods obsolete. Also, since conversion isn't really what's going on here, I'm considering renaming `convert_to_inference_data` to just `inference_data`. I would like devs to just be able to overload `inference_data` for their type, but I haven't landed on a design that doesn't have more problems than the current one.
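A minimal sketch of the overload pattern being considered. Note that `inference_data` is only a proposed rename and does not exist in any release; the body is a guess at how a package like StanSample might hook in:

```julia
# Hypothetical: `inference_data` is a *proposed* rename of
# `convert_to_inference_data`; nothing here is released API.
import InferenceObjects

function InferenceObjects.inference_data(m::SampleModel; kwargs...)
    # Read the draws in the package's preferred intermediate format,
    # then hand off to the generic conversion machinery.
    nts = read_samples(m, :namedtuples)
    return convert_to_inference_data(nts; kwargs...)
end
```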
> but then expect I would need an object update method like:
> `convert_to_inference_data!(idata, loglik; group=:log_likelihood) # Group name taken from the InferenceObjects SCHEMA`
Currently `InferenceData` and `Dataset` use `NamedTuple` storage, so they are immutable, but I'm working on a PR to give them `Dict` storage. Then one would use `Base.merge!` or `setproperty!` to add to an existing `InferenceData`.
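Once that PR lands, adding a group could look like the sketch below. This is hedged: `Dict` storage is not yet released, so neither call works on current versions (`namedtuple_to_dataset` is the existing InferenceObjects helper):

```julia
# Hypothetical, pending Dict-backed storage in InferenceObjects:
idata = from_namedtuple(post)

# either mutate a group via setproperty! ...
idata.log_likelihood = namedtuple_to_dataset(loglik)

# ... or merge another InferenceData in place
merge!(idata, from_namedtuple(; log_likelihood=loglik))
```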
> In the branch `inferencedata` of StanSample I have created a combined InferenceData object containing: […] The code is in `./test/test_inferencedata/test_inferencedata.jl`.
Great! I'll take a look in the next few days. If you open a draft PR, I can comment directly on the code.
> The next step is including warmups and then turn it into a proper function/method.
Currently `from_namedtuple` is still missing some of the warm-up groups, so I'll get those added to make this task easier.

EDIT: issue opened at https://github.com/arviz-devs/InferenceObjects.jl/issues/28
Thanks for your comments. The main parts shown below work. I added a few other options as a comment.
```julia
stan_nts = read_samples(m_schools, :namedtuples; include_internals=true)
keys(stan_nts) |> display
# (:treedepth__, :theta_tilde, :energy__, :y_hat, :divergent__, :accept_stat__,
#  :n_leapfrog__, :mu, :lp__, :stepsize__, :tau, :theta, :log_lik)

function select_nt_ranges(nt::NamedTuple, ranges=[1:1000, 1001:2000])
    dct = convert(Dict, nt)
    dct1 = Dict{Symbol, Any}()
    for key in keys(dct)
        dct1[key] = dct[key][ranges[1]]
    end
    nt1 = namedtuple(dct1)
    dct2 = Dict{Symbol, Any}()
    for key in keys(dct)
        dct2[key] = dct[key][ranges[2]]
    end
    nt2 = namedtuple(dct2)
    [nt1, nt2]
end

post_warmup, post = select_nt_ranges(stan_nts) # Use "default" ranges from SampleModel
#y_hat_warmup, y_hat = select_nt_ranges(NamedTupleTools.select(stan_nts, (:y_hat,)))
#log_lik_warmup, log_lik = select_nt_ranges(NamedTupleTools.select(stan_nts, (:log_lik,)))
#internals_warmup, internals_nts = select_nt_ranges(NamedTupleTools.select(stan_nts,
#    (:treedepth__, :energy__, :divergent__, :accept_stat__, :n_leapfrog__, :lp__, :stepsize__)))

idata = from_namedtuple(
    post;
    posterior_predictive = (:y_hat,),
    log_likelihood = (:log_lik,),
    sample_stats = (:lp__, :treedepth__, :stepsize__, :n_leapfrog__, :energy__, :divergent__, :accept_stat__),
    #warmup_posterior = post_warmup
)

# What would be ideal:
#=
idata = from_namedtuple(
    stan_nts;
    posterior = (keys=(:mu, :theta, :theta_tilde, :tau), range=1001:2000),
    posterior_predictive = (keys=(:y_hat => :y,), range=1001:2000),
    log_likelihood = (keys=(:log_lik => :y,), range=1001:2000),
    sample_stats = (
        keys=(:lp__, :treedepth__, :stepsize__, :n_leapfrog__, :energy__, :divergent__, :accept_stat__),
        range=1001:2000),
    #warmup_posterior = (keys=(:mu, :theta, :theta_tilde, :tau), range=1:2000),
    # etc.
)
=#
# With a Dict-based InferenceData object a similar result is possible with above `select_nt_ranges()` calls.

println()
idata |> display
println()
idata.posterior |> display
println()
idata.posterior_predictive |> display
println()
idata.log_likelihood |> display
println()
idata.sample_stats |> display
```
Hi Seth, I need to study DimensionalData further. The above ends up with the wrong dimensions in InferenceData and also drops most of log_lik. I'll dig a bit deeper the next couple of days. Simply calling `convert_to_inference_data(stan_nts)` comes close, but without proper groups.
@goedman looks great so far! Yeah, I see that the dimensions are wrong. This seems to be due to how `select_nt_ranges` is selecting, e.g.:

```julia
julia> stan_nts.theta_tilde |> size
(8, 2000, 4)

julia> post.theta_tilde |> size
(1000,)
```

So the function is discarding the chains dimension and the leading dimension(s). In InferenceData, for sample-based groups, a vector is interpreted as a single draw with many chains. This should probably change (we recently changed the default dimension ordering in InferenceObjects to be more Julian), hence why you're getting a single `draw` but 1000 `chain`s.
This can absolutely be worked around without DimensionalData; ideally no DimensionalData knowledge is needed to work with InferenceObjects; knowing it just unlocks more functionality. Since you know the draws are indexed starting at 1, you can do this to separate the posterior draws from the warmup:
```julia
julia> idata2 = let
           idata_warmup = idata[draw=1:1000]
           idata_postwarmup = idata[draw=1001:2000]
           idata_warmup_rename = InferenceData(NamedTuple(Symbol("warmup_$k") => idata_warmup[k] for k in keys(idata_warmup)))
           merge(idata_postwarmup, idata_warmup_rename)
       end
InferenceData with groups:
  > posterior
  > posterior_predictive
  > log_likelihood
  > sample_stats
  > warmup_posterior
  > warmup_posterior_predictive
  > warmup_sample_stats
  > warmup_log_likelihood
```
```julia
julia> idata2.posterior
Dataset with dimensions:
  Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
  Dim{:draw} Sampled{Int64} 1001:2000 ForwardOrdered Regular Points,
  Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
  Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
  :theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
  :mu          Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :tau         Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :theta       Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
  "created_at" => "2022-11-03T14:46:48.145"

julia> idata2.warmup_posterior
Dataset with dimensions:
  Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
  Dim{:draw} Sampled{Int64} 1:1000 ForwardOrdered Regular Points,
  Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
  Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
  :theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
  :mu          Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :tau          Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :theta       Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
  "created_at" => "2022-11-03T14:46:48.145"
```
I would suggest resetting the `draw` indices post-warmup to count from 1. This probably requires using the DimensionalData API, and I'll look up how to do that.
One more note is that the InferenceData schema does have some rules about what certain sampling statistics should be named. This list includes all of the usual Stan sample statistics, so to completely comply with the spec, these parameters should be renamed. ArviZ.jl already defines this map: https://github.com/arviz-devs/ArviZ.jl/blob/1f642377ec01b9f5ef4d6ebd164604f65edf79de/src/mcmcchains.jl#L9-L17, so you could do:
```julia
julia> stan_key_map = (
           n_leapfrog__=:n_steps,
           treedepth__=:tree_depth,
           energy__=:energy,
           lp__=:lp,
           stepsize__=:step_size,
           divergent__=:diverging,
           accept_stat__=:acceptance_rate,
       );

julia> sample_stats_rekey = InferenceObjects.Dataset((; (stan_key_map[k] => idata.sample_stats[k] for k in keys(idata.sample_stats))...));

julia> idata2 = merge(idata, InferenceData(; sample_stats=sample_stats_rekey))
InferenceData with groups:
  > posterior
  > posterior_predictive
  > log_likelihood
  > sample_stats

julia> idata2.sample_stats
Dataset with dimensions:
  Dim{:draw} Sampled{Int64} Base.OneTo(2000) ForwardOrdered Regular Points,
  Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points
and 7 layers:
  :lp              Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :tree_depth      Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :step_size       Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :n_steps         Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :energy          Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :diverging       Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
  :acceptance_rate Float64 dims: Dim{:draw}, Dim{:chain} (2000×4)
```
A few remaining things you could do would be:

- Add `observed_data` and `constant_data` groups. This would be in the data files, but you would need the user to specify which variables go where by name, e.g. in `arviz.from_cmdstan` users can provide `observed_data_var` or `constant_data_var`. Similar options could be available for log-likelihood, but the question is whether you want to support these keyword arguments in `read_samples` or not.
- Provide an option to include warm-up draws. `arviz` uses `save_warmup`, but `include_warmup` may be more appropriate here.
- Set `inference_library` to `"Stan"` (or `"StanSample"`, or whatever you think is most descriptive), and even set a version number `inference_library_version`. Perhaps there's other metadata that the user might find interesting.

> ```julia
> # What would be ideal:
> idata = from_namedtuple(
>     stan_nts;
>     posterior = (keys=(:mu, :theta, :theta_tilde, :tau), range=1001:2000),
>     posterior_predictive = (keys=(:y_hat => :y,), range=1001:2000),
>     log_likelihood = (keys=(:log_lik => :y,), range=1001:2000),
>     sample_stats = (
>         keys=(:lp__, :treedepth__, :stepsize__, :n_leapfrog__, :energy__, :divergent__, :accept_stat__),
>         range=1001:2000),
>     #warmup_posterior = (keys=(:mu, :theta, :theta_tilde, :tau), range=1:2000),
>     # etc.
> )
> ```
I'd prefer foregoing the specialized `keys` and `range` keywords, since these could collide if the user has variables named `keys` and `range`. But I do think we can provide a syntax accepting a tuple of `Symbol`s to select variables for those groups.

Alternatively, the syntax in https://github.com/StanJulia/StanSample.jl/issues/60#issuecomment-1297770614 would capture this as well as allow for renaming of variables:

```julia
idata = from_namedtuple(stan_nts; posterior_predictive=:y_hat=>:y, log_likelihood=:log_lik=>:y)
idata = from_namedtuple(stan_nts; posterior_predictive=(y=:y_hat,), log_likelihood=(y=:log_lik,))
```
Do you have a preference for either of these syntaxes?
> I would suggest resetting the `draw` indices post-warmup to count from 1. This probably requires using the DimensionalData API, and I'll look up how to do that.
Here's how to do this:
```julia
julia> using DimensionalData

julia> idata3 = InferenceData(map(NamedTuple(idata2)) do ds
           DimensionalData.set(ds; draw=axes(ds, :draw))
       end)
InferenceData with groups:
  > posterior
  > posterior_predictive
  > log_likelihood
  > sample_stats
  > warmup_posterior
  > warmup_posterior_predictive
  > warmup_sample_stats
  > warmup_log_likelihood

julia> idata3.posterior
Dataset with dimensions:
  Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
  Dim{:draw} Sampled{Int64} Base.OneTo(1000) ForwardOrdered Regular Points,
  Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
  Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
  :theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
  :mu          Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :tau         Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :theta       Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
  "created_at" => "2022-11-03T19:29:57.427"

julia> idata3.warmup_posterior
Dataset with dimensions:
  Dim{:theta_tilde_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points,
  Dim{:draw} Sampled{Int64} Base.OneTo(1000) ForwardOrdered Regular Points,
  Dim{:chain} Sampled{Int64} Base.OneTo(4) ForwardOrdered Regular Points,
  Dim{:theta_dim_1} Sampled{Int64} Base.OneTo(8) ForwardOrdered Regular Points
and 4 layers:
  :theta_tilde Float64 dims: Dim{:theta_tilde_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
  :mu          Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :tau         Float64 dims: Dim{:draw}, Dim{:chain} (1000×4)
  :theta       Float64 dims: Dim{:theta_dim_1}, Dim{:draw}, Dim{:chain} (8×1000×4)
with metadata Dict{String, Any} with 1 entry:
  "created_at" => "2022-11-03T19:29:57.427"
```
Oh, also, it might be good to allow users to specify `dims` and `coords`, but again, only if `read_samples` accepts keyword arguments.
@goedman I took some time tonight to rethink the conversion pipeline to `InferenceData`: https://github.com/arviz-devs/InferenceObjects.jl/issues/32.
With this pipeline, you would implement `inferencedata(::SampleModel; kwargs...)`, and then `read_samples(model::SampleModel, :inferencedata)` would dispatch to this method. You wouldn't need to worry about `dims`, `coords`, or subsetting variables for different groups. If the user needs those things, they would call `inferencedata` instead of `read_samples`, and they would provide the desired `dims`, `coords`, and subsetting.
Hi @sethaxen

Implemented your first 2 suggestions, and that works great! By default, if a model contains the warmup samples, `read_samples(m::SampleModel, :inferencedata)` will split the warmup samples from the posterior draws.

I'll use the renaming option to rename y_hat to y and log_lik to y as you suggested earlier, and also look into the posterior indices (to start from 1).

I can easily add kwargs to `read_samples()` if that is easier/more flexible for passing arguments to `inferencedata(...)`.
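One way that forwarding could look. This is a hedged sketch, not StanSample's actual internal dispatch, and the keyword name `posterior_predictive_symbol` in the call site mirrors the `inferencedata` draft discussed in this thread:

```julia
# Hypothetical: read_samples forwards unrecognized kwargs to inferencedata.
function read_samples(m::SampleModel, output_format::Symbol; kwargs...)
    output_format == :inferencedata && return inferencedata(m; kwargs...)
    # ... existing handling of :namedtuples, :dimarray, etc. ...
end

# Call site:
idata = read_samples(m, :inferencedata; posterior_predictive_symbol=:y_hat)
```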
Hi @sethaxen

Just pushed a very first attempt at `inferencedata()`. In the corresponding `test_inferencedata()` for now I derive the `include_warmup` setting from the SampleModel. But in principle I would tend to include what the SampleModel has produced, as the work is done already.

Are there any rules/recommendations on what should be stored in inferencedata? In addition to warmup values, many models might not have y_hat and/or log_lik generated in generated_quantities.

Currently `read_samples()` by default drops internals, but that is currently overridden in `inferencedata()`.

For y_hat and log_lik I think I can use merge as you showed above (and as used in the current version of `inferencedata()` for warmup) before separating out the warmup sections.
On your question above, I think I prefer:

```julia
idata = from_namedtuple(stan_nts; posterior_predictive=:y_hat=>:y, log_likelihood=:log_lik=>:y)
```

but for now I've used your setup for remapping keys.
> Just pushed a very first attempt at `inferencedata()`.
Great! Is that at #61? If so I'll take a closer look.
> Just pushed a very first attempt at `inferencedata()`. In the corresponding `test_inferencedata()` for now I derive the `include_warmup` settings from the SampleModel. But in principle I would tend to include what the SampleModel has produced as the work is done already.
That makes sense! The best argument I can think of for not loading warmup is that for models with many parameters, this might double the memory requirements of loading the draws: even though the work is done, until loading, the draws are stored on disk. Also, only in rare cases will users need to inspect the warm-up draws for diagnostic purposes.
> Are there any rules/recommendations on what should be stored in inferencedata? In addition to warmup values, many models might not have y_hat and/or log_lik generated in generated_quantities.
The InferenceData spec gives some instructions. The non-MCMC groups are `observed_data`, `constant_data`, and `predictions_constant_data`. In general, in Stan these 3 would comprise subsets of the data that the user would need to specify. The MCMC groups are `posterior`, `sample_stats`, and the downstream groups `posterior_predictive`, `log_likelihood`, and `predictions`. In Stan, these would all be either sampled parameters or generated quantities (which I think Stan lumps together in the outputs?), so the user would likewise need to specify which of the sampled parameters should go in which group. If the user provides no such parameter names, then everything would go into `posterior`.
There are also `prior` and `prior_predictive`, but Stan doesn't care if you're drawing from the prior or posterior, so I think for now it makes sense to assume everything is part of the posterior or these downstream groups; the user can then rename the `posterior` to `prior` if they know their model was drawing from the prior with MCMC. In the rewrite of the pipeline I'm working on, this should be easier.
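With current releases, that renaming can be done by rebuilding the object by hand. A hedged sketch (the keyword constructor and the schema group names `prior`, `prior_predictive`, and `sample_stats_prior` come from this thread and the InferenceData spec; the rest is assumed):

```julia
# Hedged sketch: relabel posterior groups as prior groups by rebuilding
# the InferenceData; assumes `idata` has these three posterior-side groups.
idata_prior = InferenceData(;
    prior = idata.posterior,
    prior_predictive = idata.posterior_predictive,
    sample_stats_prior = idata.sample_stats,
)
```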
On my flight back from Amsterdam to Colorado I made several more changes to Inferencedata.jl in StanSample/utils. All in #61 on the InferenceData branch. I'll go through the InferenceData spec today.
For now, I have removed PosteriorDB from the StanSample.jl inference data branch. I think it belongs on the level of the Stan.jl package and the StatisticalRethinking and RegressionAndOtherStories projects. And I would love to use it there.
Thanks for the feedback @sethaxen !
I pushed a couple of updates to the inferencedata branch that use your suggestions. They furthermore simplify the handling of the names used for the posterior_predictive and log_likelihood groups. This way I don't think we need to use remap on the inferencedata object.
The new inferencedata function looks like:
function inferencedata(m::SampleModel;
include_warmup = m.save_warmup,
log_likelihood_symbol::Union{Nothing, Symbol} = :log_lik,
posterior_predictive_symbol::Union{Nothing, Symbol} = :y_hat,
kwargs...)
# Read in the draws as a NamedTuple with
stan_nts = read_samples(m, :namedtuples; include_internals=true)
# Define the "proper" ArviZ names for the sample statistics group.
sample_stats_key_map = (
n_leapfrog__=:n_steps,
treedepth__=:tree_depth,
energy__=:energy,
lp__=:lp,
stepsize__=:step_size,
divergent__=:diverging,
accept_stat__=:acceptance_rate,
);
# If a log_likelihood_symbol is defined (!= nothing), remove it from the future posterior group
if !isnothing(log_likelihood_symbol)
sample_nts = NamedTuple{filter(∉([log_likelihood_symbol]), keys(stan_nts))}(stan_nts)
end
# If a posterior_predictive_symbol is defined (!= nothing), remove it from the future posterior group
if !isnothing(posterior_predictive_symbol)
sample_nts = NamedTuple{filter(∉([posterior_predictive_symbol]), keys(sample_nts))}(sample_nts)
end
# `sample_nts` now holds remaining parameters and the sample statistics
# Split in 2 separate NamedTuples: posterior_nts and sample_stats_nts
posterior_nts = NamedTuple{filter(∉(keys(sample_stats_key_map)), keys(sample_nts))}(sample_nts)
sample_stats_nts = NamedTuple{filter(∈(keys(sample_stats_key_map)), keys(sample_nts))}(sample_nts)
# Remap the names according to above sample_stats_key_map
sample_stats_nts_rekey =
NamedTuple{map(Base.Fix1(getproperty, sample_stats_key_map), keys(sample_stats_nts))}(
values(sample_stats_nts))
# Create initial inferencedata object with 2 groups
idata = from_namedtuple(sample_nts; sample_stats=sample_stats_nts_rekey)
# Merge log_likelihood and posterior_predictive groups into idata
if posterior_predictive_symbol in keys(stan_nts)
nt = (y = stan_nts[posterior_predictive_symbol],)
idata = merge(idata, from_namedtuple(nt; posterior_predictive = (:y,)))
end
if log_likelihood_symbol in keys(stan_nts)
nt = (y = stan_nts[log_likelihood_symbol],)
idata = merge(idata, from_namedtuple(nt; log_likelihood = (:y,)))
end
# Extract warmup values in separate groups
if include_warmup
idata = let
idata_warmup = idata[draw=1:1000]
idata_postwarmup = idata[draw=1001:2000]
idata_warmup_rename = InferenceData(NamedTuple(Symbol("warmup_$k") => idata_warmup[k] for k in
keys(idata_warmup)))
merge(idata_postwarmup, idata_warmup_rename)
end
end
# TO DO: update the indexing
# TO DO: add other groups (data, etc.)
return idata
end
Hi Seth,

Think I'm doing something wrong here. After creating the above InferenceData object, I'm trying to add the observed_data:

```julia
nt = (sigma = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0], J = 8, y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
```
From the InferenceObjects code I see `convert`, `convert_to_inference_data` and `from_namedtuple`. I tried several of these functions, e.g.:

```julia
ds = namedtuple_to_dataset(data; kwargs...)
convert_to_inference_data(ds; group)
```

or

```julia
from_namedtuple(nt; observed_data=keys(nt))
```

All of these fail, often with a stack overflow error. But I haven't figured out how to do it correctly.
Hi @goedman, it seems you found an InferenceObjects bug. When https://github.com/arviz-devs/InferenceObjects.jl/pull/36 is finished, it should fix this.
Great. I think I did try replacing `..., J = 4, ...` by `..., J = [4], ...`, but I tried many things.

Is my line of thought correct? I.e. `merge(idata, from_namedtuple(nt; observed_data=keys(nt)))` will add the observed_data group (from nt) to an existing `idata` InferenceData object?

Edit: I think when I updated it to [4] it complained about a different length of the draw or chain dimension. I read somewhere that check is not done for observed_data and constant_data.
> Is my line of thought correct? I.e. `merge(idata, from_namedtuple(nt; observed_data=keys(nt)))` will add the observed_data group (from nt) to an existing `idata` InferenceData object?
It's better to pass the observed data directly, e.g. using `from_namedtuple(; observed_data=nt)`. The first argument of `from_namedtuple` corresponds to the posterior, so by passing `nt` in first, you're informing the function that it should expect `chain` and `draw` dimensions on all passed parameters, which is why it raises a dimension error. Passing keys is only safe for parameters derived from the prior or posterior.
> I read somewhere that check is not done for observed_data and constant_data.
That's right, there are 3 groups that are special-cased (also IIRC `predictions_constant_data`) to not assume `chain` and `draw` dimensions.
Hi Seth

> It's better to pass the observed data directly, e.g. using `from_namedtuple(; observed_data=nt)`. The first argument of `from_namedtuple` corresponds to the posterior, so by passing `nt` in first, you're informing the function that it should expect `chain` and `draw` dimensions on all passed parameters, which is why it raises a dimension error. Passing keys is only safe for parameters derived from the prior or posterior.
Aah, I'd missed that. With a workaround for issue #36:

```julia
nt = namedtuple(data)
# Until InferenceObjects issue #36 is merged
ntu = (sigma=nt.sigma, J=[4], y=nt.y)
idata = merge(idata, from_namedtuple(; observed_data = ntu))
```

works fine.
I’ll merge the branch inferencedata into master. I’m sure we’ll make more changes over the coming weeks but I would like to have the current version available in Pluto to see how that works out. Earlier a quick test showed DimensionalData displays well.
Rob
Hi Seth,

I created a slightly modified version of inferencedata (`inferencedata2()` for now). This corrects the indices for all draws to `1:model.num_samples`. I wasn't able to get this done at the DimensionalData level, but it is easily achieved by temporarily creating Dicts instead of NamedTuples:
function inferencedata2(m::SampleModel;
    include_warmup = m.save_warmup,
    log_likelihood_symbol::Union{Nothing, Symbol} = :log_lik,
    posterior_predictive_symbol::Union{Nothing, Symbol} = :y_hat,
    kwargs...)

    # Read in the draws as a NamedTuple with sample_stats included
    stan_nts = read_samples(m, :namedtuples; include_internals=true)

    # Convert to a Dict and split into draws and warmup Dicts.
    # When creating the new Dicts, update sample_stats names.
    dicts = convert(Dict, stan_nts)
    draw_dict = Dict{Symbol, Any}()
    warmup_dict = Dict{Symbol, Any}()
    if include_warmup
        for key in keys(dicts)
            if ndims(dicts[key]) == 1
                warmup_dict[arviz_names(key)] = dicts[key][1:m.num_warmups]
                draw_dict[arviz_names(key)] = dicts[key][(m.num_warmups+1):end]
            elseif ndims(dicts[key]) == 2
                warmup_dict[arviz_names(key)] = dicts[key][1:m.num_warmups, :]
                draw_dict[arviz_names(key)] = dicts[key][(m.num_warmups+1):end, :]
            elseif ndims(dicts[key]) == 3
                warmup_dict[arviz_names(key)] = dicts[key][:, 1:m.num_warmups, :]
                draw_dict[arviz_names(key)] = dicts[key][:, (m.num_warmups+1):end, :]
            end
        end
    end
    draw_nts = namedtuple(draw_dict)
    warmup_nts = namedtuple(warmup_dict)
    @assert keys(draw_nts) == keys(warmup_nts)

    # If a log_likelihood_symbol is defined, remove it from the future posterior groups
    if !isnothing(log_likelihood_symbol)
        sample_nts = NamedTuple{filter(∉([log_likelihood_symbol]), keys(draw_nts))}(draw_nts)
        warm_nts = NamedTuple{filter(∉([log_likelihood_symbol]), keys(warmup_nts))}(warmup_nts)
    else
        sample_nts = draw_nts
        warm_nts = warmup_nts
    end

    # If a posterior_predictive_symbol is defined, remove it from the future posterior group
    if !isnothing(posterior_predictive_symbol)
        sample_nts = NamedTuple{filter(∉([posterior_predictive_symbol]), keys(sample_nts))}(sample_nts)
        warm_nts = NamedTuple{filter(∉([posterior_predictive_symbol]), keys(warm_nts))}(warm_nts)
    end

    # `sample_nts` and `warm_nts` now hold the remaining parameters and the sample statistics.
    # ArviZ names for the sample statistics group:
    # remove them from the posterior groups and store them in the sample_stats groups.
    sample_stats_keys = (:n_steps, :tree_depth, :energy, :lp, :step_size, :diverging, :acceptance_rate)

    # Split both draws and warmup into 2 separate NamedTuples: posterior_nts and sample_stats_nts
    posterior_nts = NamedTuple{filter(∉(sample_stats_keys), keys(sample_nts))}(sample_nts)
    warmup_posterior_nts = NamedTuple{filter(∉(sample_stats_keys), keys(warm_nts))}(warm_nts)
    sample_stats_nts = NamedTuple{filter(∈(sample_stats_keys), keys(sample_nts))}(sample_nts)
    warmup_sample_stats_nts = NamedTuple{filter(∈(sample_stats_keys), keys(warm_nts))}(warm_nts)

    # Create initial InferenceData object with 2 groups (posterior and sample_stats)
    idata = from_namedtuple(posterior_nts; sample_stats=sample_stats_nts, kwargs...)

    # Merge both log_likelihood and posterior_predictive groups into idata if present.
    # Note that the log_likelihood and posterior_predictive NamedTuples are obtained from
    # draw_nts and warmup_nts directly and in the process renamed to :y.
    if !isnothing(posterior_predictive_symbol) && posterior_predictive_symbol in keys(stan_nts)
        nt = (y = draw_nts[posterior_predictive_symbol],)
        idata = merge(idata, from_namedtuple(nt; posterior_predictive = (:y,)))
    end
    if !isnothing(log_likelihood_symbol) && log_likelihood_symbol in keys(stan_nts)
        nt = (y = draw_nts[log_likelihood_symbol],)
        idata = merge(idata, from_namedtuple(nt; log_likelihood = (:y,)))
    end

    # Add warmup groups if so desired
    if include_warmup
        # Create initial warmup InferenceData object with 2 groups
        idata_warmup = from_namedtuple(
            warmup_posterior_nts;
            sample_stats=warmup_sample_stats_nts,
            kwargs...)

        # Merge both log_likelihood and posterior_predictive groups into idata_warmup if present
        if !isnothing(posterior_predictive_symbol) && posterior_predictive_symbol in keys(stan_nts)
            nt = (y = warmup_nts[posterior_predictive_symbol],)
            idata_warmup = merge(idata_warmup, from_namedtuple(; posterior_predictive = nt, kwargs...))
        end
        if !isnothing(log_likelihood_symbol) && log_likelihood_symbol in keys(stan_nts)
            nt = (y = warmup_nts[log_likelihood_symbol],)
            idata_warmup = merge(idata_warmup, from_namedtuple(; log_likelihood = nt, kwargs...))
        end
        idata_warmup_rename = InferenceData(NamedTuple(Symbol("warmup_$k") => idata_warmup[k] for k in keys(idata_warmup)))
        idata = merge(idata, idata_warmup_rename)
    end
    return idata
end
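The dimension-dependent warmup/draw split in the loop above can be illustrated in isolation. This is a sketch in plain Julia (no Stan run needed); split_warmup is a hypothetical helper name, and the axis conventions (iteration-first for 1-D/2-D arrays, (param, iteration, chain) for 3-D arrays) mirror the indexing used in the loop:

```julia
# Hypothetical helper mirroring the warmup/draw split above:
# the position of the iteration axis depends on the array's dimensionality.
function split_warmup(x::AbstractArray, num_warmups::Int)
    if ndims(x) == 1
        return x[1:num_warmups], x[(num_warmups+1):end]
    elseif ndims(x) == 2
        return x[1:num_warmups, :], x[(num_warmups+1):end, :]
    elseif ndims(x) == 3
        return x[:, 1:num_warmups, :], x[:, (num_warmups+1):end, :]
    else
        error("unsupported dimensionality: $(ndims(x))")
    end
end
```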
Currently I'm testing this version in Pluto notebooks.
Hi @goedman, sorry, I missed this comment. I'll take a look at the latest implementation tonight and open a PR with any potential improvements I see.
Hi Seth (@sethaxen), the current Dict-based version (inferencedata3() for now) is intended to be built on a Dict-based read_samples(). That way I can separate warmup and draws directly while reading in the samples.
Hi Seth (@sethaxen)
Sorry for the late reply but your updates and v0.2.6 seem to work great including scalar values in the data. Got distracted by https://github.com/roualdes/bridgestan/issues/62#issue-1488505948.
I don't think https://github.com/arviz-devs/InferenceObjects.jl/pull/40#issue-1467747122 has been merged yet.
Sorry for the late reply but your updates and v0.2.6 seem to work great including scalar values in the data.
No problem, and great!
I don't think arviz-devs/InferenceObjects.jl#40 (comment) has been merged yet.
It's now been merged. RE whether to use permutedims or PermutedDimsArray to adopt the new dimension ordering, I recommend permutedims (allocates a permuted copy), since downstream operations such as diagnostics should in general be faster if the actual memory layout corresponds to the new dimension ordering.
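For reference, the copy-vs-view distinction can be seen with base Julia alone (both names are standard Base):

```julia
A = reshape(collect(1:6), 2, 3)

B = permutedims(A, (2, 1))        # eager: allocates a copy with permuted memory layout
V = PermutedDimsArray(A, (2, 1))  # lazy: a view over A's original memory

# Both present the same values in the new dimension order...
@assert B == V && size(B) == (3, 2)

# ...but only permutedims changes the actual layout; the view still
# reflects mutations of (and traverses memory in the order of) A.
A[1, 1] = 99
@assert V[1, 1] == 99   # view sees the mutation
@assert B[1, 1] == 1    # copy does not
```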
Hi Seth (@sethaxen) Works fine, just merged in StanSample.jl v6.13.8, example notebook will be in Stan v9.10.5.
Hi Seth (@sethaxen) Works fine, just merged in StanSample.jl v6.13.8, example notebook will be in Stan v9.10.5.
Great! I'll open a PR in the next few days to support extracting observed_data
and constant_data
, and then I think we can close this issue!
InferenceObjects.jl implements the Dataset and InferenceData containers for outputs of Bayesian inference. These are the storage types used by ArviZ.jl (Python ArviZ also implements an InferenceData using the same spec, which is used by PyMC), and there's discussion about ultimately using InferenceData as an official container for sampling results in Turing (see https://github.com/TuringLang/MCMCChains.jl/issues/381). It would be convenient if StanSample supported this as an output format: inferencedata. From the current output formats, this would be fairly straightforward. Here's what these outputs look like: Dataset is a DimensionalData.AbstractDimStack, and its variables are DimArrays.
Unlike the other output formats, InferenceData can store sampling statistics (like divergences and tree depth), data, predictions, and warmup draws, so this information could also be unpacked from stan_model into the InferenceData object to be more easily used in downstream analyses (here's an example with PyStan and the Python implementation of InferenceData: https://python.arviz.org/en/latest/getting_started/CreatingInferenceData.html#from-pystan).
There are more options that might be convenient for users (e.g. specifying dimension names and coordinates) that wouldn't fit into the read_samples interface, so it would probably still be a good idea to have a method with more options live somewhere, e.g. here, in ArviZ itself, or in a glue package (e.g. StanInferenceObjects.jl).
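For concreteness, the convenient end state from StanSample's side might look like the sketch below. This is not a final API; the symbol name and the group access shown are assumptions based on the existing read_samples output formats:

```julia
# Hypothetical usage of the proposed :inferencedata output format,
# assuming stan_model is an already-sampled SampleModel:
idata = read_samples(stan_model, :inferencedata)

idata.posterior      # Dataset of parameter draws (DimArrays)
idata.sample_stats   # divergences, tree depth, step size, ...
```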