Open ParadaCarleton opened 3 years ago
@itsdfish
Thank you Carlos. I tend to agree with you that today WAIC, LOO-IC (and WBIC, I don't know this one) are the information criteria to use. I'm including Chis on this response because he specifically asked to include these.
As an interim solutions we can also include a clear warning, similar to your above input or Richard McElreath's positioning of these older ic's. From Chris' thumb up I assume he also agrees with you message
I will take a look at WBIC although right now I have a bit too much on my plate.
EDIT: Given that this package is GPL3 licensed it might still be better to develop a new, MIT licensed package directly from the descriptions of the different ic's. From the StatisticalRethinking introductory/learning point of view MIT licensing is not very important, but for general use it is.
Rob
Rob, I agree also.
I was in the process of reimplementing PSIS-LOO but ran into some difficulties. Do you know anyone who is familiar with the algorithm?
What about Aki Vehtari? Are you using "Vehtari, A., A. Gelman, and J. Gabry. 2015. Efficient implementation of leave-one-out cross-validation and WAIC for evaluating fitted Bayesian models"?
Yeah. I am using that article. My sticking point is the procedure for fitting the inverse Pareto.
Yeah. I am using that article. My sticking point is the procedure for fitting the inverse Pareto.
Funny coincidence -- I'm actually reimplementing it as well, since the R LOO package is looking at using Julia in its backend to speed up its calculations. I've got the code for fitting the inverse Pareto written up already, and I added some more comments that should make it easier to understand.
Note that the generalized Pareto distribution is like a slightly wonky monomial, and you can easily fit a monomial between any pair of points. The GPD fitting procedure takes advantage of this in a way that works a bit like the Theil-Sen estimator. You start by taking the first quartile as point A, and then for each individual point B in your sample, you fit a GPD going through A and B. This gives you a set of pointwise estimates for the shape parameter k. When this is done, you take the average of all of these pointwise estimators, weighted by the likelihood that they would produce this data set. At the end, an (optional) adjustment to k-hat is used, where you average k-hat together with a weakly informative prior; this reduces variance at the cost of a little bit of bias.
function gpdfit(x::AbstractArray, min_grid_pts::Int = 30, sort_x::Bool = true)
# x must be sorted, but we can skip if x is already sorted
if sort_x
sort!(x)
end
n = length(x)
m = min_grid_pts + isqrt(n) # isqrt = floor sqrt
prior = 3
n_0 = 10 # determines how strongly to nudge kHat towards .5
# Estimate parameters
xStar = x[(n+2) ÷ 4] # take first sample quartile
# build pointwise estimates of k and θ by using each element of the sample.
vectorθ = @. 1 / x[n] + (1 - sqrt(m / (collect(1:m) - .5))) / prior / xStar
vectorK = mean(log1p.(- vectorθ .* x'), dims = 2) # take mean of each row
logLikelihood = @. log(- vectorθ / vectorK) - vectorK - 1 # Calculate log-likelihood at each estimate
weights = @. 1 / sum(exp(logLikelihood - logLikelihood')) # Calculate weights from log-likelihood
θHat = weights ⋅ vectorθ # Take the dot product of weights and pointwise estimates of θ to get the full estimate
kHat = mean(log1p.(- θHat .* x))
σ = -kHat / θHat
# Drag towards .5 to reduce variance for small n
if wip
kHat = (kHat * n + .5 * n_0) / (n + n_0)
end
return kHat, σ
end
(Note that I haven't tested this code yet.)
@ParadaCarleton, thank you for sharing your code. Given that you know more about PSIS-LOO and related metrics than I do, it might make sense to collaborate. Our main goal was avoid the GNU license and use the MIT license instead. I know that the community has expressed interest in having these metrics, but no one until now has had the expertise to put it together.
@ParadaCarleton, thank you for sharing your code. Given that you know more about PSIS-LOO and related metrics than I do, it might make sense to collaborate. Our main goal was avoid the GNU license and use the MIT license instead. I know that the community has expressed interest in having these metrics, but no one until now has had the expertise to put it together.
I'd love to help with this -- I'm pretty sure I can get PSIS-LOO working, along with some related measures like WAIC/WSIC. I do have a bit less experience with specific implementations of MCMC samplers (especially Stan -- most of my work has been in PyMC3+Turing) and Julia itself. If I could get some kind of function that would take objects produced by Stan, Turing, or Soss and convert them into standard objects I could work with, I'd be happy to build the code that takes these and actually calculates the information criteria for them.
@ParadaCarleton, that sounds great. I will provide a simple example that can be adapted to Stan, Turing and Soss. I can post something here within the next 24 - 36 hours.
Our original plan was to develop a generic method that would operate on arrays and then add conversion methods that would interface with MCMCChains and any other data type. For example:
A generic method:
function psisloo(x::Array, ...)
...
end
A method to interface with MCMCChains
function psisloo(x::Chain, ...)
# some code to extract and format array
...
# generic method
psisloo(extracted_array)
end
With this approach, the model comparison metrics will work with any data format. We can define the most common data formats, such as MCMCChains. Others can develop custom functions as needed. Does this still sound like a good approach to you?
@ParadaCarleton, that sounds great. I will provide a simple example that can be adapted to Stan, Turing and Soss. I can post something here within the next 24 - 36 hours.
Our original plan was to develop a generic method that would operate on arrays and then add conversion methods that would interface with MCMCChains and any other data type. For example:
A generic method:
function psisloo(x::Array, ...) ... end
A method to interface with MCMCChains
function psisloo(x::Chain, ...) # some code to extract and format array ... # generic method psisloo(extracted_array) end
With this approach, the model comparison metrics will work with any data format. We can define the most common data formats, such as MCMCChains. Others can develop custom functions as needed. Does this still sound like a good approach to you?
I think this sounds good, although I would make the generic method use the AbstractMCMC API rather than just an array, and convert other objects to AbstractMCMC objects. The problem with a basic array is that it would get very confusing to keep track of all the different things.
@ParadaCarleton
I'm trying to wrap my mind about your 2 (very interesting) questions:
use the AbstractMCMC API
On the first item, in Stan you can:
chns = read_samples(m1_1s; output_format=:mcmcchains)
where m1_1s is the sampled Stan model. Typically m1_1 is the Stan language model.
This would provide a chns
object with typeof(chns) <: AbstractMCMC.AbstractChains
. Would that be enough for your approach to LOO & company?
I'm not sure if e.g. stepper functionality makes sense although maybe it is possible to come up with an mcmcsample
method.
Your second question, on the MCMCChains.jl repo, is also very interesting. My main objection to MCMCChains.jl has always been the AxisArray base. But with your suggested approach we could, once convinced the chains are proper, use Makie Layers.
Maybe this combination could result in 2 new packages BayesMakiePlots.jl and a new, MIT licensed, version of StatsModelsComparisons.jl (or a better name for the latter repo)?
Regarding the first question, I actually made a mistake -- since I wasn't familiar with the AbstractMCMC or MCMCChains APIs, I didn't realize they don't really do what I need. I'm still trying to figure out how you get the log-pdf from Stan, Turing, and Soss models after leaving one observation out, which is the main thing I need to build PSIS-LOO.
As for the second thing, that's a longer-term interest that could be discussed in another thread -- for now, I'm just trying to get PSIS-LOO working.
Would it help if I augment the arsenic example in StatsModelComparisons.jl (using Stan) to generate an MCMCChain object with the log-pdfs and maybe Chris can do the same with Turing (using VarInfo?). I can take a look at Soss as well or maybe ask Chad?
i would store the log-pdfs in a separate section of the MCMCChain.
I fully agree with the second part of your reply. Although I find the pk-plots useful.
Would it help if I augment the arsenic example in StatsModelComparisons.jl (using Stan) to generate an MCMCChain object with the log-pdfs and maybe Chris can do the same with Turing (using VarInfo?). I can take a look at Soss as well or maybe ask Chad?
i would store the log-pdfs in a separate section of the MCMCChain.
I fully agree with the second part of your reply. Although I find the pk-plots useful.
I think Soss support would be great eventually, but Soss is definitely the least popular of the PPLs -- we can focus on getting Turing and Stan working first, then find a way to include Soss. It looks like the relevant objects are the MCMCChain object, which holds a sample from the posterior distribution, and the VarInfo
object, which holds information about the full model. I'm going through ArviZ.jl at the moment to see how they did it.
As for turning this into a full package, the inference algorithm itself is pretty simple, but looking through the LOO package in R, things like "Check to make sure the user didn't make a mistake when they input their data" or "Clean the data so it's usable" seem to make up most of the actual code and could be pretty time-consuming to write.
I'm currently implementing PSIS-LOO for matrices, following the API for the loo package in R as closely as possible. Functions converting from ArviZ.jl and Stanfit objects will probably end up being useful, although InferenceData
will probably be getting a revamp fairly soon to use Julia objects, so I'd hold off on ArviZ objects.
That looks pretty easy to do for Stan.jl (I don't know the details of ArviZ.jl).
Hi Carlos, not sure if you follow https://github.com/TuringLang/MCMCChains.jl/issues/270#issuecomment-863225324 .
Rob
Hi Carlos, not sure if you follow TuringLang/MCMCChains.jl#270 (comment) .
Rob
Hi Rob, I hadn't followed it, but I've made some comments now.
@itsdfish I have the PSIS code up here. Pull requests adding support for Chains objects from MCMCChains or for Stanfit objects would be very welcome.
Thanks @ParadaCarleton. I will have some time later this week. Would you consider the test file to be a good example of how to use the package?
Hi Carlos, thanks for this work!
I'm trying to temporarily incorporate your work into StatsModelComparisons.jl by adding ParetoSmooth.jl and RData.jl as dependencies:
(StatsModelComparisons) pkg> st
Project StatsModelComparisons v1.0.2
Status `~/.julia/dev/StatsModelComparisons/Project.toml`
[336ed68f] CSV v0.8.5
[a68b5a21] ParetoSmooth v0.1.0 `~/.julia/dev/ParetoSmooth`
[df47a6cb] RData v0.8.2
[3cdcf5f2] RecipesBase v1.1.1
[c1514b29] StanSample v3.1.0
[4c63d2b9] StatsFuns v0.9.8
[10745b16] Statistics
[8dfed614] Test
when I run a slightly modified version of your runtests.jl:
using ParetoSmooth
using Test
using Statistics
import RData
ProjDir = @__DIR__
cd(ProjDir)
let ogArray = RData.load("Example_Log_Likelihood_Array.RData")["x"]
global logLikelihoodArray = copy(permutedims(ogArray, [3, 1, 2]))
end
let ogWeights = RData.load("weightMatrix.RData")["weightMatrix"]
global rWeights = exp.(permutedims(reshape(ogWeights, 500, 2, 32), [3, 1, 2]))
end
rel_eff = RData.load("Rel_Eff.RData")["rel_eff"]
rPsis = RData.load("Psis_Object.RData")["psisObject"]
juliaPsis = psis(logLikelihoodArray)
relEffSpecified = psis(logLikelihoodArray, rel_eff)
@testset "ParetoSmooth.jl" begin
@test sqrt(mean((relEffSpecified.weights - rWeights).^2)) ≤ .0001 # RMSE ≤ .0001 (~10%)
end
I get:
julia> include("/Users/rob/.julia/dev/StatsModelComparisons/research/paretosmooth.jl");
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
ERROR: LoadError: MethodError: no method matching check_input_validity_psis(::Matrix{Float64}, ::Vector{Float64})
Closest candidates are:
check_input_validity_psis(::AbstractArray{T, 3}, ::AbstractVector{T}) where T<:AbstractFloat at /Users/rob/.julia/dev/ParetoSmooth/src/ImportanceSampling.jl:236
Stacktrace:
[1] psis(log_ratios::Array{Float64, 3}, rel_eff::Vector{Float64}; source::String)
@ ParetoSmooth.ImportanceSampling ~/.julia/dev/ParetoSmooth/src/ImportanceSampling.jl:80
[2] psis(log_ratios::Array{Float64, 3}, rel_eff::Vector{Float64}) (repeats 2 times)
@ ParetoSmooth.ImportanceSampling ~/.julia/dev/ParetoSmooth/src/ImportanceSampling.jl:44
[3] top-level scope
@ ~/.julia/dev/StatsModelComparisons/research/paretosmooth.jl:18
[4] include(fname::String)
@ Base.MainInclude ./client.jl:444
[5] top-level scope
@ REPL[6]:1
in expression starting at /Users/rob/.julia/dev/StatsModelComparisons/research/paretosmooth.jl:18
If I understand the setup right, ;source=:mcmc
for this example is not required. I tried different options.
I get a similar error if I try this with a reshaped log_lik matrix generated in Stan in the cars_waic example. I think I need to reshape that in an Array with [1000, 50, 4]. log_lik contains an appended-chains version.
In order to be able to run the cars-waic example in SMC.jl I also added temporarily StanSample.jl, but that is not affecting above initial test script.
Aah:
julia> size(rPsis["log_weights"])
(1000, 32)
julia> size(rPsis["diagnostics"]["pareto_k"])
(32,)
julia> size(rPsis["diagnostics"]["n_eff"])
(32,)
it looks like my reshaping of log_lik is wrong and I need to permute the dims.
Ok, the problem is the shape of log_ratios
when check_input_validity_psis()
is called.
If I use the original format of log_ratios ( size(log_ratios) = (32, 500,2)
) it passes that test and:
ParetoSmooth.jl: Test Failed at /Users/rob/.julia/dev/StatsModelComparisons/research/paretosmooth.jl:30
Expression: sqrt(mean((relEffSpecified.weights - rWeights) .^ 2)) ≤ 0.0001
Evaluated: 0.00040040987501849365 ≤ 0.0001
But with the upper bound at 0.001 it will succeed. Not sure it that upper bound is too liberal.
Sorry about that -- I accidentally messed up my Github version control and used one that had a bug in it. Version 0.1.1 should be working. RData shouldn't be necessary, it's just a test requirement.
No problem, needed to look at it more closely anyway. It seems to be working now with the log_lik array generated in the car_waic example although the results appear rather different.
I'll also do some more checks tomorrow.
Edit: And the test is now indeed within 10%.
Attached pk_plots of 2 runs where pk is the current version StatsModelComparisons.jl (top plot) and pareto_k the ParetoSmooth.jl k values (bottom plot). Of course the pk_plots will be different for each run, they definitely look reasonable.
Hi Rob,
Can you seed the RNG so that both receive the same input? That might be a more direct comparison.
In a sense both do get the exact same log_lik matrix as input (4000 x 50). Waics are of course identical.
I'm trying to figure out how to compute this from the Psis structure fields:
ll = reshape(nt_cars.log_lik, 50, 1000, 4); # Log_lik is 50 x 4000 in the NamedTuple
# Will soon switch to the Tables.jl api
psis_ll = psis(ll)
(;weights, pareto_k, ess, tail_len, posterior_sample_size, data_size) = psis_ll; # Only in Julia 1.7 and up
Adding Random.seed!(123)
both before calling psisloo()
and before calling psis()
makes no difference.
Sorry for noise. I thought that you were passing two different sets of chains, which would provide different results. I don't know much about the diagnostics. Do you have reason to expect the diagnostic plots to differ when given the same input?
I was trying to compute the waic based on the weights computed by psis()
and think I can do something like:
lwp = deepcopy(ll)
lwp += psis_ll.weights;
lwpt = reshape(lwp, 50, 4000)';
loos = reshape(logsumexp(lwpt; dims=1), size(lwpt, 2));
@show loo = sum(loos)
@show 2loo
which results in:
-2 * (sum(lppd) - sum(pwaic)) = 421.73093081708123
waic(log_lik) = (WAIC = 421.7287980706158, lppd = -206.5999724777712, penalty = 4.264426557536728, std_err = 16.297027050407028)
-2loo = 421.1547017803024
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
loo = sum(loos) = 208.11604949369575
2loo = 416.2320989873915
which is close. But is that close enough? The R results were near 421, but I'll check that.
To answer your question, no, I don't think so. The plots only change when a new set of draws is generated.
R results:
> data(cars)
> m <- quap(
+ alist(
+ dist ~ dnorm(mu,sigma),
+ mu <- a + b*speed,
+ a ~ dnorm(0,100),
+ b ~ dnorm(0,10),
+ sigma ~ dexp(1)
+ ) , data=cars )
> set.seed(94)
> post <- extract.samples(m,n=1000)
>
> ## R code 7.20
> n_samples <- 1000
> logprob <- sapply( 1:n_samples ,
+ function(s) {
+ mu <- post$a[s] + post$b[s]*cars$speed
+ dnorm( cars$dist , mu , post$sigma[s] , log=TRUE )
+ } )
>
> ## R code 7.21
> n_cases <- nrow(cars)
> lppd <- sapply( 1:n_cases , function(i) log_sum_exp(logprob[i,]) - log(n_samples) )
>
> ## R code 7.22
> pWAIC <- sapply( 1:n_cases , function(i) var(logprob[i,]) )
>
> ## R code 7.23
> -2*( sum(lppd) - sum(pWAIC) )
[1] 423.3169
But not using the quadratic approximation but Stan proper and taking 4000 samples:
> m <- ulam(
+ alist(
+ dist ~ dnorm(mu,sigma),
+ mu <- a + b*speed,
+ a ~ dnorm(0,100),
+ b ~ dnorm(0,10),
+ sigma ~ dexp(1)
+ ) , data=cars , chains=4 , iter=2000)
recompiling to avoid crashing R session
SAMPLING FOR MODEL 'cd82d23a4157bed9d477761c28767a10' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 1.6e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
Chain 1: Adjust your expectations accordingly!
.. <snipped>
Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4:
Chain 4: Elapsed Time: 0.838425 seconds (Warm-up)
Chain 4: 0.825601 seconds (Sampling)
Chain 4: 1.66403 seconds (Total)
Chain 4:
>
>
> post <- extract.samples(m, n=4000)
> n_samples <- 4000
> logprob <- sapply( 1:n_samples ,
+ function(s) {
+ mu <- post$a[s] + post$b[s]*cars$speed
+ dnorm( cars$dist , mu , post$sigma[s] , log=TRUE )
+ } )
>
> ## R code 7.21
> n_cases <- nrow(cars)
> lppd <- sapply( 1:n_cases , function(i) log_sum_exp(logprob[i,]) - log(n_samples) )
>
> ## R code 7.22
> pWAIC <- sapply( 1:n_cases , function(i) var(logprob[i,]) )
>
> ## R code 7.23
> -2*( sum(lppd) - sum(pWAIC) )
[1] 421.0427
A difference of about 4 seems quite large if the the R and Julia code are using the same algorithm and settings with the same inputs. It is definitely more than rounding error. Hmmm.
I'll try to use the loglik matrix from R to check both psi's and psisloo.
Using the R generated matrix:
julia> include("/Users/rob/.julia/dev/StatisticalRethinkingStan/research/cars_waic_R.jl");
-2 * (sum(lppd) - sum(pwaic)) = 421.0426501217422
waic(log_lik) = (WAIC = 421.040729342191, lppd = -206.67976595825237, penalty = 3.840598712843116, std_err = 16.161917924434487)
-2loo = 420.55434215126473
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
loo = sum(loos) = 208.03612793433302
2loo = 416.07225586866605
Those values look much closer.
We'll, with the existing psisloo(). But still somewhat off with the new psis().
Sorry, are these comparisons of the WAIC with PSIS-LOO, or of the R and Julia versions, or of the new and old Julia versions? And are they being calculated with the newest version on Github (The main
version, not the 1.0 release)?
These are comparisons between psis() in ParetoSmooth.jl and psisloo() in StatsModelComparisons.jl.
Initially I compared them using Julia & StanJulia but the last results above were using a loglik matrix generated by R.
The results of the today's released version (v0.1.0) of ParetoSmooth.jl are identical to above results.
These are comparisons between psis() in ParetoSmooth.jl and psisloo() in StatsModelComparisons.jl.
Initially I compared them using Julia & StanJulia but the last results above were using a loglik matrix generated by R.
The results of the today's released version (v0.1.0) of ParetoSmooth.jl are identical to above results.
Can you try again with v0.1.1?
Hmm, this is the result:
julia> include("/Users/rob/.julia/dev/StatisticalRethinkingStan/research/cars_waic_R.jl");
-2 * (sum(lppd) - sum(pwaic)) = 421.0426501217422
waic(log_lik) = (WAIC = 421.040729342191, lppd = -206.67976595825237, penalty = 3.840598712843116, std_err = 16.161917924434487)
-2loo = 420.55434215126473
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
loo = sum(loos) = 208.03612793433302
2loo = 416.07225586866605
from running this:
using StatsModelComparisons, Statistics
using StatsFuns, CSV, DataFrames, Random
using ParetoSmooth
#using StatisticalRethinking
ProjDir = @__DIR__
df = CSV.read(joinpath(ProjDir, "cars_logprob.csv"), DataFrame)
log_lik = Matrix(df)';
n_sam, n_obs = size(log_lik)
lppd = reshape(logsumexp(log_lik .- log(n_sam); dims=1), n_obs);
pwaic = [var(log_lik[:, i]) for i in 1:n_obs];
@show -2(sum(lppd) - sum(pwaic))
println()
@show waic(log_lik)
println()
#Random.seed!(123)
loo, loos, pk = psisloo(log_lik);
@show -2loo
#Random.seed!(123)
ll = reshape(Matrix(df), 50, 1000, 4);
psis_ll = psis(ll);
lwp = deepcopy(ll);
lwp += psis_ll.weights;
lwpt = Matrix(reshape(lwp, 50, 4000)');
loos = reshape(logsumexp(lwpt; dims=1), size(lwpt, 2));
@show loo = sum(loos)
@show 2loo
if isdefined(Main, :StatisticalRethinking)
pk_plot(pk)
savefig(joinpath(ProjDir, "pk.png"))
pk_plot(psis_ll.pareto_k)
savefig(joinpath(ProjDir, "pareto_k.png"))
end
with installed packages:
(StatisticalRethinkingStan) pkg> st
Status `~/.julia/dev/StatisticalRethinkingStan/Project.toml`
[80f14c24] AbstractMCMC v3.2.1
[6e4b80f9] BenchmarkTools v1.1.0
[5f4fecfd] BrowseTables v0.3.0
[336ed68f] CSV v0.8.5
[31c24e10] Distributions v0.25.7
[38e38edf] GLM v1.5.1
[c7f686f2] MCMCChains v4.13.0
[d9ec5142] NamedTupleTools v0.13.7
[d96e819e] Parameters v0.12.2
[a68b5a21] ParetoSmooth v0.1.1 `~/.julia/dev/ParetoSmooth`
[08abe8d2] PrettyTables v1.1.0
[276daf66] SpecialFunctions v1.5.1
[fbd8da12] StanOptimize v2.3.0
[e4723793] StanQuap v1.0.5
[c1514b29] StanSample v3.1.0
[2d09df54] StatisticalRethinking v3.4.3
[4c63d2b9] StatsFuns v0.9.8
[854dedd9] StatsModelComparisons v1.0.1
[a41e6734] StructuralCausalModels v1.0.8
[bd369af6] Tables v1.4.4
If you have rethinking
installed in R, I run a slightly modified:
## R code 7.19
data(cars)
m <- ulam(
alist(
dist ~ dnorm(mu,sigma),
mu <- a + b*speed,
a ~ dnorm(0,100),
b ~ dnorm(0,10),
sigma ~ dexp(1)
) , data=cars , chains=4 , iter=2000)
set.seed(94)
post <- extract.samples(m, n=1000)
## R code 7.20
n_samples <- 1000
logprob <- sapply( 1:n_samples ,
function(s) {
mu <- post$a[s] + post$b[s]*cars$speed
dnorm( cars$dist , mu , post$sigma[s] , log=TRUE )
} )
to generate the logprob matrix. Modified to use ulam() and not quap().
@goedman
Carlos, you probably understand this much better than I do, but both PSIS and Waic are to compare models, so a constant offset in the values might be less important?
That's correct -- you might want to double-check that's not the problem. If the answers are off by a constant offset, this shouldn't matter. In addition, are the rel_eff
arguments specified and the same between Julia and R? If not, there might be a problem with MCMCChains' ESS method returning different answers, since there are lots of methods for calculating ESS.
If neither of these work, try changing these two lines:
@turbo @. θ_hats = 1 / sample[len] + (1 - sqrt((m+1) / $(1:m))) / prior / quartile
with
@turbo @. θ_hats = 1 / sample[len] + (1 - sqrt(m / $((1:m) - .5))) / prior / quartile
@turbo @. tail = GPD.gpd_quantile($(1:len) / (len + 1), ξ, σ) + cutoff
with:
@turbo @. tail = GPD.gpd_quantile(($(1:len) - .5) / len, ξ, σ) + cutoff
Try it changing one line at a time, then after changing both lines if that doesn't work. Neither of these should make a meaningful difference -- they're slightly different ECDF interpolations that should be off by at worst 1/1000 = .1% -- but I might be wrong about one or both of them. When I tested they didn't seem to result in a big error relative to the R version, so I'd be very surprised if this causes a 4 point difference in the ELPD calculation.Hi Carlos,
Have tried all three, no change in the loo value in either step.
The update to GDP.jl needed to be:
@turbo @. θ_hats = 1 / sample[len] + (1 - sqrt(m / $((1:m) .- .5))) / prior / quartile
Not quite sure why I needed to add the broadcast '.' before the - .5
.
| In addition, are the rel_eff arguments specified and the same between Julia and R?
I believe the settings in R and psisloo() in StatsModelsComparisons.jl are identical, at least the values were a good match in about 8 or so example cases I used for testing before releasing StatsModelsComparisons.jl. At the same time I have a hard time comparing the handling of rel_eff between the 2 Julia versions. psisloo() certainly does not use MCMCChains' ESS method. Doing the Turing test with the car data might shed more light on this.
Hi Carlos,
Have tried all three, no change in the loo value in either step.
The update to GDP.jl needed to be:
@turbo @. θ_hats = 1 / sample[len] + (1 - sqrt(m / $((1:m) .- .5))) / prior / quartile
Not quite sure why I needed to add the broadcast '.' before the
- .5
.| In addition, are the rel_eff arguments specified and the same between Julia and R?
I believe the settings in R and psisloo() in StatsModelsComparisons.jl are identical, at least the values were a good match in about 8 or so example cases I used for testing before releasing StatsModelsComparisons.jl. At the same time I have a hard time comparing the handling of rel_eff between the 2 Julia versions. psisloo() certainly does not use MCMCChains' ESS method. Doing the Turing test with the car data might shed more light on this.
So, are the comparisons between psisloo()
and the new psis()
? The R version of Loo
has had some pretty significant changes over the past few years, and from what I can see the new psis()
seems to give weights indistinguishable from those of loo
, so it might be that the old psisloo()
has drifted from the new R package? Alternatively, it might be that the testing code has a bug in it.
(As for the broadcast, my original had a typo -- $((1:m) - .5)
, rather than ($(1:m) - .5)
. In the former, the $
prevents broadcasting over the minus sign, while in the latter it only keeps it from broadcasting over the sequence 1:m
.)
The results of the tests are that psisloo() and R’s psis remain identical (I just reran that after upgrading all R packages). The new psis() (yours) gives a slightly different value. I suggest we wait until you are done with the loo extensions because it is very possible my test code is at fault. If that’s not the case we should look at the ess values.
In the research
subdirectory of StatisticalRethinkingStan I added a 2nd check using the chimpanzees dataset from R's rethinking
. Here the K_L_divergence differences are quite big:
julia> include("/Users/rob/.julia/dev/StatisticalRethinkingStan/research/chimpanzees/chimpanzees_psis");
/var/folders/l7/pr04h0650q5dvqttnvs8s2c00000gn/T/jl_xxoffZ/m11_4s.stan updated.
-2 * (sum(lppd) - sum(pwaic)) = 680.4253578535058
waic(log_lik) = (WAIC = 680.4243767462555, lppd = -338.25046442596323, penalty = 1.9617239471644632, std_err = 9.255022377796303)
-2loo = 680.1994143478744
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
loo = sum(loos) = 3842.0770399859157
2loo = 7684.154079971831
If you have rethinking
installed in R, ?PSIS
shows the example R code to run it. I'll check this example with Chris' code.
Given that you mentioned the possibility of more packages for model selection purposes, maybe it makes sense to create a ModelSelection Github organization to have an anker point? Just a thought.
The results of the tests are that psisloo() and R’s psis remain identical (I just reran that after upgrading all R packages). The new psis() (yours) gives a slightly different value. I suggest we wait until you are done with the loo extensions because it is very possible my test code is at fault. If that’s not the case we should look at the ess values.
In the
research
subdirectory of StatisticalRethinkingStan I added a 2nd check using the chimpanzees dataset from R'srethinking
. Here the K_L_divergence differences are quite big:julia> include("/Users/rob/.julia/dev/StatisticalRethinkingStan/research/chimpanzees/chimpanzees_psis"); /var/folders/l7/pr04h0650q5dvqttnvs8s2c00000gn/T/jl_xxoffZ/m11_4s.stan updated. -2 * (sum(lppd) - sum(pwaic)) = 680.4253578535058 waic(log_lik) = (WAIC = 680.4243767462555, lppd = -338.25046442596323, penalty = 1.9617239471644632, std_err = 9.255022377796303) -2loo = 680.1994143478744 [ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not. loo = sum(loos) = 3842.0770399859157 2loo = 7684.154079971831
If you have
rethinking
installed in R,?PSIS
shows the example R code to run it. I'll check this example with Chris' code.Given that you mentioned the possibility of more packages for model selection purposes, maybe it makes sense to create a ModelSelection Github organization to have an anker point? Just a thought.
Ahh, I just plan to build one package for model selection purposes, separately from the implementation of PSIS. This way, people can use PSIS for things like VI+Importance Sampling without having to install a bunch of model comparison tools.
Note that when rel_eff
is specified, ESS methods shouldn't matter (since ESS is only used in calculations of rel_eff
).
The model comparison package has been put here, and now implements a LOO method. The answers looked sensible but I'm not 100% sure they're good either.
Thank you.
I modified the function definition a bit:
function psis_loo(log_likelihood::ArrayType,
score::Function=identity; # Default to log_likelihood (elpd) score function
rel_eff=similar(log_likelihood, 0)
) where {F<:AbstractFloat, ArrayType<:AbstractArray{F, 3}}
because loo
in similar(loo, 0)
is undefined.
The result I'm getting for the cars example:
-2 * (sum(lppd) - sum(pwaic)) = 421.45579489140215
waic(log_lik) = (WAIC = 421.453730532524, lppd = -206.59917968923025, penalty = 4.127685577031707, std_err = 16.19157000365693)
-2loo = 420.8822100662317
Using ParetoSmooth' psis()
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
loo = sum(loos) = 208.1168127112213
2loo = 416.2336254224426
Using Bombe's psis_loo()
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
cars_loo = psis_loo(ll);
sum(cars_loo.pointwise.estimate) = -204.9399874711436
Thank you.
I modified the function definition a bit, because
loo
insimilar(loo, 0)
is undefined.
Thanks. I have no idea how my testing didn't catch that. The function shouldn't have run at all when I tried that.
The result I'm getting for the cars example:
-2 * (sum(lppd) - sum(pwaic)) = 421.45579489140215 waic(log_lik) = (WAIC = 421.453730532524, lppd = -206.59917968923025, penalty = 4.127685577031707, std_err = 16.19157000365693) -2loo = 420.8822100662317 Using ParetoSmooth' psis() [ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not. loo = sum(loos) = 208.1168127112213 2loo = 416.2336254224426 Using Bombe's psis_loo() [ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not. cars_loo = psis_loo(ll); sum(cars_loo.pointwise.estimate) = -204.9399874711436
So it looks like the problem isn't in the addition code.
I've tried updating the code in ParetoSmooth; can you see if the new version works? It seems to give identical weights (up to floating point error) to R's Loo package. (isapprox
returns true
)
Makes little difference. I think, given above your check on the weights, your code at this point is more trustworthy than the old code. I now get:
-2 * (sum(lppd) - sum(pwaic)) = 421.46518110423153
waic(log_lik) = (WAIC = 421.46312317812544, lppd = -206.61673833982252, penalty = 4.114823249240203, std_err = 16.209320889781484)
-2loo = 420.900376736517
Using ParetoSmooth' psis()
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
loo = sum(loos) = 208.0992394555812
2loo = 416.1984789111624
Using Bombe's psis_loo()
[ Info: Adjusting for autocorrelation. If the posterior samples are not autocorrelated, specify the source of the posterior sample using the keyword argument `source`. MCMC samples are always autocorrelated; VI samples are not.
sum(cars_loo.pointwise.estimate) = -204.91024688501258
For completeness, the pareto_k plots. First plot is basically cars_loo.pointwise.pareto_k
, the 2nd plot is pk as computed in the old psis_loo(). I have often wondered why the old pk's go up that much. Maybe your autocorrelation correction is better?
AIC, BIC, and DIC have all been fully supplanted by WAIC, WBIC, and LOO-IC, and there's no longer any real reason to use them. I would suggest removing them from the package to avoid confusing beginners. Someone seeing them used in the package may be led to believe that using them is a good idea, that they provide information different from that provided by the newer information criteria, or that there must be some reason they're in the package. BIC is especially problematic: The name simultaneously gives new users the false impression that the BIC is actually Bayesian and that WAIC/LOO-IC are not Bayesian.
It might be a good idea to implement WBIC to have a replacement for BIC, since BIC has a different use case from the other information criteria in this package (asymptotically maximizing the probability of selecting the correct model, rather than maximizing the expected predictive accuracy of the model). I would also suggest referring to it as the Watanabe-Schwarz Information Criterion to avoid giving the impression that WBIC is "Especially" Bayesian.