Extra test for Turing.jl

TuringLang / ParetoSmooth.jl

An implementation of PSIS algorithms in Julia.

http://turinglang.org/ParetoSmooth.jl/

MIT License

19 stars 12 forks source link

Extra test for Turing.jl #14

Closed ParadaCarleton closed 3 years ago

ParadaCarleton commented 3 years ago

@itsdfish Do you also think you could add a test for result accuracy by using the example log-likelihood array I've included in the RData file? (Or, alternatively, by building a log-likelihood array some other way.) The Chains() constructor should let you build a Chains() object. While the results won't be exactly the same, verifying that the Pareto_k values aren't off by more than 0.1 and that the difference in the ELPD estimates is less than 4 times the MCSE would be good.

itsdfish commented 3 years ago

Hahaha. Of course.

I tried to replicate the deprecation warning without success. I also could not find record of a deprecation in the Turing and AbstractMCMC repos. Would you be able to provide more info about the deprecation and your package versions please?

Here is the version I am using on Julia 1.6.1:

(test) pkg> st
      Status `~/.julia/dev/ParetoSmooth/test/Project.toml`
  [94b1ba4f] AxisKeys v0.1.18
  [31c24e10] Distributions v0.25.11
  [c7f686f2] MCMCChains v4.13.1
  [df47a6cb] RData v0.8.3
  [fce5fe82] Turing v0.16.6
  [9a3f8284] Random
  [10745b16] Statistics
  [8dfed614] Test

I should be able to make the changes later this week.

goedman commented 3 years ago

Currently on J1.6.2 with 8 threads I get the following warning message:

┌ Warning: Conversion of RData.RDummy{0xfe} to Julia is not implemented
└ @ RData ~/.julia/packages/RData/OT7M6/src/convert.jl:198

and once Turing kicks in many of below info messages:

┌ Info: Found initial step size
└   ϵ = 0.4
┌ Info: Found initial step size
└   ϵ = 0.4
┌ Info: Found initial step size
└   ϵ = 0.2
┌ Info: Found initial step size
└   ϵ = 0.2
Sampling (4 threads) 100%|██████████████████████████████████████████████████████████| Time: 0:00:25
┌ Info: **Important Note:** The posterior log-likelihood must be computed with a `for` loop inside a
│ Turing model; broadcasting will result in all observations being treated as if they are a
└ single point.

Package versions (both ParetoSmooth and ParetoSmooth/test) appear similar to what Chris reported.

On Julie 1.7.0-beta3 and up there is an issue with AxisKeys.jl when printing out the PsisLoo object.

Turing.jl needs an update to AxisArrays.jl before MCMCChains.jl can be loaded I think. But in past Turing supported up to the current Julia release (1.6.2).

Not sure which version you were using Carlos.

ParadaCarleton commented 3 years ago

Oh, weird, the deprecation warning isn't showing up for me either anymore. I assume I made some kind of mistake.

ParadaCarleton commented 3 years ago

@itsdfish I took a look at the pareto_k values, and they seem a little weird -- I don't think a Gaussian with a sample size of 100 should have tails that thick. Can you build a test to double-check the results?

itsdfish commented 3 years ago

@ParadaCarleton, sorry for the delay. I have been unusually busy. I should be able to get this this by Saturday at the latest, but I will aim for sooner.

itsdfish commented 3 years ago

@ParadaCarleton, I don't know how to test the kvalues without a ground truth. What you have here appears to achieve that goal.

I believe the problem with k-values computed from the Turing output is that you changed the dimensions for the MCMC samples. Turing produces a 3 dimensional array where the dimensions are [samples, parameters, chains] . I believe Stan.jl does the same, @goedman can you confirm? Is there a reason you made this change? If not, I will submit a pull request with changes

goedman commented 3 years ago

Yes, that is correct.

All currently released versions of CmdStan and StanJulia packages by default read samples in as an Array[draws, params, chains]. Back in 2010 that seemed logical given the structure of cmdstan's .csv files. It was also used by Mamba.Chains (taken as the starting point for the initial version of MCMCChains). And in simple examples you often work with a column of draws.

Recently I've added 2 output_formats (:namedtuple and :keyedarray) that return draws (and samples from generated_quantities such as log_lik) in the format [params/obs, draws/samples, chains].

In StatsModelComparisons.jl I mostly used the appended chains format [draws, params].

I tried to make sense of a suggestion/discussion in Turing but ended up not sure what came out of the discussion.

itsdfish commented 3 years ago

@goedman, thanks for your detailed reply. Considering that the most popular MCMC samplers return an array in the form [draw, parm, chain], it seems like we should keep that convention.

ParadaCarleton commented 3 years ago

@ParadaCarleton, I don't know how to test the kvalues without a ground truth. What you have here appears to achieve that goal.

The easiest way to do that would be to build the same model with the same data in Julia and R. (I would suggest making the "data" equal to the quantiles of a standard normal distribution, for quantiles (1:32) ./ 33.) We can then sample with Stan, save the results of loo in Stan, and then compare these to the results in Julia to make sure they're close enough. If we draw enough samples, the estimates should be close.

I believe the problem with k-values computed from the Turing output is that you changed the dimensions for the MCMC samples. Turing produces a 3 dimensional array where the dimensions are [samples, parameters, chains]. I believe Stan.jl does the same, @goedman can you confirm? Is there a reason you made this change? If not, I will submit a pull request with changes

Yes! Thank you, I thought I had successfully modified the function to convert from [draw, param, chain] to [param, draw, chain], but it looks like I made a mistake along the way. Since the tests passed, I thought this meant the results were good -- actually, this is kind of why I'd like a test comparing the Stan and Turing results -- the current tests passed since the error only affected the numerical results, not the shape of the array that got returned. It's not possible to compare the results to the true values, but I want to make sure that any differences between Stan and Turing are within the range you'd expect from sampling error.

itsdfish commented 3 years ago

Sorry. I am a bit confused about (1) the reason for converting [draw, param, chain] to [param, draw, chain] and (2) why your tests under "basic arrays" do not already test the accuracy of the loo results. All that is required is reverting back to [draw, param, chain] or properly reshaping the array extracted from the chain object. My concern is that this is a lot of work to create redundant tests that are more ambiguous: the mcmc samples may differ, but your tests seem to indicate that the loo functions are highly similar between R and Julia. So what we end up testing is whether the NUTS implementations are similar, but that already tested in AdvancedHMC.

I think it makes the most sense to order arrays as [draw, param, chain] because most packages work that way in Julia, as Rob noted. In either case, a simple warning might prevent a person from using an array of the wrong shape:

const ARRAY_DIMS_WARNING = "The supplied array of mcmc samples indicates you have more parameters than mcmc samples.
This is possible, but highly unusual. Please check that your array of mcmc samples has the following dimensions: [n_samples,n_parms,n_chains]."

n_posterior, n_parms, n_chains = size(samples)
if n_parms > n_posterior
    @info ARRAY_DIMS_WARNING
end

ParadaCarleton commented 3 years ago

Sorry. I am a bit confused about (1) the reason for converting [draw, param, chain] to [param, draw, chain] and (2) why your tests under "basic arrays" do not already test the accuracy of the loo results. All that is required is reverting back to [draw, param, chain] or properly reshaping the array extracted from the chain object. My concern is that this is a lot of work to create redundant tests that are more ambiguous: the mcmc samples may differ, but your tests seem to indicate that the loo functions are highly similar between R and Julia. So what we end up testing is whether the NUTS implementations are similar, but that already tested in AdvancedHMC.

I think it makes the most sense to order arrays as [draw, param, chain] because most packages work that way in Julia, as Rob noted. In either case, a simple warning might prevent a person from using an array of the wrong shape:
const ARRAY_DIMS_WARNING = "The supplied array of mcmc samples indicates you have more parameters than mcmc samples.
This is possible, but highly unusual. Please check that your array of mcmc samples has the following dimensions: [n_samples,n_parms,n_chains]."

n_posterior, n_parms, n_chains = size(samples)
if n_parms > n_posterior
    @info ARRAY_DIMS_WARNING
end

Yep, the reason why I flipped up was an accident -- I mixed up the [draw, param, chain] argument with the [data_point, draw, chain] argument required by the psis_loo function.

The reason is I want to test that Julia/R results are similar is to make sure that the functions are extracting the log-likelihood from Turing objects correctly, not that the Stan/Turing results behave similarly. If the log-likelihood is being extracted correctly, then any differences between the R and Julia implementations should be less than the expected sampling error.

I like that warning, it looks good!

itsdfish commented 3 years ago

@ParadaCarleton, thanks for clarifying. I will submit a PR with a fix for the dimensions, a warning for the dimensions, and an additional test that shows the Turing method produces the same pointwise log likelihoods as the method that accepts a user-defined function. I can look into the test that you describe as time permits. Unfortunately, my schedule is limited over the next week or so. In the meantime, the fixes should give the correct result with a fair degree of certainty.

goedman commented 3 years ago

@itsdfish I do agree that most packages seem to use [draws, params, chains]. The reason I started to experiment is that separation between draws and chains, i.e. to append all chains one has to resort to something like:

        ndraws, nparams, nchains = size(ma);
        rma = reshape(permutedims(ma, [1, 3, 2]), ndraws*nchains, nparams);
        mean(rma, dims=1) |> display

or overload MCMCChains' chainscat().

Edit: There is also the ArViz discussion (which is quite interesting and might use or include AlgebraOfGraphics). ArViz uses in some cases [chains, draws, params]. That does make sense I think.

itsdfish commented 3 years ago

@goedman, that is cumbersome indeed. I wonder whether there are other operations that introduce tradeoffs between the approaches. Knowing the layout of the tradeoff space might help guide a decision.

goedman commented 3 years ago

Good point @itsdfish! Usually it takes me re-doing a substantial part of StatisticalRethinking for the trade-off landscape to emerge.

On my list of options for StanJulia and StatisticalRethinking v4 is also a KeyedArray chain structure using [draws, chains, params] (which works maybe the best of the 3 KeyedArray options). I prefer that over the basic Tables.jl based chains object output_format option in StanSample.jl. Not completely fair of course as Tables.jl is a package developer interface while AxisKeys.jl is more user focused.

Anyway, these are typical results I'm getting for Stan and Turing:

Stan:

[ Info: Some Pareto k values are slightly high (>0.5); some pointwise estimates may be slow to converge or have high variance.
Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -63.05 │     6.46 │ -1.26 │    0.13 │
│ naive_est │ -59.25 │     4.83 │ -1.18 │    0.10 │
│   overfit │   3.81 │     1.89 │  0.08 │    0.04 │
└───────────┴────────┴──────────┴───────┴─────────┘
Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -69.72 │     4.98 │ -1.39 │    0.10 │
│ naive_est │ -66.70 │     4.17 │ -1.33 │    0.08 │
│   overfit │   3.02 │     0.93 │  0.06 │    0.02 │
└───────────┴────────┴──────────┴───────┴─────────┘
[ Info: Some Pareto k values are slightly high (>0.5); some pointwise estimates may be slow to converge or have high variance.
Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -63.74 │     6.44 │ -1.27 │    0.13 │
│ naive_est │ -59.05 │     4.67 │ -1.18 │    0.09 │
│   overfit │   4.70 │     1.91 │  0.09 │    0.04 │
└───────────┴────────┴──────────┴───────┴─────────┘
┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5.1s │   0.00 │  0.00 │   0.67 │
│ m5.3s │  -0.69 │  0.33 │   0.33 │
│ m5.2s │  -6.67 │  4.66 │   0.00 │
└───────┴────────┴───────┴────────┘

Turing:

┌ Warning: Some Pareto k values are very high (>0.7), indicating that PSIS has failed to approximate the true distribution.
└ @ ParetoSmooth ~/.julia/dev/ParetoSmooth/src/LooStructs.jl:96
Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -63.00 │     6.55 │ -1.26 │    0.13 │
│ naive_est │ -59.22 │     4.89 │ -1.18 │    0.10 │
│   overfit │   3.78 │     1.93 │  0.08 │    0.04 │
└───────────┴────────┴──────────┴───────┴─────────┘
Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -69.68 │     4.96 │ -1.39 │    0.10 │
│ naive_est │ -66.70 │     4.16 │ -1.33 │    0.08 │
│   overfit │   2.98 │     0.92 │  0.06 │    0.02 │
└───────────┴────────┴──────────┴───────┴─────────┘
[ Info: Some Pareto k values are slightly high (>0.5); some pointwise estimates may be slow to converge or have high variance.
Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -63.69 │     6.44 │ -1.27 │    0.13 │
│ naive_est │ -59.03 │     4.68 │ -1.18 │    0.09 │
│   overfit │   4.66 │     1.90 │  0.09 │    0.04 │
└───────────┴────────┴──────────┴───────┴─────────┘
┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5_1t │   0.00 │  0.00 │   0.67 │
│ m5_3t │  -0.69 │  0.42 │   0.33 │
│ m5_2t │  -6.68 │  4.74 │   0.00 │
└───────┴────────┴───────┴────────┘

With SR/ulam():

   PSIS    SE dPSIS  dSE pPSIS weight

m5.1u 126.0 12.83 0.0 NA 3.7 0.67 m5.3u 127.4 12.75 1.4 0.75 4.7 0.33 m5.2u 139.5 9.95 13.6 9.33 3.0 0.00

SR includes the -2 factor.

itsdfish commented 3 years ago

@goedman, thanks for sharing those results. Are those based on estimates of the same data? If so, this means that testing loo from independent chains would be imprecise. The SE of the difference is

julia> sqrt(4.96^2 + 4.98^2)
7.02865563248051

in the best case above.

goedman commented 3 years ago

@itsdfish, I would be more concerned if the LooCompare results ended up very different.

There are several Pareto k info/warning messages. In Statistical Rethinking the suggestion is made to replace the Normal distribution for divorce_rate by a Student-t distribution (with thicker tails). This reduces the relative influence of outliers such as Idaho and Maine on the predictions.

Furthermore, as I mentioned once before, I am a bit concerned why Turing inferences appear to vary more than Stan inferences. I should really perform several more simulations on this topic.

Above results tell me that if prediction accuracy is my primary concern, it is expected that the divorce_rate ~ median_age_at_marriage model has better out-of-sample performance than the other 2 models (m5.2 is divorce_rate ~ marriage_rate and m5.3 is divorce_rate ~ median_age_at_marriage + marriage_rate).

I'm sure @ParadaCarleton can explain this way better than I can.

ParadaCarleton commented 3 years ago

@goedman These look great to me! Can I see the Pareto k values? My guess is that Stan's Pareto k values are just barely under 0.7 for these outliers, while Turing's are a little bit above. As long as the maximum difference doesn't exceed 0.1 or so I wouldn't be too worried.

goedman commented 3 years ago

Turing:

┌ Warning: Some Pareto k values are very high (>0.7), indicating that PSIS has failed to approximate the true distribution.
└ @ ParetoSmooth ~/.julia/dev/ParetoSmooth/src/LooStructs.jl:96
Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -63.00 │     6.55 │ -1.26 │    0.13 │
│ naive_est │ -59.22 │     4.89 │ -1.18 │    0.10 │
│   overfit │   3.78 │     1.93 │  0.08 │    0.04 │
└───────────┴────────┴──────────┴───────┴─────────┘

Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -69.68 │     4.96 │ -1.39 │    0.10 │
│ naive_est │ -66.70 │     4.16 │ -1.33 │    0.08 │
│   overfit │   2.98 │     0.92 │  0.06 │    0.02 │
└───────────┴────────┴──────────┴───────┴─────────┘

[ Info: Some Pareto k values are slightly high (>0.5); some pointwise estimates may be slow to converge or have high variance.
Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -63.69 │     6.44 │ -1.27 │    0.13 │
│ naive_est │ -59.03 │     4.68 │ -1.18 │    0.09 │
│   overfit │   4.66 │     1.90 │  0.09 │    0.04 │
└───────────┴────────┴──────────┴───────┴─────────┘

┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5_1t │   0.00 │  0.00 │   0.67 │
│ m5_3t │  -0.69 │  0.42 │   0.33 │
│ m5_2t │  -6.68 │  4.74 │   0.00 │
└───────┴────────┴───────┴────────┘

m5 1t m5 3t m5 2t

goedman commented 3 years ago

Stan:

┌ Warning: Some Pareto k values are very high (>0.7), indicating that PSIS has failed to approximate the true distribution.
└ @ ParetoSmooth ~/.julia/dev/ParetoSmooth/src/LooStructs.jl:96
Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -62.96 │     6.46 │ -1.26 │    0.13 │
│ naive_est │ -59.23 │     4.86 │ -1.18 │    0.10 │
│   overfit │   3.73 │     1.85 │  0.07 │    0.04 │
└───────────┴────────┴──────────┴───────┴─────────┘

Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -69.73 │     4.97 │ -1.39 │    0.10 │
│ naive_est │ -66.71 │     4.14 │ -1.33 │    0.08 │
│   overfit │   3.02 │     0.96 │  0.06 │    0.02 │
└───────────┴────────┴──────────┴───────┴─────────┘

[ Info: Some Pareto k values are slightly high (>0.5); some pointwise estimates may be slow to converge or have high variance.
Results of PSIS-LOO-CV with 4000 Monte Carlo samples and 50 data points.
┌───────────┬────────┬──────────┬───────┬─────────┐
│           │  total │ se_total │  mean │ se_mean │
├───────────┼────────┼──────────┼───────┼─────────┤
│   loo_est │ -63.90 │     6.45 │ -1.28 │    0.13 │
│ naive_est │ -59.07 │     4.62 │ -1.18 │    0.09 │
│   overfit │   4.83 │     1.98 │  0.10 │    0.04 │
└───────────┴────────┴──────────┴───────┴─────────┘

┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5.1s │   0.00 │  0.00 │   0.72 │
│ m5.3s │  -0.94 │  0.36 │   0.28 │
│ m5.2s │  -6.77 │  4.65 │   0.00 │
└───────┴────────┴───────┴────────┘

m5 1s m5 3s m5 2s

goedman commented 3 years ago

Also did some more testing with Turing (not seeded). Particles summaries of chains do look ok and similar to Stan:

Turing runs:

(a = -0.00155 ± 0.1, σ = 0.821 ± 0.083, bA = -0.567 ± 0.11)
(a = 0.00156 ± 0.11, bM = 0.346 ± 0.13, σ = 0.952 ± 0.098)
(a = -0.00086 ± 0.097, bM = -0.0633 ± 0.16, σ = 0.826 ± 0.086, bA = -0.607 ± 0.16)
┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5_1t │   0.00 │  0.00 │   0.69 │
│ m5_3t │  -0.82 │  0.39 │   0.30 │
│ m5_2t │  -6.75 │  4.67 │   0.00 │
└───────┴────────┴───────┴────────┘
(a = 0.00164 ± 0.1, σ = 0.823 ± 0.084, bA = -0.566 ± 0.12)
(a = -0.000387 ± 0.11, bM = 0.349 ± 0.13, σ = 0.95 ± 0.099)
(a = -0.0015 ± 0.1, bM = -0.0596 ± 0.16, σ = 0.829 ± 0.088, bA = -0.607 ± 0.16)
┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5_1t │   0.00 │  0.00 │   0.68 │
│ m5_3t │  -0.75 │  0.35 │   0.32 │
│ m5_2t │  -6.43 │  4.73 │   0.00 │
└───────┴────────┴───────┴────────┘
(a = 0.000454 ± 0.1, σ = 0.822 ± 0.083, bA = -0.568 ± 0.11)
(a = -0.00252 ± 0.11, bM = 0.347 ± 0.13, σ = 0.949 ± 0.097)
(a = 0.000968 ± 0.1, bM = -0.0591 ± 0.16, σ = 0.829 ± 0.088, bA = -0.606 ± 0.16)
┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5_1t │   0.00 │  0.00 │   0.72 │
│ m5_3t │  -0.96 │  0.37 │   0.28 │
│ m5_2t │  -6.88 │  4.58 │   0.00 │
└───────┴────────┴───────┴────────┘
(a = 0.000295 ± 0.098, σ = 0.826 ± 0.087, bA = -0.568 ± 0.12)
(a = 0.00071 ± 0.11, bM = 0.346 ± 0.13, σ = 0.947 ± 0.094)
(a = 0.00109 ± 0.099, bM = -0.0605 ± 0.15, σ = 0.828 ± 0.087, bA = -0.606 ± 0.16)
┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5_1t │   0.00 │  0.00 │   0.70 │
│ m5_3t │  -0.85 │  0.36 │   0.30 │
│ m5_2t │  -6.62 │  4.52 │   0.00 │
└───────┴────────┴───────┴────────┘

Stan runs:

(a = -0.000515 ± 0.1, bA = -0.563 ± 0.11, sigma = 0.824 ± 0.084)
(a = -0.00297 ± 0.11, bM = 0.35 ± 0.13, sigma = 0.945 ± 0.097)
(a = 0.00047 ± 0.1, bA = -0.61 ± 0.16, bM = -0.0637 ± 0.16, sigma = 0.828 ± 0.089)
┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5.1s │   0.00 │  0.00 │   0.75 │
│ m5.3s │  -1.09 │  0.39 │   0.25 │
│ m5.2s │  -6.71 │  4.55 │   0.00 │
└───────┴────────┴───────┴────────┘
(a = 0.00189 ± 0.1, bA = -0.566 ± 0.11, sigma = 0.825 ± 0.086)
(a = -0.0014 ± 0.11, bM = 0.349 ± 0.13, sigma = 0.952 ± 0.1)
(a = 0.000481 ± 0.1, bA = -0.607 ± 0.16, bM = -0.0604 ± 0.16, sigma = 0.825 ± 0.086)
┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5.1s │   0.00 │  0.00 │   0.68 │
│ m5.3s │  -0.75 │  0.38 │   0.32 │
│ m5.2s │  -6.66 │  4.63 │   0.00 │
└───────┴────────┴───────┴────────┘
(a = 0.000583 ± 0.098, bA = -0.564 ± 0.12, sigma = 0.824 ± 0.087)
(a = -0.000339 ± 0.11, bM = 0.349 ± 0.13, sigma = 0.943 ± 0.095)
(a = 0.000904 ± 0.1, bA = -0.608 ± 0.16, bM = -0.0607 ± 0.16, sigma = 0.828 ± 0.087)
┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5.1s │   0.00 │  0.00 │   0.73 │
│ m5.3s │  -1.02 │  0.37 │   0.26 │
│ m5.2s │  -6.73 │  4.58 │   0.00 │
└───────┴────────┴───────┴────────┘
(a = 0.00394 ± 0.099, bA = -0.566 ± 0.11, sigma = 0.821 ± 0.085)
(a = 0.000164 ± 0.11, bM = 0.346 ± 0.13, sigma = 0.949 ± 0.1)
(a = 7.57e-5 ± 0.1, bA = -0.606 ± 0.16, bM = -0.0608 ± 0.16, sigma = 0.826 ± 0.088)
┌───────┬────────┬───────┬────────┐
│       │ d_PSIS │  d_SE │ weight │
├───────┼────────┼───────┼────────┤
│ m5.1s │   0.00 │  0.00 │   0.70 │
│ m5.3s │  -0.83 │  0.41 │   0.30 │
│ m5.2s │  -6.62 │  4.73 │   0.00 │
└───────┴────────┴───────┴────────┘

@itsdfish Sorry, didn't respond to your earlier question. Yes, all simulations use the same dataset.

itsdfish commented 3 years ago

@goedman, given that you have already compared Turing and Stan, I was wondering if you would be willing to add those additional tests?

goedman commented 3 years ago

Hi Chris ( @itsdfish @ParadaCarleton ) You mean adding StanSample to the test environment and running the exact same steps, as shown above?

goedman commented 3 years ago

Hi Chris (@itsdfish)

Not sure why you asked the question, but below is what I used.

Unfortunately this uses upcoming releases of both StatisticalRethinking and StanSample (both v4), which are substantial breaking releases. It is based on a KeyedArray chains format [draws, chains, params] and dropping the kwarg output_format in read_samples(model; output_format=...) in favor of read_samples(model, [:keyedarray, :dataframe, namedtuple, ...]; ...). As :keyedarray is the default, read_samples(model) will mostly suffice.

The changes in code are limited, but in examples and tests very substantial. As I'm traveling the next 3 weeks this will probably take most of this month for StanJulai and Sep for StatisticalRethinkingJulia. I'll keep the v4 branches on Github "reasonably" up to date.

using StanSample, ParetoSmooth, NamedTupleTools
using StatisticalRethinking

df = CSV.read(sr_datadir("WaffleDivorce.csv"), DataFrame);
scale!(df, [:Marriage, :MedianAgeMarriage, :Divorce])
data = (N=size(df, 1), D=df.Divorce_s, A=df.MedianAgeMarriage_s,
    M=df.Marriage_s)

stan5_1 = "
data {
    int < lower = 1 > N; // Sample size
    vector[N] D; // Outcome
    vector[N] A; // Predictor
}
parameters {
    real a; // Intercept
    real bA; // Slope (regression coefficients)
    real < lower = 0 > sigma;    // Error SD
}
transformed parameters {
    vector[N] mu;               // mu is a vector
    for (i in 1:N)
        mu[i] = a + bA * A[i];
}
model {
    a ~ normal(0, 0.2);         //Priors
    bA ~ normal(0, 0.5);
    sigma ~ exponential(1);
    D ~ normal(mu , sigma);     // Likelihood
}
generated quantities {
    vector[N] log_lik;
    for (i in 1:N)
        log_lik[i] = normal_lpdf(D[i] | mu[i], sigma);
}
";

stan5_2 = "
data {
    int N;
    vector[N] D;
    vector[N] M;
}
parameters {
    real a;
    real bM;
    real<lower=0> sigma;
}
transformed parameters {
    vector[N] mu;
    for (i in 1:N)
        mu[i]= a + bM * M[i];

}
model {
    a ~ normal( 0 , 0.2 );
    bM ~ normal( 0 , 0.5 );
    sigma ~ exponential( 1 );
    D ~ normal( mu , sigma );
}
generated quantities {
    vector[N] log_lik;
    for (i in 1:N)
        log_lik[i] = normal_lpdf(D[i] | mu[i], sigma);
}
";

stan5_3 = "
data {
  int N;
  vector[N] D;
  vector[N] M;
  vector[N] A;
}
parameters {
  real a;
  real bA;
  real bM;
  real<lower=0> sigma;
}
transformed parameters {
    vector[N] mu;
    for (i in 1:N)
        mu[i] = a + bA * A[i] + bM * M[i];
}
model {
  a ~ normal( 0 , 0.2 );
  bA ~ normal( 0 , 0.5 );
  bM ~ normal( 0 , 0.5 );
  sigma ~ exponential( 1 );
  D ~ normal( mu , sigma );
}
generated quantities{
    vector[N] log_lik;
    for (i in 1:N)
        log_lik[i] = normal_lpdf(D[i] | mu[i], sigma);
}
";

m5_1s = SampleModel("m5.1s", stan5_1)
rc5_1s = stan_sample(m5_1s; data)

m5_2s = SampleModel("m5.2s", stan5_2)
rc5_2s = stan_sample(m5_2s; data)

m5_3s = SampleModel("m5.3s", stan5_3)
rc5_3s = stan_sample(m5_3s; data)

if success(rc5_1s) && success(rc5_2s) && success(rc5_3s)

    nt5_1s = read_samples(m5_1s, :particles)
    NamedTupleTools.select(nt5_1s, (:a, :bA, :sigma)) |> display
    nt5_2s = read_samples(m5_2s, :particles)
    NamedTupleTools.select(nt5_2s, (:a, :bM, :sigma)) |> display
    nt5_3s = read_samples(m5_3s, :particles)
    NamedTupleTools.select(nt5_3s, (:a, :bA, :bM, :sigma)) |> display

    models = [m5_1s, m5_2s, m5_3s]
    loglikelihood_name = :log_lik
    loo_comparison = loo_compare(models)
    println()
    for (i, psis) in enumerate(loo_comparison.psis)
        psis |> display
        pk_plot(psis.pointwise(:pareto_k))
        savefig(joinpath(@__DIR__, "m5.$(i)s.png"))
    end
    println()
    loo_comparison |> display
end
#=
With SR/ulam():

   PSIS    SE dPSIS  dSE pPSIS weight

m5.1u 126.0 12.83 0.0 NA 3.7 0.67 m5.3u 127.4 12.75 1.4 0.75 4.7 0.33 m5.2u 139.5 9.95 13.6 9.33 3.0 0.00

=#

Statistical Rethinking uses the factor -2.

itsdfish commented 3 years ago

@goedman, sorry for the radio silence. I have been unusually busy with family and work. ParadaCarleton was interested in comparing the various Loo values between Turing and Stan and making a test. I noticed that the comparisons you were performing could form the basis for the requested tests. Thanks for sharing. I will work this into some tests as soon as I can.