Saving Posterior Summaries Revisited

itsdfish commented 6 years ago

Hi Rob-

You might recall that I created a feature request in https://github.com/goedman/Stan.jl/issues/46 for saving Stan's summary. At the time, Mamba sufficed. But now that CmdStan does not require Mamba, I was wondering if it might be a good idea to save Stan's output. Another benefit is that it makes output more comparable across different programs that interface with Stan.

Best regards,

Chris

bparbhu commented 6 years ago

Hi Chris,

I'm currently working on StanFeather for saving CmdStan output. I should have this up shortly as I'll be able to come back to working on it over the weekend but that will be our solution for saving out CmdStan output.

Let me know if you have questions about it as well or if you have any ideas surrounding formats please feel free to let me know.

Thanks again,

-Brian

itsdfish commented 6 years ago

Hi Brian-

Thanks for keeping me in the loop and for working towards a solution. In terms of format, I think it would be convenient to save the summary in matrix form that could be easily read into a DataFrame in which columns represent the various statistics (e.g. rhat, mean, etc.) and the rows represent parameters. Stan outputs other metadata, which could potentially be stored in a separate csv. Of course, there could be other formats, but I suggested this because it would be easy to read in for post processing.

goedman commented 6 years ago

Hi Brian and Chris,

Thanks for reminding me Chris. I hadn’t completely forgotten, but with all the issues surrounding Pkg3 certainly taken my eyes off the ball.

Brian, looking forward to see StanFeather.jl! Would be a major step forward.

If I understand it well, what Chris is asking for is a bit different. The Stan C++ utility stansummary produces a nice summary (see below for an example). We (CmdStan.jl) gets that as a string. What Chris is asking for is to convert that into a DataFrame and the ability to store it. In addition to StanFeather, which stores the posterior draws in a few formats (and maybe the summary DataFrame as well?).

In https://github.com/goedman/Stan.jl/issues/46 https://github.com/goedman/Stan.jl/issues/46 Chris also points out that the describe summary in Mamba is not always an exact replica of the stansummary report.

My current thinking is to maybe use Tamas Papp’s MCMCDiagnostics.jl as the starting point to construct Chris' summary DataFrame. At least for monitored variables?

As you might have noticed, the last few weeks have been a bit difficult in getting packages using Pkg3 features registered on METADATA.jl. This is very unfortunate. I’ve also struggled to get the documentation updates published as well, but that seems to work again, at least for master/latest versions.

Currently registered on METADATA are non-Pkg3 versions of CmdStan, StanDataFrames, StanMamba (empty shell for now) and StanMCMCChain. After this weekend all master versions of those will be Pkg3-ready again and should work fine by dev-ving the package paths to Julia 0.7, 1.0 or 1.1-dev.

I did add an option to CmdStan to specify output_format=:namedarray which will return a NamedArrays.jl version of a3d. Default will remain :array though.

My (StanJulia’s!) todo list is:

Complete StanMamba once Mamba is available on Julia 1.0 (noticed that Gadfly is close now)
Move all examples to example packages, e.g. StanMCMCChainExamples.jl and StanMambaExamples.jl
Create examples of individual examples as Pkg3 application projects (i.e. just a Project.toml and a Manifest.toml)
A couple of notebook examples.
Update documentation (and the new .travis.yml suggestions to use staged builds).
Study and use the MCMCChain features to update the existing examples in StanMCMCChainExamples.jl (a tiny bit has been done for Bernoulli.jl as a test).
Implement Chris’ request.

In addition to Brian’s StanFeather.jl, several other extensions to StanJulia need further attention. As I mentioned before, I am particularly interested in ideas along the planned update of the ARM book (maybe StanRegression and StanARM as in Andrew’s response below?).

So plenty of work still to do! (Caveat: I also need to spend time on work using finite elements as that needs to be published early next year.)

Rob J Goedman goedman@icloud.com

Inference for Stan model: bernoulli_model
4 chains: each with iter=(600,600,600,600); warmup=(0,0,0,0); thin=(2,2,2,2); 2400 iterations saved.

Warmup took (0.0094, 0.016, 0.014, 0.0097) seconds, 0.049 seconds total
Sampling took (0.020, 0.032, 0.032, 0.030) seconds, 0.11 seconds total

                Mean     MCSE  StdDev    5%   50%   95%  N_Eff  N_Eff/s    R_hat
lp__            -8.2  1.9e-02    0.76  -9.7  -7.9  -7.6   1565    13731  1.0e+00
accept_stat__   0.92  2.5e-03    0.12  0.64  0.97   1.0   2340    20522  1.0e+00
stepsize__      0.99  6.5e-02   0.092  0.89   1.0   1.1    2.0       18  9.4e+13
treedepth__      1.4  1.0e-02    0.49   1.0   1.0   2.0   2214    19421  1.0e+00
n_leapfrog__     2.5  3.9e-02     1.3   1.0   3.0   3.0   1064     9335  1.0e+00
divergent__     0.00  0.0e+00    0.00  0.00  0.00  0.00   2400    21052      nan
energy__         8.7  2.7e-02     1.0   7.7   8.4    11   1538    13488  1.0e+00
theta           0.34  3.3e-03    0.13  0.13  0.33  0.56   1571    13778  1.0e+00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).

On Aug 24, 2018, at 11:18, Andrew Gelman <gelman@stat.columbia.edu> wrote:

Hi, Rob.  Volume 1 (Regression and Other Stories, by Gelman, Hill, and Vehtari) is almost finished and should be released soon, I guess in a few months.  Then we will finish Volume 2 (Advanced Regression and Multilevel Models, by Gelman, Hill, Goodrich, Gabry, and Vehtari), I guess that will get done in 2019.  Thanks for asking.
Andrew

On Sep 14, 2018, at 11:06, fisherc2 notifications@github.com wrote:

Hi Brian-

Thanks for keeping me in the loop and for working towards a solution. In terms of format, I think it would be convenient to save the summary in matrix form that could be easily read into a DataFrame in which columns represent the various statistics (e.g. rhat, mean, etc.) and the rows represent parameters. Stan outputs other metadata, which could potentially be stored in a separate csv. Of course, there could be other formats, but I suggested this because it would be easy to read in for post processing.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/StanJulia/CmdStan.jl/issues/17#issuecomment-421388466, or mute the thread https://github.com/notifications/unsubscribe-auth/AADL6OA7mgFBK10A1R5bBwSKpevoru3Vks5ua8YAgaJpZM4WpXJl.

goedman commented 5 years ago

Chris, I'm closing this issue after creating a new [issue](https://github.com/StanJulia/StanMCMCChain.jl/issues/2#issue-395707003 on StanMCMCChain).

I think describe() gives very similar results. I also noticed issues with save and restore of Mamba Chain objects so I'm implementing a new version in StanMCMCChain.

Rob

itsdfish commented 5 years ago

Sounds good. Thanks for the update.

goedman commented 5 years ago

Chris, long time ago you asked for an option to save the stansummary output. I've added that option now to CmdStan v5.0.3. After running a model you can call summary_df = read_summary(stanmodel, ProjDir) and you'll get a ChainDataFrame. You can query individual fields as is possible in MCMCChains: summary_df[:theta, :ess].

Not sure if you still need/use this, but I needed it anyway.

itsdfish commented 5 years ago

Thanks Rob. I think the ability to save and manipulate summaries will be useful for large scale simulations.

StanJulia / CmdStan.jl

Saving Posterior Summaries Revisited #17