StatisticalRethinking.jl v3: Just a set of methods (such as plotcoef, precis, etc.)?

StatisticalRethinkingJulia / StatisticalRethinking.jl

Julia package with selected functions in the R package `rethinking`. Used in the SR2... projects.

MIT License


StatisticalRethinking.jl v3: Just a set of methods (such as plotcoef, precis, etc.)? #92

Closed goedman closed 3 years ago

goedman commented 4 years ago

In https://github.com/StatisticalRethinkingJulia/TuringModels.jl/pull/23#issuecomment-626664042 I suggested to end up with a StatisticalRethinking.jl to provide just a set of methods (such as plotcoef, precis, plotbounds, link, quap etc which are useful for all 3 ...Models.jl repos) and move the actual use of these methods to the specific mcmc model repositories. In the book these methods are typically explained in the Overthinking sections.

This is a significant change in direction so I wanted to post very early on to get feedback.

Someone who wants to take the StatisticalRethinking course and is interested in Turing would use either TuringModels.jl (for just the models) or a still to be created StatisticalRethinkingTuring.jl. Or these 2 repos could be combined.

Similarly, for Stan it would mean that a lot of the current contents of StatisticalRethinking.jl would move to StanModels.jl (basically all models), while the core course materials would end up in StatisticalRethinkingStan.jl.

Just a thought for now as more and more folks (rightly so!) seem interested in a Turing specific version.

goedman commented 4 years ago

Some initial pros and cons why this might be a good/not so good idea:

Pros:

StatisticalRethinking.jl methods can be reused in all three mcmc model repos (DynamicHMCModels, TuringModels and StanModels). And also in other contexts, e.g. recently I was looking a bit into DiffEqBayes and wished I had plotcoef() easily available. But in its current form, StatisticalRethinking is too big.
From chapter 5 onwards I found I really wanted to hold on to results of several models and ended up adding many model clips, e.g. m5.1.jl etc. This started a line of thought to structure e.g. StanModels.jl in such a way that I can refer to that model with something like include(stan_model_path("..", "models", "05", "01", "s", ".jl")) and with the actual results as m05_01_s in calls such as p = read_samples(m05_01_s, output_format=:particles).
For Turing the "s" would be replaced by a "t", for DynamicHMC with a "d".
Much more debate on how to improve methods such as plotcoef.
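The naming scheme described above (two-digit chapter, two-digit model number, one-letter backend suffix) can be sketched as follows. This is a minimal illustration only: `model_id`, `BACKEND_SUFFIX` and the directory layout are hypothetical and are not part of StanModels.jl or any other package in the thread.

```julia
# Hypothetical helper (not an existing API): derive the variable name and a
# file path for a model from chapter, model number and MCMC backend.
const BACKEND_SUFFIX = Dict(:stan => "s", :turing => "t", :dynamichmc => "d")

function model_id(chapter::Integer, number::Integer, backend::Symbol)
    suffix = BACKEND_SUFFIX[backend]
    ch = lpad(chapter, 2, '0')   # 5 -> "05"
    nr = lpad(number, 2, '0')    # 1 -> "01"
    name = "m$(ch)_$(nr)_$(suffix)"
    # Assumed layout: models/<chapter>/<number>/<suffix>.jl
    path = joinpath("models", ch, nr, suffix * ".jl")
    return (name = name, path = path)
end

id = model_id(5, 1, :stan)
println(id.name)
println(id.path)
```

Calling `model_id(5, 1, :turing)` would give the Turing counterpart of the same model, which is the kind of side-by-side lookup the suffix convention is meant to allow.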

Cons:

Model formulations live outside StatisticalRethinking.jl or e.g. StatisticalRethinkingTuring.jl

karajan9 commented 4 years ago

I'm not quite sure what's yet to come in the book and the full extent of the repo but here are some thoughts, take them with a grain of salt. There seems to be quite a bit of different content:

Models from (for now) three different frameworks
(generally useful) tools for working with the results
Code for working with the results. Since (I think) both Turing and DHMC return MCMCChains at least for those it might be similar.
Code, independent of the models (the DAG stuff, I couldn't find an elegant way to do the posterior predictive checks in chapter 3 and what else have you)
General tips for doing this kind of stuff in Julia, like DataFrames, plotting, etc. (I guess some specifics but mostly links to existing documentation?). For example a line like y .~ Normal.(mu, s) in Turing makes sense to me but without knowledge of broadcasting via dots it must look either like magic or hieroglyphs.
???
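To illustrate the broadcasting point for readers new to Julia, here is a small sketch in plain Julia. `ToyNormal` is a made-up stand-in for `Normal` from Distributions.jl, so the example runs without Turing installed; it only shows what the dot in `Normal.(mu, s)` does.

```julia
# Stand-in for a distribution type; in a Turing model this would be Normal
# from Distributions.jl.
struct ToyNormal
    mu::Float64
    sigma::Float64
end

mu = [0.0, 1.0, 2.0]   # one mean per observation
s = 0.5                # a single shared standard deviation

# The dot applies the constructor element-wise: one ToyNormal per entry of mu,
# with the scalar s reused for every element.
dists = ToyNormal.(mu, s)

# The same thing written as an explicit comprehension.
dists_loop = [ToyNormal(m, s) for m in mu]

println(length(dists))
println(dists == dists_loop)
```

Read this way, `y .~ Normal.(mu, s)` pairs each `y[i]` with its own distribution built from `mu[i]` and the shared `s`.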

All that should make sense from a perspective of

Reusing neat tools elsewhere
Keeping everything maintainable
Have everything easily installable
Make it possible to go through the repo alongside the book without having to search where the next code block might be hidden
???

So I think:

Generally useful tools could go in an extra package independent of StatisticalRethinking with a more fitting name (since they aren't really bound to the book and many more people might find them useful). This package could be imported in StatisticalRethinking and used from there.

From a working-with-the-book perspective having a single repo might be best. That allows just going through it chapter by chapter, depending which framework you like you use model05_01_s or model05_01_t and can easily compare them since they would be right next to each other. That would also answer the question where everything goes: in there. Since that repo is only to be used while working with the book it doesn't really matter if it's heavy. However, it might make everything more messy and unwieldy. It might come down to how much of the repo is models vs. the rest. Having a separate TuringModels would make it easier to point someone without connection to the book there for some inspiration, on the other hand I'm not sure how much good these models can do without the context of the book.

goedman commented 4 years ago

Thanks @karajan9

Many good points in your response, I want to think through the consequences of your suggestions a bit more.

But for now, I will not touch TuringModels without letting you know, so if you want, you can add/improve to that repo and if later on (very likely anyway) we want to refactor it, that's ok as well.

goedman commented 4 years ago

On May 12, 2020, at 16:01, karajan9 wrote:

I'm not quite sure what's yet to come in the book and the full extent of the repo but here are some thoughts, take them with a grain of salt. There seems to be quite a bit of different content:

Models from (for now) three different frameworks

True.

(generally useful) tools for working with the results
Code for working with the results.

Generally useful tools could go in an extra package independent of StatisticalRethinking with a more fitting name (since they aren't really bound to the book if much more people might find them useful). This package could be imported in StatisticalRethinking and used from there.

I have a slight preference to work towards making StatisticalRethinkingJulia mcmc package neutral. I would like to finish chapters 7 and 8 first, but then parallel to your work I could experiment with a new package StatisticalRethinkingStan.jl while storing the models in StanModels.jl.

Although tools like plotcoef, link, precis, etc in the end might be of interest outside the SR context, the ideas did come from Richard (McElreath) and by keeping them in a base package named StatisticalRethinking.jl honors this. Maybe until some of them, once mature and proven widely useful, migrate to other packages like StatsPlots.jl?

Code, independent of the models (the DAG stuff, I couldn't find an elegant way to do the posterior predictive checks in chapter 3 and what else have you)

If I can pull off a reasonable version of StructuralCausalModels.jl I was planning to use that for the DAG stuff (to me a key part of the SR book).

I’ll take a look at the posterior predictive checks in chapter 3, I had the impression the current StatisticalRethinking.jl covered that reasonably well.

General tips for doing this kind of stuff in Julia, like DataFrames, plotting, etc. (I guess some specifics but mostly links to existing documentation?). For example a line like y .~ Normal.(mu, s) in Turing makes sense to me but without knowledge of broadcasting via dots it must look either like magic or hieroglyphs.
???

Since (I think) both Turing and DHMC return MCMCChains at least for those it might be similar.

I think MCMCChains is very useful to check chains. Right now I have not found it particularly easy to use for subsequent work. Turing returns MCMCChains.Chains objects. DynamicHMC returns NamedTuples. For Stan.jl it is optional, I mostly use Particles and DataFrames these days.

Recently a lot of good work has been done on MCMCChains.jl. Once this work settles down and the Array and DataFrame conversion functions are still in there, it provides an easy way to switch to e.g. DataFrames and Arrays.
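As a rough illustration of moving sampler output towards a column layout, the sketch below assumes the draws arrive as a vector of NamedTuples, one per draw. That shape is an assumption made for the example, not a statement about what DynamicHMC or any other package actually returns, and `columns` is a hypothetical helper.

```julia
# Assumed input shape: one NamedTuple per posterior draw.
draws = [(a = 1.0, b = 0.5), (a = 1.2, b = 0.4), (a = 0.9, b = 0.6)]

# Hypothetical helper: turn row-wise draws into one vector per parameter.
function columns(draws::AbstractVector{<:NamedTuple})
    fields = keys(first(draws))
    return NamedTuple{fields}(Tuple([d[f] for d in draws] for f in fields))
end

cols = columns(draws)
println(cols.a)
println(cols.b)
```

A NamedTuple of equal-length vectors like `cols` is the kind of column table that DataFrames-style tooling is built around.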

All that should make sense from a perspective of

Reusing neat tools elsewhere
Keeping everything maintainable
Have everything easily installable
Make it possible to go through the repo alongside the book without having to search where the next code block might be hidden
???

So I think:

From a working-with-the-book perspective having a single repo might be best.

That would also answer the question where everything goes: in there.

Agreed.

That allows just going through it chapter by chapter, depending which framework you like you use model05_01_s or model05_01_t and can easily compare them since they would be right next to each other.

Maybe models should be named as identical as in the book, except we can’t use dots. Thus m5_1t in file m5.1t.jl?

Since that repo is only to be used while working with the book it doesn't really matter if it's heavy.

Agreed as long as we don’t run into CI problems again.

However, it might make everything more messy and unwieldy. It might come down to how much of the repo is models vs. the rest. Having a separate TuringModels would make it easier to point someone without connection to the book there for some inspiration

Time will tell, refactoring is ok and all in the game. But even in that case, storing models in separate files in a single repo would allow that.

... on the other hand I'm not sure how much good these models can do without the context of the book.

True.

One final point, I would really like to keep the StatisticalRethinkingJulia repos a course, particularly with respect to the exercises in the book.

Rob

karajan9 commented 4 years ago

I have a slight preference to work towards making StatisticalRethinkingJulia mcmc package neutral. I would like to finish chapters 7 and 8 first, but then parallel to your work I could experiment with a new package StatisticalRethinkingStan.jl while storing the models in StanModels.jl.

I don't really have an opinion on this yet. I think it depends on how much of the material is MCMC neutral or could have a common interface (DataFrames? Particles?) if it makes sense.

For example, I wasn't aware that quap currently depends on Stan. I wrote a short version that takes in a Turing model to get the NLL and returns the MAP via Optim.jl as well as the hessian, akin to what Richard does. (I'm thinking about making a PR when I find a good place for it.)
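The idea behind such a quap (find the mode of the negative log posterior, then use the curvature there as the inverse variance) can be sketched in one dimension without any packages. This is not the version described above: it swaps Optim.jl for a hand-rolled Newton iteration with finite differences, and `nll`, `d1`, `d2` and `quap_1d` are names made up for the example.

```julia
# Toy data and negative log posterior for y[i] ~ Normal(mu, 1) with a flat
# prior on mu (additive constants dropped).
y = [1.8, 2.1, 2.4, 1.9]
nll(mu) = 0.5 * sum(abs2, y .- mu)

# Central finite differences for the first and second derivative.
d1(f, x; h = 1e-4) = (f(x + h) - f(x - h)) / (2h)
d2(f, x; h = 1e-4) = (f(x + h) - 2f(x) + f(x - h)) / h^2

# Newton iterations locate the mode; the second derivative at the mode is the
# 1-D Hessian, and its inverse is the variance of the Normal approximation.
function quap_1d(f, x0; iters = 20)
    x = x0
    for _ in 1:iters
        x -= d1(f, x) / d2(f, x)
    end
    return (mode = x, sd = 1 / sqrt(d2(f, x)))
end

fit = quap_1d(nll, 0.0)
println(fit)
```

For this toy model the exact answer is known (mode at the sample mean, standard deviation `1 / sqrt(length(y))`), which makes it a convenient sanity check before trying the same idea on a real model with a proper optimizer.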

He also wrote a parser so he can have a math-like DSL instead of relying on Stan. I'm not sure how much motivation I have to do it the same way, considering Turing models already look pretty close (just upside down) and will be needed later on anyway (at least for me). But that would then mean that the Stan/Turing path splits even earlier.

Although tools like plotcoef, link, precis, etc in the end might be of interest outside the SR context, the ideas did come from Richard (McElreath) and by keeping them in a base package named StatisticalRethinking.jl honors this. Maybe until some of them, once mature and proven widely useful, migrate to other packages like StatsPlots.jl?

Hm, I can see that point (although I can't really judge how much of anything he came up with since I have no history in the field). If these methods are widely useful I think at least making them easily accessible (without heavy dependencies like Stan or Turing) would be beneficial. Putting them in a package like StatsPlots might increase reach but I don't think that's a must at all.

If I can pull off a reasonable version of StructuralCausalModels.jl I was planning to use that for the DAG stuff (to me a key part of the SR book).

To be honest, I've gotten a little lost with all the packages related to this. Again, I think I have to invest more time just going through everything.

Recently a lot of good work has been done on MCMCChains.jl. Once this work settles down and the Array and DataFrame conversion functions are still in there, it provides an easy way to switch to e.g. DataFrames and Arrays.

I'm not quite sure what all the alternatives might be but for working with data/samples DataFrames is probably both a safe and good bet.

Maybe models should be named as identical as in the book, except we can’t use dots. Thus m5_1t in file m5.1t.jl?

Just curious, why can't there be any dots?

Agreed as long as we don’t run into CI problems again.

I really don't have any idea about CI (except the basic concept) sooo... I'm just going to follow your lead here.

One final point, I would really like to keep the StatisticalRethinkingJulia repos a course, particularly with respect to the exercises in the book.

What do you mean by that?

goedman commented 4 years ago

I'm not yet at a point where I see a clear value of the parser work. As you have found as well, to 'translate' to Turing (and in fact to Stan) the sequence is reversed. As I like Turing's PPL I considered a translation from that to Stan (somewhat like it is done in DiffEqBayes.jl) but didn't go there at that point in time. With your work that might change.

I'm also still on the fence about quap(). Early on I show a few ways of computing the MAP, none of which I found completely satisfactory.

Mohammed provided me a MAP version for Turing 2 years ago but at that time Turing simply took too long for even the simple models in SR. Now that Turing/AHMC for SR models is mostly on par, I would be very interested in your version. Why not include it in TuringModels.jl's src directory?

For Stan I tried Optim.jl, StanOptimize.jl and Stan's logpdf formulation, but decided to just fit a Normal distribution to Stan samples around the max density. If the value of having a proper quap() would become more clear I might have to fix this.
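The sample-based shortcut mentioned here can be sketched as below. The numbers are invented stand-ins for posterior draws of a single parameter, and `normal_fit` is a hypothetical name; the actual StatisticalRethinking.jl code may select or weight the draws differently.

```julia
using Statistics

# Invented stand-in for posterior draws of one parameter.
samples = [0.9, 1.1, 1.0, 1.2, 0.8, 1.0]

# Summarise the draws by the Normal with matching mean and standard deviation.
normal_fit(x) = (mu = mean(x), sigma = std(x))

fit = normal_fit(samples)
println(fit)
```

Unlike an optimizer-based quap, this needs MCMC draws first, so it describes the samples rather than replacing the sampling step.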

Dots in Julia are for field access, a variable m5.1 will be rejected.
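A quick way to check which of the candidate names are usable as variable names is `Meta.isidentifier` from Base. The restriction applies to identifiers only; a file can still be called `m5.1t.jl`.

```julia
# Check each candidate against Julia's identifier rules.
candidates = ["m5_1t", "m05_01_s", "m5.1"]

for name in candidates
    println(rpad(name, 10), Meta.isidentifier(name))
end
```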

By my final point I mean that the repo should not contain pre-cooked solutions for all exercises. The repo needs to provide tools that help finding solutions without having to understand all ins and outs of e.g. Julia graphic packages, Julia's Distributions.jl, etc.

karajan9 commented 4 years ago

As I like Turing's PPL I considered a translation from that to Stan (somewhat like it is done in DiffEqBayes.jl)

Since I don't know the internals of DiffEqBayes, can you explain what that would mean?

Dots in Julia are for field access, a variable m5.1 will be rejected.

Oh yes, I thought you were talking about file names here.

By my final point I mean that the repo should not contain pre-cooked solutions for all exercises.

I'm not sure what I think about that yet. I agree that students should be incentivized to actually work through the exercises, on the other hand, here they don't have the opportunity to get their results checked or to discuss and compare with classmates like in a real course. I have done the exercises so far, I'm quite confident in some results, for some I'd like to cross-check my solutions but there isn't really a way to do this except asking on Discourse etc. which isn't really ideal either. Maybe a separate repository linked here could be a sensible compromise?

goedman commented 4 years ago

DiffEqBayes contains a function stan_string() that maps (by string interpolation) distributions (from Distributions.jl) to the Stan language equivalents. This only deals with priors and likelihoods, most other stuff needed in a Stan program comes from a template. Richard's R code is quite a number of pages to do the complete job.
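The string-interpolation approach can be sketched as follows. This is not DiffEqBayes code: `NormalPrior`, `UniformPrior` and `to_stan` are hypothetical names, and plain structs stand in for Distributions.jl types so the example has no dependencies.

```julia
# Plain structs standing in for distribution objects.
abstract type ToyPrior end

struct NormalPrior <: ToyPrior
    mu::Float64
    sigma::Float64
end

struct UniformPrior <: ToyPrior
    lo::Float64
    hi::Float64
end

# One method per distribution type, each interpolating its parameters into
# the corresponding Stan expression.
to_stan(d::NormalPrior) = "normal($(d.mu), $(d.sigma))"
to_stan(d::UniformPrior) = "uniform($(d.lo), $(d.hi))"

priors = ["a" => NormalPrior(0, 10), "sigma" => UniformPrior(0, 50)]

# Lines for the model block; the rest of the Stan program would come from a
# fixed template.
model_lines = ["  $name ~ $(to_stan(d));" for (name, d) in priors]
println(join(model_lines, "\n"))
```

Dispatching on the distribution type keeps each mapping to one line, which is why this covers priors and likelihoods cheaply while leaving the rest of the program to a template.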

Let's shelve the exercise discussion for now, do what you would like to do. Once a bit further down the road we might have a better understanding about what is ok.

I still need to come back to your earlier (very useful) remark that you got a bit lost in all the packages!

karajan9 commented 4 years ago

Let's shelve the exercise discussion for now, do what you would like to do. Once a bit further down the road we might have a better understanding about what is ok.

Sounds like a good idea.

I started working through the book in Turing and collected everything here: https://github.com/karajan9/statisticalrethinking

I try to translate pretty much everything so my goal is to use the resulting code to fill the holes in TuringModels. I'm not quite sure yet how that's going to work best since I started from scratch with no regard for what is there already.

Doing that made it clearer to me that splitting up the package or collecting out the generally useful functions in a separate package would probably be a good idea. Currently I would be using StatisticalRethinking just for the precis function (I'm only up to chapter 5 and half of the functions (link or sim) don't work as well with the Turing workflow) and for that I get a ton of dependencies installed, including all things Stan which I'm probably not going to use at all. I'm not sure yet, but it could also happen that there will be less overlap between the Stan, Turing, ... versions because the helper functions need to work differently. quap already got a Turing specific version, sim could work well with the newly released posterior prediction functionality of Turing.

Then again, this is only relevant if the package is intended to be used as a dependency and not exclusively as a "stand-alone package".

I still need to come back to your earlier (very useful) remark that you got a bit lost in all the packages!

As far as I can see right now (again, I'm only up to chapter 5 so things might look different for the later chapters) the map looks like:

StatisticalRethinking: many models in Stan, helper functions and additional translations from the R code with resulting plots
TuringModels: some of the models translated to Turing but not much else
DynamicHMC: similar
StanModels: not sure, is this going to be a TuringModels equivalent for Stan?
StructuralCausalModels: for working with DAGs

I think the main point for me was that for working with Turing I need a non obvious mix of StatisticalRethinking and TuringModels and even then there isn't everything there (just because it hasn't been translated yet but that wasn't clear to me).

goedman commented 4 years ago

Great input @karajan9. Some thoughts:

If you only use precis I would suggest to copy it to TuringModels.
Common functionality could go in a new package, e.g. StatisticalRethinkingCore.jl. That would then become a dependency for other SR packages.
I don't know (yet) if the workflows are sufficiently similar to have useful content for this SRCore.jl package. With quap it's clear that the Turing quap method might be a thin layer, and maybe the same is true for sim, but I don't think there is a reason why e.g. link could not be shared. Time will tell. I definitely don't like the current implementations of sim and link.
I still think it would be nice to have just the models in a separate package to easily allow comparisons between models, use the models from within other packages such as StructuralCausalModels.jl, and maybe make the main package, StatisticalRethinking.jl, at some point mcmc implementation agnostic (based on an ENV setting and/or using Requires.jl?).
Anyway, my main conclusion is that it is very interesting to see where you will end up. It has never been my intention to provide translations for all of the R functionality, just basically to help folks get started with SR in Julia. But all of that might change given your feedback!
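The ENV-based option for choosing the mcmc implementation could look roughly like this. The variable name `SR_BACKEND`, the list of supported values and `select_backend` are all hypothetical; no such switch is described as existing in StatisticalRethinking.jl.

```julia
# Hypothetical set of backends a package-neutral StatisticalRethinking.jl
# might support.
const SUPPORTED_BACKENDS = ("stan", "turing", "dynamichmc")

# Read the (hypothetical) SR_BACKEND setting, defaulting to Stan. The env
# argument defaults to the process environment but accepts any dictionary,
# which keeps the function easy to test.
function select_backend(env = ENV)
    backend = lowercase(get(env, "SR_BACKEND", "stan"))
    backend in SUPPORTED_BACKENDS || error("unsupported backend: $backend")
    return Symbol(backend)
end

println(select_backend(Dict("SR_BACKEND" => "Turing")))
println(select_backend(Dict{String,String}()))
```

A package could branch on the returned symbol at load time to decide which backend-specific code to include, which is also the kind of conditional loading Requires.jl is aimed at.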

goedman commented 4 years ago

For now I'm closing this issue. For now I have decided to not split StatisticalRethinking.jl in a set of common components and move all scripts to a repo like StatisticalRethinkingStan.jl. It is a very breaking change and I'm not sure how substantial that set of common components is.

karajan9 commented 4 years ago

Hi Rob, sorry for taking so long for getting back to you -- I could have sworn I already responded, but alas, here we are.

I think you are right, a common package might make more sense at a later time -- if at all -- when it's more clearly a good idea. If you are now populating StatisticalRethinkingStan, what do you plan to become of StatisticalRethinking?

karajan9 commented 4 years ago

It has never been my intention to provide translations for all of the R functionality, just basically to help folks get started with SR in Julia. But all of that might change given your feedback!

Well, this started because of necessity: if I want to run the models I need most of the other code as well -- why not do the whole thing then. I also think this would be good for people who are not familiar with Julia or don't want to dive as deep down. This gives the same chance as with R to just play around with the existing code.

goedman commented 4 years ago

For now I'm planning to keep StatisticalRethinking.jl up to date/operational, even if it is just for consistency for current users.

Once we figure out if we need a place to store common functions that might be a long term direction for it. Right now StatisticalRethinking.jl is a package. I think for users to run/modify the scripts (and Pluto notebooks) a project that can be copied is a better tool (as in your setup).

If it turns out users like projects like StatisticalRethinkingTuring (basically what you are working on), StatisticalRethinkingStan and maybe others, and we don't need the above common functions repo, we might simply phase it out or e.g. use it for comparisons between Julia mcmc options. Up to us or our users.

karajan9 commented 4 years ago

Sounds like a good plan and the new Turing repo looks great! I'm especially curious about those Pluto notebooks, I haven't got around to trying them out yet. Do you think you are happy with how things are looking at the moment? I'm slowly (very) working on my code but if the approaches are so similar I'd be happy to contribute code here directly.

goedman commented 4 years ago

Well, there are a lot of really nice ideas in your work so I'm (I think) in the tail end of merging those ideas in the 2 mcmc versions.

Couple of the outstanding issues are:

Intro stuff and comparing with R should all move to Pluto notebooks and more interactive.
User should be able to run scripts from both repos in a single REPL session to compare results (e.g. quap() is a good example).
The src directory in StatisticalRethinking.jl v3.0.0 is still a mess. I'll clean that up later this month.
Relationship between precis and Particles (I think Richard also uses precis as a structure).
Only (important?) book figures will be stored in the plotsdir(). Others are either in Pluto notebooks or displayed live.

Still fine tuning all of that.

This weekend I'm hoping both mcmc versions are complete and reformatted up to chapter 3, including several notebooks. You're welcome to contribute wherever you feel comfortable, for example precis() in StatisticalRethinking.jl. But I'm also fine to use my cycles to do that work while you look at chapters 6, 7 and 8.

Do you have a preference how I should refer to you, e.g. in authors, acknowledgements?