Closed storopoli closed 2 years ago
```julia
m = @formula y ~ x; data=my_df
chn = sample(m, NUTS(), 2_000)
```
This macro syntax is not going to work, or at least is very confusing. Maybe `@formula data=df y ~ x`.
Also, using the same name as the GLM formula isn't the best, I think. How about `@modelformula data=df y ~ x` or so?
cc @phipsgabler @cpfiffer
We've done a little bit of this already here: https://github.com/cpfiffer/BayesModels. Could be fairly quickly repurposed/improved/integrated.
Yeah! Nice! I wasn't aware of BayesModels.jl! You've solved almost everything with the `@formula` and `prior`.
I still think that it should return an instantiated model with the data, and the user can do whatever they want with it inside Turing.jl. This is where my proposal differs from BayesModels.jl, but I am not married to the idea.
Two things that I forgot to mention:
{rstanarm} and {brms} are MORE popular than Stan. I currently ask students to present a Bayesian analysis, complete with code and analyses, as a final grade assessment in the Bayesian Statistics course. I could give them the option to translate a Stan case study to TuringGLM.jl (respecting the license, of course), and then I could, with a little polishing, turn it into Turing.jl and TuringGLM.jl case studies.
EDIT: the `blm` function solves the macro syntax issue that @rikhuijzer described.
I was asked to comment on this issue but I'm not sure if I'm the right person to ask - I know the `@formula` style is popular in different research communities and also among people with a background in R, but I don't like it, and hence I don't use it and try to avoid it (even in Julia), which also explains why I haven't used any of the linked R packages :smile:
I agree with @rikhuijzer that it would be confusing to have another `@formula` macro which is different from `StatsModels.@formula` (GLM neither owns nor defines `@formula`). However, I would not use a different macro name; I think the best option, both for maintenance and for users, would be to just use `StatsModels.@formula`. I don't think it's a good idea to create yet another DSL or macro if a well-maintained alternative already exists. E.g., `StatsModels.@formula` can already be combined with data (such as data frames) and can also be extended: https://juliastats.org/StatsModels.jl/stable/internals/
So my main suggestion would be to work with `StatsModels.@formula`, extend it when necessary (e.g., additional Turing-specific terms if needed), and only implement a function that constructs a Turing model from such a `@formula` object and coupled data and is called by the user (i.e., something like `turing_model(@formula(y ~ ....), data, priors)`, similar to the GLM example with `lm` here: https://juliastats.org/GLM.jl/dev/#Categorical-variables). So basically something similar to BayesModels, it seems :smile:
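For concreteness, a rough sketch of how this could look on top of the existing StatsModels machinery (`schema`, `apply_schema`, and `modelcols` are real StatsModels functions; the `turing_model` entry point and `DefaultPriors` are hypothetical):

```julia
using DataFrames, StatsModels

df = DataFrame(y = [1.0, 2.0, 3.0], x = [0.1, 0.2, 0.3])

f = @formula(y ~ 1 + x)
f = apply_schema(f, schema(f, df))   # resolve formula terms against the data
y, X = modelcols(f, df)              # response vector and model matrix

# A hypothetical TuringGLM entry point could then be just:
# model = turing_model(@formula(y ~ 1 + x), df; priors = DefaultPriors())
# chn = sample(model, NUTS(), 2_000)
```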
A completely different approach would be to view the problem that this package would solve as a limited example of a symbolic syntax for PPLs, i.e., as a ModelingToolkit for PPLs. We discussed this before and I think there are great opportunities, also regarding interoperability of different PPLs. Here you would need only a Turing output of the ModelingToolkit representation. The advantage would be that it would be much more general, disadvantages could be that probably it takes more time to set up and design correctly and maybe it would not be helpful for students or users that want to use R syntax.
But in any case I don't recommend creating another different macro/DSL syntax.
@devmotion these are great contributions, thanks!
I also agree with you that the formula syntax is very popular and specific to certain communities (it has been used since S, the predecessor of R, in 1992).
I thought that the `@formula` macro was created by GLM.jl, but it is from StatsModels.jl, and MixedModels.jl also extends it to hierarchical models.
The ModelingToolkit.jl symbolic syntax is also very interesting; it would be easy to build/maintain if we design it as a simple DSL that hands off models to Turing.jl.
From what I've seen, I think most of what we would use from `StatsModels.@formula` is already defined in GLM.jl and MixedModels.jl. I still cannot find an easy way to specify Binomial models from Bernoulli using the `@formula`. Since `Bernoulli(p) == Binomial(1, p)`, we could in some cases gain a significant computational advantage by using a Binomial likelihood instead of a Bernoulli. {brms} and R's `glm` handle this with `trials` and `cbind`, respectively:
```r
# Frequentist GLM, base R
glm(cbind(success, n - success) ~ 1, family = binomial, ...)

# brms Bayesian GLM
brm(success | trials(n) ~ 1, family = binomial, ...)
```
I much prefer the `cbind` syntax over the pipe `|` thing, accepting a matrix of `[success n - success]`, or giving the user the option to pass the `n` trials inside the `Binomial(n)` likelihood.
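To make the computational point concrete: for `n` Bernoulli observations with `k` successes, the Bernoulli and Binomial log-likelihoods differ only by the log binomial coefficient, a constant in `p` that MCMC doesn't care about. A small sketch using Distributions.jl:

```julia
using Distributions

p = 0.3
y = [1, 0, 1, 1, 0]                    # individual Bernoulli outcomes
k, n = sum(y), length(y)

ll_bernoulli = sum(logpdf.(Bernoulli(p), y))
ll_binomial  = logpdf(Binomial(n, p), k)

# The two differ only by log(binomial(n, k)), a constant in p, so a
# Binomial likelihood on aggregated counts yields the same posterior
# with n-times fewer likelihood terms:
ll_binomial - ll_bernoulli ≈ log(binomial(n, k))   # true
```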
Awesome project! I thought about something like this recently but didn't continue due to my time being limited... I somehow got a start on coupling MixedModels.jl and their formula extensions with Turing here (note that I'm not a modelling person, the code might not be a good implementation...).
The "problems" I hit there were:
- MixedModels.jl defines what the `|` terms mean and the model matrix functions, but when using the package you also depend on the whole optimization machinery that you don't want when using Turing instead.
- I was really keen on doing this with the new condition/decondition functionality of DPPL, not arguments, and tried to figure out a way of letting the user define a model in formula style without yet being conditioned on observations (e.g., if you just want to simulate the outcome). But I couldn't find a nice design for that yet -- that's why you see those `@formula 0 ~ ...` things.
What if we specify it inside the `turing_model`:

```julia
turing_model(@formula(y ~ ....), data; priors=DefaultPriors(), prior_only=false)
# or
turing_model(@formula(y ~ ....), data; priors=DefaultPriors(), likelihood=true)
```
Well @storopoli, I don't have much prior experience with this kind of modelling, but what I meant is the following: say I want to do some simulation. `data` contains only columns of the independent variables `a`, `b`. Then I'd like to instantiate something like a generative model

```julia
mymodel = turing_model(@formula(outcome ~ a + b), data; ...)
```

and sample `outcome` from that (which is what `simulate` in my code does). This should be the joint over `outcome` and the regression parameters.
The problem is that there is no simple way of including the name of `outcome` in the formula, because the model matrix instantiation then believes it should be in `data` and wants to include it. (You can strip it off the formula object, but then you lose it...)
Only later I'd like to be able to say

```julia
chain = sample(mymodel | (; outcome), ...)
```

to sample from the posterior given an actual value for the dependent variable.
The formula syntax isn't made for the use case of specifying "only a right hand side" -- but that would be more elegant from a PPL design perspective, since parameter inference is only the special case of the conditioned model.
I understand, but I do not find that people who use the formula syntax use that often. In fact, I have never seen what you are trying to do done with a formula approach. The formula syntax has the preconception that all variables are defined in a data structure `data`.
Alright, I'm probably a very atypical case, so you can ignore the point :). The formula style would always construct a conditioned model and require the outcomes to be present in the data, right?
To generate simulated outcomes, you could then always first instantiate a conditioned model with unused "dummy" outcomes, and then `decondition` that and sample.
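A hedged sketch of this condition/decondition idea with DynamicPPL's conditioning API (assuming a recent Turing/DynamicPPL version; `X`, the model name, and the priors are placeholders):

```julia
using Turing, LinearAlgebra

@model function regression(X)
    β ~ MvNormal(zeros(size(X, 2)), I)
    outcome ~ MvNormal(X * β, I)
end

X = randn(10, 2)
m = regression(X)    # joint over β AND outcome: calling m() draws from
                     # the prior, including a simulated outcome

# Condition on observed outcomes only when they exist:
y = X * [1.0, -1.0] .+ 0.1 .* randn(10)
m_cond = m | (; outcome = y)
# chain = sample(m_cond, NUTS(), 1_000)
```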
We could include something like we currently do in Turing for posterior and prior predictive checks: if the lhs (i.e. the `y` in `y ~ ...`) is `missing` (the instantiated type), we could simulate the outcome. What about that?
I've never used brms, bambi, or really any other regression package, so I can't offer experienced input. But I do have some initial impressions:

- I agree with @devmotion that it would be better to use existing `@formula` macros, and if they're too limiting, then work to try to extend them. Using Symbolics/ModelingToolkit is interesting, but I'm not certain what it gets us unless there's a use for symbolic manipulation/simplification/rewriting of the formulas.
- What Julia usually uniquely brings to the table is composability of other packages. Are there unique opportunities for composability in regression models? Are there regression models that are hard to code in brms/bambi that would be easier with Julia?
- Perhaps one answer to the above is that `turing_model` converting a formula to a `DynamicPPL.Model` is convenient and modular. From that perspective, why be Turing-specific? One could also imagine a similar `soss_model` converter.
- One complication might be that different choices of model implementation might be preferred depending on the autodiff engine being used. This seems like something that can be considered later.

@sethaxen thanks for the feedback!
> I agree with @devmotion that it would be better to use existing `@formula` macros, and if they're too limiting, then work to try to extend them. Using Symbolics/ModelingToolkit is interesting, but I'm not certain what it gets us unless there's a use for symbolic manipulation/simplification/rewriting of the formulas.
Yes, I agree with you both. From what I was playing around with over the weekend, we can easily extend `StatsModels.@formula` by adding it as a dependency. The hierarchical random-effects `(1 | group)` stuff is extended by MixedModels.jl, but that package has a huge dependency footprint. So this I (and anybody who wants to collaborate) would have to extend/adapt.
The Symbolics/ModelingToolkit idea, I think, would not be a major priority right now, but we can keep it on the roadmap.
> What Julia usually uniquely brings to the table is composability of other packages. Are there unique opportunities for composability in regression models? Are there regression models that are hard to code in brms/bambi that would be easier with Julia?
This is a great provocation and invitation for insight. I honestly don't know. The main focus now is to create an easy API for users coming from R/Stats background to migrate to Julia and Turing. But we'll definitely explore this avenue in the near future.
> Perhaps one answer to the above is that `turing_model` converting a formula to a `DynamicPPL.Model` is convenient and modular. From that perspective, why be Turing-specific? One could also imagine a similar `soss_model` converter.
This also addresses number 4. I cannot speak for Soss.jl, since I am not knowledgeable about it and haven't used it yet. But from what I reckon, it does use AdvancedHMC.jl and Turing samplers as a backend (or am I wrong?). So it should be easy to create a SossGLM.jl sister package and add the `soss_model` function. Much of what we will learn in TuringGLM.jl will be applicable/transferable to SossGLM.jl.
> One complication might be that different choices of model implementation might be preferred depending on the autodiff engine being used. This seems like something that can be considered later.
We will favor, for now, `NUTS()` and ForwardDiff.jl (Turing's default). Advanced users who need different AD backends or samplers will be advised to go straight to the Turing.jl ecosystem.
For Seth's remarks 4 and 5: that's a good point. It depends on how the "backend" PPLs are used.
Is the plan to create a separate model for each formula (`@formula(y ~ a + b)` -> `generated_model(y, a, b)`, with `a`, `b` plugged in from `data`), or to only bridge data frames to model matrices via formula application (`fixed_model(y, X)` with `X = modelcols(formula, data)`)?
In the latter case (which I think is preferable), you don't need to create a model for each formula, but have one sufficiently flexible model per "regression type". In which case I think crossing PPLs is easier, since the only contribution required is a good implementation of the fixed model, not code generation.
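In that latter design, the sketch could look something like this: one generic model per "regression type" that any formula reduces to. The `fixed_model` name is taken from the discussion above, the priors are placeholders, and `apply_schema`/`modelcols` are the real StatsModels functions:

```julia
using Turing, StatsModels, LinearAlgebra

# One generic model per "regression type": every formula reduces to (y, X),
# so no per-formula code generation is needed.
@model function fixed_model(y, X)
    α ~ Normal(0, 5)                       # placeholder priors
    β ~ filldist(Normal(0, 2), size(X, 2))
    σ ~ Exponential(1)
    y ~ MvNormal(α .+ X * β, σ^2 * I)
end

# Bridging from a formula and any Tables.jl-compatible data:
# f = apply_schema(@formula(y ~ 1 + x), schema(@formula(y ~ 1 + x), data))
# y, X = modelcols(f, data)
# chain = sample(fixed_model(y, X), NUTS(), 1_000)
```

Crossing PPLs then only requires a good implementation of the fixed model in each PPL, not formula-driven code generation.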
> Yes, I agree with you both. From what I was playing around with over the weekend, we can easily extend `StatsModels.@formula` by adding it as a dependency. The hierarchical random-effects `(1 | group)` stuff is extended by MixedModels.jl, but that package has a huge dependency footprint. So this I (and anybody who wants to collaborate) would have to extend/adapt.
Indeed, the packages extending StatsModels don't tend to separate syntax extension and implementation cleanly. That's something that would be good to refactor, and probably rather tedious. I only looked at MixedModels, and couldn't easily figure out which parts would be necessary to extract; they also make some very peculiar assumptions about formula application.
There's a fundamental reason why this is complicated: the StatsModels formula machinery is designed in such a way that formula application (i.e., creation of model matrices) is always tied to model implementation at later stages (semantics time), because at some point you have to decide how to set up the model matrix.
If it were only MixedModels.jl to adapt, it's probably doable, but I'm not familiar enough with that ecosystem to know how much else there is (polynomial regression? splines?). Is it feasible for them to move from one package doing "mixed-effects regression with syntax and implementation" to "mixed-effects syntax and meaning (`apply_schema`)" plus multiple implementations?
I think that splines and polynomials should be on the roadmap. I still need to deeply inspect the `apply_schema` and `@formula` API. From what I saw, extracting the `terms` and reading the data using Tables.jl is "doable".
I will try to create an initial release with the feature list detailed above, using just the `@formula` from StatsModels and manually extending the hierarchical terms. Polynomials and splines could be added by the user through a transformation on the underlying `data` (i.e. `DataFrames.transform`, or by applying a vectorized function to a specific Tuple/Array) before it is passed to `@formula`.
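For example, a polynomial term could be pre-computed with DataFrames before the formula ever sees it (a sketch; column names and data are arbitrary):

```julia
using DataFrames, StatsModels

df = DataFrame(x = 1.0:5.0, y = [2.1, 4.0, 9.2, 15.8, 25.5])
transform!(df, :x => ByRow(x -> x^2) => :x2)   # add the squared term by hand

f = @formula(y ~ x + x2)   # the formula just sees another column
```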
Regarding Tables.jl, this reminds me of the point "Close interface with DataFrames.jl and CategoricalArrays.jl" in the feature list above. My suggestion would be to avoid any explicit dependency on DataFrames if possible and just support the Tables.jl interface (as StatsModels and `@formula` apparently already do).
Yes, @devmotion, that is exactly what I am looking for.
I am depending on CategoricalArrays.jl, though, to parse `categorical` vectors and, most importantly, to get the reference dummy variable from the `levels` output. Is CategoricalArrays.jl an OK dependency for that?
StatsModels removed the dependency on CategoricalArrays two years ago (https://github.com/JuliaStats/StatsModels.jl/pull/157) and replaced it with the DataAPI interface. For instance, `levels` is defined in DataAPI, so depending on your use case you might be able to avoid CategoricalArrays entirely. DataAPI is also supported by other array types such as https://github.com/JuliaData/PooledArrays.jl, which could perhaps be supported automatically if one uses the DataAPI interface. But I would not worry too much about CategoricalArrays if DataAPI is not sufficient; it is a much smaller dependency than DataFrames.
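For instance, a sketch of getting the baseline level through DataAPI alone (PooledArrays implements the DataAPI functions):

```julia
using PooledArrays, DataAPI

v = PooledArray(["b", "a", "b", "c"])
lvls = DataAPI.levels(v)    # sorted unique levels: ["a", "b", "c"]
baseline = first(lvls)      # first level as the dummy-coding baseline
```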
Great! That's all I need then! DataAPI.jl and Tables.jl.
I will close this issue. I think I have everything for now.
Thank you all!
Hello, all!
I was speaking with @yebai about how Turing could support the whole Bayesian workflow, following Bob Carpenter's presentation at ProbProg 2021 in October 2021. And I gave him an idea about how an easy interface for Turing could attract more users and be a great transition for users coming from a stats and/or R background. I am deeply interested in this because I mainly lecture graduate students in the applied social sciences, and if I showed them either Stan code or Turing model code they would run away. Thus, I use the friendly RStan interface {brms} that uses the formula syntax (as you can see in this still ongoing course YouTube playlist). I also use Turing a lot in my research, and most of my models are hierarchical/generalized linear models that could be easily specified by a formula. Thus, I also find myself having to copy and paste a LOT of boilerplate code to define a model.
I suggested this package and the following objectives, initial release, API design and roadmap. We discussed them and would really love feedback.
I am also committed to long-term maintenance/development of TuringGLM, since this would solve all my lecturer and researcher needs. I am tired of having students complain that they cannot install {brms} because of RStan Windows issues. And I would also move all my lecturing and research to Julia. Furthermore, I think this package has a great opportunity to attract users to Julia/Turing/Bayesian statistics, which I am all up for.

## Objective of the Package
Uses `@formula` syntax to specify several different (generalized) (hierarchical) linear models in Turing.jl. It is defined by StatsModels.jl and can be easily extended to GLMs (see GLM.jl) and to hierarchical models (see MixedModels.jl).
Heavily inspired by {brms} (uses RStan or CmdStanR), bambi (uses PyMC3) and StatsModels.jl.
My idea is just to create syntactic sugar for model definition with a `@formula` macro that translates it to an instantiated Turing `@model` along with the data. The user just needs to call `sample`, `advi` or another future Turing function (e.g. `pathfinder`) on the model returned from `@formula` and he is good to go. So this package would be easy to maintain and just focus on what's important: the `@formula` -> `@model` translation.

## Discussion Points
- `@formula` syntax: `turing_model` will return an instantiated Turing model with the `data`, so the user just has to call `sample` or `advi` etc. on the model. Example: something like `chn = sample(turing_model(@formula(y ~ x), data), NUTS(), 2_000)`
- How should the user specify custom priors?

## Feature list for `0.1.0` release (public repo)

- Close interface with DataFrames.jl and CategoricalArrays.jl
- `length(unique(x))` of a group-level intercept/slope
- Parse `categorical` vectors by reading the `levels` and using the first level as baseline
- `Gaussian` and `TDist` (`IdentityLink`)
- `Bernoulli`, `Binomial` (how would we accept a matrix of `[success n - success]`?) and `Geometric` (`LogitLink`)
- `Poisson` and `NegativeBinomial` (`LogLink`)
- (`Distributions.jl` dependent): `return` statement
- `(1 | group)`
- `(x_1 | group)`
- `(1 + x_1 | group)` or `(1 | group) + (x1 | group)`
- `LKJ()` (OBS: fast implementation depends on TuringLang/Turing.jl/issues/1629) ???
- `@formula`

## Roadmap

- `LogNormal`, `Gamma`, `Exponential` and `Weibull`
- `ZeroInflatedPoisson` and `ZeroInflatedNegativeBinomial`
- `ZeroInflatedBinomial` and `ZeroInflatedBeta`
- `f1 = @formula y1 ~ x1` and `f2 = @formula y2 ~ x2`
- `AR(p)`: Autoregressive
- `MA(q)`: Moving Average
- `ARMA(p, q)`: Autoregressive Moving Average
- `Pathfinder.jl`?
- `Bijectors.jl`