bambinos / bambi

BAyesian Model-Building Interface (Bambi) in Python.
https://bambinos.github.io/bambi/
MIT License

Design how we're going to extend Bambi #544

Open tomicapretto opened 2 years ago

tomicapretto commented 2 years ago

The following is a list of features we're missing (or covering only partially) in Bambi

The last three points (survival/censored, ordinal, and zero/zero-one inflated) are covered by the first two points (distributional and multivariate) if we implement them appropriately. The third point, non-linear models, is a separate problem. I'll add a couple of things I've been thinking about lately.


Distributional models

Some API proposals

formula = bmb.formula(
    "y ~ a + b",
    "sigma ~ a",
)
priors = {
    "a": bmb.Prior("Normal", mu=0, sigma=1),
    "b": bmb.Prior("Normal", mu=0, sigma=1),
    "sigma_Intercept": bmb.Prior("Normal", mu=0, sigma=1),
    "sigma_a": bmb.Prior("Normal", mu=0, sigma=1)
}
link = {"mu": "identity", "sigma": "log"}
model = bmb.Model(formula, data, priors, link)

I haven't thought much more about the implementation details, where other concerns may appear. For the moment, I think it's good to discuss the API we want. Any objections, suggestions, or drawbacks I'm not seeing?
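As a rough illustration of what the proposal above implies (this is a hypothetical sketch, not Bambi's implementation), each distributional parameter gets its own linear predictor, and the inverse of its link function maps that predictor onto the parameter's natural scale:

```python
import numpy as np

# Hypothetical sketch: map one linear predictor per distributional
# parameter onto the parameter's natural scale via its inverse link.
INVERSE_LINKS = {
    "identity": lambda eta: eta,
    "log": lambda eta: np.exp(eta),  # inverse of the log link
}

def build_params(design, coefs, links):
    """Compute eta = X @ beta per parameter and apply the inverse link."""
    params = {}
    for name, X in design.items():
        eta = X @ coefs[name]
        params[name] = INVERSE_LINKS[links[name]](eta)
    return params

# Toy data: columns are (Intercept, a, b) for mu and (Intercept, a) for sigma
design = {
    "mu": np.array([[1.0, 0.5, 2.0], [1.0, -1.0, 0.0]]),
    "sigma": np.array([[1.0, 0.5], [1.0, -1.0]]),
}
coefs = {"mu": np.array([0.0, 1.0, 1.0]), "sigma": np.array([0.0, 1.0])}
params = build_params(design, coefs, {"mu": "identity", "sigma": "log"})
# The log link keeps sigma strictly positive regardless of eta's sign.
```
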

Multivariate models

We currently support some multivariate families, such as "categorical" and "multinomial". I feel we should think more about the implementation: we could make it more general so we don't need to treat every family as a special case. That said, there are other things to discuss.

"c(y1, y2, ..., yn) ~ ..."
"mvbind(y1, y2, ..., yn) ~ ..."
bmb.formula("y1 ~ ...", "y2 ~ ...", "y3 ~ ...")

Note that the last alternative allows different predictors to be included for each response.
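A minimal sketch of what the third alternative could look like internally (names here are made up, not a proposal for the final API): a container that keeps one formula per response, so each response can carry its own predictors.

```python
# Hypothetical sketch: a Formula container holding one model formula per
# response, so each response can have its own set of predictors.
class Formula:
    def __init__(self, *formulas):
        self.formulas = formulas
        # The left-hand side of each formula names its response
        self.responses = [f.split("~")[0].strip() for f in formulas]

f = Formula("y1 ~ x1 + x2", "y2 ~ x2", "y3 ~ x1 + x3")
# f.responses -> ['y1', 'y2', 'y3']
```
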

I'm not an expert in this area but I have the feeling that things can get very complex very quickly. And I'm not sure if this is a highly required feature.

For now, I tend to think we should have minimal support that allows people (and us) to explore the available possibilities and refine the API.

Non-linear models

This has been discussed a little in #448. I think it's a very nice-to-have feature, but I don't have it solved in my mind yet. The only things I have are some API proposals, and I don't see how to implement them without a huge effort.

First:

formula = bmb.formula(
    "y ~ b1 * np.exp(b2 * x)",
    nlpars=("b1", "b2")    
)

But this comes with a major problem: how do we override the meaning of the * operator in the formula syntax? If we pass something like that to formulae, it won't multiply things by b1 or b2; it will try to construct full interaction terms between the operands. I like how this approach looks, but it would require a huge amount of effort to parse terms and parameters.
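To make the parsing problem concrete, here is a rough standalone sketch (not a proposal for the final implementation): walk a non-linear expression with Python's ast module and classify bare names as either non-linear parameters (declared via nlpars) or data variables.

```python
import ast

def split_names(expr, nlpars):
    """Classify bare names in `expr` as nlpars vs. data variables."""
    tree = ast.parse(expr, mode="eval")
    names, attr_roots = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            names.add(node.id)
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            attr_roots.add(node.value.id)  # e.g. `np` in np.exp
    names -= attr_roots  # drop module prefixes like np
    params = sorted(names & set(nlpars))
    predictors = sorted(names - set(nlpars))
    return params, predictors

params, predictors = split_names("b1 * np.exp(b2 * x)", nlpars=("b1", "b2"))
# params -> ['b1', 'b2'], predictors -> ['x']
```

This only identifies which names play which role; recovering the full functional relationship so it can be rebuilt on the PyMC side is the harder part.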

Another alternative would be to use a function.

def f(x, b1, b2):
    return b1 * np.exp(b2 * x)

formula = bmb.formula(
    "y ~ f(x, b1, b2)",
    nlpars=("b1", "b2")    
)

This would work on the formulae side, but again we would need to do some parsing to recover the non-linear relationship between the parameters (b1 and b2) and the predictor x. How do we handle arbitrarily complex functions? I'm not sure.

Survival models/Models with censored data.

#543 adds support for survival analysis with right-censored data. One drawback of the proposal is that family="exponential" always implies right-censored data. I think we should have something more general.

I imagine all the following cases working

bmb.Model("y ~ ...", data, family="exponential")
bmb.Model("censored(y, status) ~ ...", data, family="exponential")
bmb.Model("censored(y, status, 'left') ~ ...", data, family="exponential")

The challenge is that censored() should be a function that returns an array-like structure (so formulae knows how to handle it) with some attribute that lets Bambi figure out the characteristics of the censoring. I'm not sure how to implement this, but I know it's feasible.
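One way this could work (a hypothetical sketch, not the eventual Bambi implementation) is a numpy ndarray subclass: formulae sees a plain array-like, while the censoring metadata rides along in extra attributes Bambi can inspect.

```python
import numpy as np

class CensoredArray(np.ndarray):
    """An array of response values carrying censoring metadata."""

    def __new__(cls, values, status, kind="right"):
        obj = np.asarray(values, dtype=float).view(cls)
        obj.status = np.asarray(status)  # per-observation censoring indicator
        obj.kind = kind                  # direction of censoring
        return obj

    def __array_finalize__(self, obj):
        # Propagate metadata through slicing and ufuncs
        if obj is None:
            return
        self.status = getattr(obj, "status", None)
        self.kind = getattr(obj, "kind", "right")

def censored(y, status, kind="right"):
    if kind not in ("left", "right", "interval"):
        raise ValueError(f"Unknown censoring kind: {kind}")
    return CensoredArray(y, status, kind)

y = censored([2.0, 5.0, 7.0], [0, 0, 1])
# y behaves like a numpy array but keeps .status and .kind for Bambi to read
```
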

Ordinal models and Zero and Zero-One inflated models.

I think these ones come almost for free if we do a good job with the tasks above.

canyon289 commented 2 years ago

IMO, given the effort it would take, non-linear models should be the lowest priority.

tomicapretto commented 2 years ago

IMO, given the effort it would take, non-linear models should be the lowest priority.

I agree. It would be a lot of work without knowing if it will work.

aflaxman commented 2 years ago

IMO, given the effort it would take, non-linear models should be the lowest priority.

I agree. It would be a lot of work without knowing if it will work.

One route to nonlinear models (and perhaps some of the other items on your exciting list) would be to make a "transcompiler" similar to the request in #539 ; I could imagine using this in an exploratory research workflow to start with a linear model, generate python code for it, and then iteratively modifying the autogenerated code to include nonlinear elements.

aloctavodia commented 2 years ago

I wonder if defining the non-linear function in Aesara could help us get the necessary information in a general way (@ricardoV94, @rlouf). For the record, I tried to solve #539 by inspecting the PyMC model, but then I abandoned that route and moved to using the structural information already available in Bambi (still a WIP).

rlouf commented 2 years ago

You can definitely analyze any aesara graph a user provides and see if it matches a certain pattern. Maybe if I knew better how Bambi works, I could tell you whether it would be useful in this case.

canyon289 commented 2 years ago

Other than the technical implementation, frankly I don't think it'll be all that useful, and there isn't a huge user base for it. If people want non-linear models they can just use PyMC to code them up.

The other use cases, IMO, are much easier to implement in Bambi and will have a wider user base.

don-jil commented 2 years ago

I like your proposed API for censored data. Very clean.

tomicapretto commented 1 year ago

I have new ideas for distributional models

Instead of this

formula = bmb.formula(
    "y ~ a + b",
    "sigma ~ a",
)
priors = {
    "a": bmb.Prior("Normal", mu=0, sigma=1),
    "b": bmb.Prior("Normal", mu=0, sigma=1),
    "sigma_Intercept": bmb.Prior("Normal", mu=0, sigma=1),
    "sigma_a": bmb.Prior("Normal", mu=0, sigma=1)
}
link = {"mu": "identity", "sigma": "log"}
model = bmb.Model(formula, data, priors, link)

have this (notice the priors)

formula = bmb.formula(
    "y ~ a + b",
    "sigma ~ a",
)
priors = {
    "y": {
        "a": bmb.Prior("Normal", mu=0, sigma=1),
        "b": bmb.Prior("Normal", mu=0, sigma=1),
    },
    "sigma": {
        "Intercept": bmb.Prior("Normal", mu=0, sigma=1),
        "a": bmb.Prior("Normal", mu=0, sigma=1)
    }
}
link = {"mu": "identity", "sigma": "log"}
model = bmb.Model(formula, data, priors, link)

It adds more structure and prevents us from having to parse strings to decide which response a prior corresponds to. Also, "_" is very common in variable names, so it's highly likely we'd get the split wrong.

This does open new questions though. For example

zwelitunyiswa commented 1 year ago

Nice. I like that structure - very clear. Distributional models are a very cool addition!


tomwallis commented 1 year ago

Other than the technical implementation, frankly I don't think it'll be all that useful, and there isn't a huge user base for it. If people want non-linear models they can just use PyMC to code them up.

The other use cases, IMO, are much easier to implement in Bambi and will have a wider user base.

I would like to chime in on the nonlinear models front. At least for people in my research field this would be a very welcome feature. Computational cognitive models are often nonlinear (and sometimes distributional). Popular examples of the latter are things like drift diffusion models, but I most commonly work with univariate psychometric functions of the form

$$\Psi(x; m, w, \gamma, \lambda) = \gamma + (1 - \gamma - \lambda) S(x; m, w)$$

where $S$ is a sigmoidal function like a cumulative Gaussian, $m$ and $w$ are "threshold" and "width" parameters (location and scale), and $\gamma$ and $\lambda$ "squish" the function between upper and lower bounds between zero and one. $\Psi$ is the probability of a response (e.g. a correct response) as a function of a stimulus level $x$, usually through a binomial distribution. Note that the model is nonlinear because $\gamma$ and $\lambda$ can be free parameters; if these were both fixed to 0, this would boil down to a logistic / probit regression. These models are also closely related to models from item response theory.
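As a concrete sketch of the function above (hypothetical code, using a logistic sigmoid for $S$ rather than a cumulative Gaussian to keep it dependency-free):

```python
import numpy as np

def psi(x, m, w, gamma, lam):
    """Psychometric function: gamma + (1 - gamma - lam) * S(x; m, w)."""
    s = 1.0 / (1.0 + np.exp(-(x - m) / w))   # sigmoidal S(x; m, w)
    return gamma + (1.0 - gamma - lam) * s   # squish between the bounds

x = np.linspace(-3, 3, 7)
p = psi(x, m=0.0, w=1.0, gamma=0.5, lam=0.02)
# With gamma = 0.5 (chance level in a 2AFC task), the curve runs from 0.5
# up to 1 - lam = 0.98; with gamma = lam = 0 it reduces to plain logistic
# regression, which is why fixing them recovers the GLM case.
```
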

While it's true that one can build these models by hand in PyMC, the convenience of a wrapper package like bambi is to be able to quickly build and test versions of the models with different LMM structures on the different nonlinear parameters, without handling all the sampler optimisation tricks and interpretation / plot wrappers every time. This is a great boon for fast and robust model development and comparison, as well as being able to have a relatively gentle tool for students to learn.

I've previously used brms for this work, but being an almost-pure-python user nowadays I'd love to have something similar available. While I don't know the backend of Bambi, I do like the API for nonlinear models in brms. I'm not sure if these examples would help inspire design ideas.

formula = bmb.formula(
    "y ~ b1 * np.exp(b2 * x)",
    nlpars=("b1", "b2")    
)

But this comes with a major problem, how do we override the meaning of the * operator in the formula syntax? If we pass something like that to formulae, it won't multiply things by b1 or b2, it will try to construct full interaction terms between the operands.

The brms workaround for this is basically an nl = TRUE flag in the appropriate place (or here, perhaps an if nlpars is not None check) that changes the parsing logic (i.e., treats * as a multiplication operator). The (G)LMMs could then be built on top of each nonlinear parameter as extra arguments:

formula = bmb.formula(
    "y ~ b1 * np.exp(b2 * x)",
    "b1 ~ 1 + (1 | g1)",
    "b2 ~ 1 + (1 | g2/g1)",
    nlpars=("b1", "b2"),
)

Please let me know if this would be of further interest.

tomicapretto commented 1 year ago

@tomwallis first of all thanks a lot for chiming in!

We're really close to having distributional models! I need some extra time to polish the details in https://github.com/bambinos/bambi/pull/607, and then we're done :)

For the non-linear part, I still anticipate non-trivial problems. I'm not saying it's impossible at all, though. I'm already aware of the syntax for specifying non-linear parameters in brms, but I couldn't study the details of how they are handled internally. I suspect it's easier to manipulate the expression in R because of the language's meta-programming capabilities; in Python, we would need to build more machinery to do the same.

For example, let's say we have the expression b1 * np.exp(b2 * x) + np.log(c * y). We need machinery that tells formulae that the part going into design matrix creation is only x + y, while also storing a reference to the full expression so we can handle the priors and the PyMC computation. Doing this in a general way is not trivial.

But anyway, thanks a lot for your input. I'm really happy to see there are more people interested in this feature. I'll ping you when I start working on it, if you want :)

tomicapretto commented 1 year ago

To get nonlinear terms working, I think we need to do the following:

  1. Extract individual terms, considering only the sum + as a splitting operator DONE https://gist.github.com/tomicapretto/b938614811774b8ab4d9127bdb92e138
  2. Detect which terms use non-linear parameters TO DO
  3. Manipulate the expression in terms where non-linear parameters play a role so that only the 'variable'/'predictor' is passed to formulae, while keeping the expression somewhere so it can be used later (e.g. when adding the non-linear term to the PyMC model) TO DO

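Step 1 can be sketched with Python's ast module (a standalone approximation of what the linked gist does, not Bambi's actual code): recurse through the parsed expression, but only through additions at the top level, so everything else stays glued together as a term.

```python
import ast

def split_terms(expr):
    """Split `expr` into terms, treating only top-level + as a separator."""
    tree = ast.parse(expr, mode="eval").body
    terms = []

    def collect(node):
        # Recurse only through top-level additions; anything else is a term
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            collect(node.left)
            collect(node.right)
        else:
            terms.append(ast.unparse(node))

    collect(tree)
    return terms

terms = split_terms("b1 * np.exp(b2 * x) + np.log(c * y)")
# -> ['b1 * np.exp(b2 * x)', 'np.log(c * y)']
```
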
tomwallis commented 1 year ago

Thanks @tomicapretto for your positive reply! Yes, please ping me back when you start to invest time.

paul-buerkner commented 1 year ago

These look like nice features indeed! If you have any questions about brms' internal structure and procedures of representing some of the more advanced models, please let me know.

tomicapretto commented 1 year ago

@paul-buerkner definitely, I would love that!

How would you prefer to do it? I could...

I think that for specific technical questions an issue may be a good alternative (since it's also public). But if at some point the conversation turns to other broader topics such as design choices, we could have a meeting.

What do you think?

paul-buerkner commented 1 year ago

Opening issues here in the bambi repo and tagging me sounds good. And I agree we can have a meeting at some point if we feel it could be useful.

tomicapretto commented 1 year ago

@paul-buerkner here's my first question :)

What's the difference between a "distributional model" and a "GAMLSS" (generalized additive model for location, scale and shape)? I took the name "distributional models" from brms. As far as I understand, they are very close to GAMLSS; perhaps the difference is in the GAM part, and a distributional model doesn't necessarily imply GAMs? Or is there another more significant difference?

Thanks!

paul-buerkner commented 1 year ago

The two terms mean essentially the same thing in practical use, but "distributional models" is somewhat more general. brms' distributional models can do more flexible stuff than "just" GAMs for multiple distributional parameters (e.g., explicit nonlinear models), hence I prefer that term.


tomicapretto commented 1 year ago

Regarding naming conventions for terms in distributional models, someone pointed out to me via e-mail that {param}__{term} may be a better alternative than {param}_{term}. I'm citing the arguments:

"It might sound negligible, but having it as a convention can be very powerful in the future. As evidence, see what the scikit-learn team did with model names and their parameters in Pipeline objects. The problem is Python's convention of naming variables with single underscores: there's no way to tell where param ends and term starts. Using (the rarer) dunders mostly solves that."

My position for the moment is: "I understand the point, and while I don't have a strong opinion about the existing solution, I don't see it as a very large problem either."

I'm sharing the exchange here so anyone can weigh in with their thoughts.
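A quick illustration of the argument (the term names below are made up): with a double-underscore separator, the parameter/term split is unambiguous even when the term name itself contains an underscore.

```python
def split_dunder(name):
    """Split a '{param}__{term}' name at the first double underscore."""
    param, _, term = name.partition("__")
    return param, term

# "sigma__log_income" splits cleanly into parameter and term...
pair = split_dunder("sigma__log_income")  # ('sigma', 'log_income')
# ...whereas the flat "sigma_log_income" with a single "_" separator is
# ambiguous: the parameter could be read as "sigma" or as "sigma_log".
```
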


tomicapretto commented 1 year ago

@paul-buerkner I have some extra questions, now about GAMs :)

Let's take the following example

library(brms)

pisa <- readr::read_csv("https://raw.githubusercontent.com/m-clark/generalized-additive-models/master/data/pisasci2006.csv")
code <- make_stancode(Overall ~ 1 + s(Income, bs = "cr"), data = pisa)
data <- make_standata(Overall ~ 1 + s(Income, bs = "cr"), data = pisa)

brm_gam <- brm(Overall ~ 1 + s(Income, bs = "cr"), data = pisa)
mgcv_gam <- mgcv::gam(Overall ~ 1 + s(Income, bs = "cr"), data = pisa)

Below are the summaries of brm_gam and mgcv_gam, respectively:

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: Overall ~ 1 + s(Income, bs = "cr") 
   Data: pisa (Number of observations: 54) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Smooth Terms: 
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(sIncome_1)     5.13      3.14     1.40    13.52 1.01      482      598

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   470.69      4.15   462.28   478.63 1.00     4515     2753
sIncome_1     8.04      1.29     5.48    10.60 1.00     3259     2654

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    31.67      3.51    25.65    39.42 1.00     1185      624

 Family: gaussian 
Link function: identity 

Formula:
Overall ~ 1 + s(Income, bs = "cr")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  470.444      4.082   115.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
            edf Ref.df     F p-value    
s(Income) 6.895  7.741 16.67  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =    0.7   Deviance explained = 73.9%
GCV = 1053.7  Scale est. = 899.67    n = 54

The Stan code produced is

// generated with brms 2.18.0
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  // data for splines
  int Ks;  // number of linear effects
  matrix[N, Ks] Xs;  // design matrix for the linear effects
  // data for spline s(Income, bs = "cr")
  int nb_1;  // number of bases
  int knots_1[nb_1];  // number of knots
  // basis function matrices
  matrix[N, knots_1[1]] Zs_1_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
  vector[Ks] bs;  // spline coefficients
  // parameters for spline s(Income, bs = "cr")
  // standarized spline coefficients
  vector[knots_1[1]] zs_1_1;
  real<lower=0> sds_1_1;  // standard deviations of spline coefficients
  real<lower=0> sigma;  // dispersion parameter
}
transformed parameters {
  // actual spline coefficients
  vector[knots_1[1]] s_1_1;
  real lprior = 0;  // prior contributions to the log posterior
  // compute actual spline coefficients
  s_1_1 = sds_1_1 * zs_1_1;
  lprior += student_t_lpdf(Intercept | 3, 488, 50.4);
  lprior += student_t_lpdf(sds_1_1 | 3, 0, 50.4)
    - 1 * student_t_lccdf(0 | 3, 0, 50.4);
  lprior += student_t_lpdf(sigma | 3, 0, 50.4)
    - 1 * student_t_lccdf(0 | 3, 0, 50.4);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    mu += Intercept + Xs * bs + Zs_1_1 * s_1_1;
    target += normal_lpdf(Y | mu, sigma);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(zs_1_1);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
}

I would like to check the following

Again, thanks a lot for brms and your help!