TuringLang / TuringGLM.jl

Bayesian Generalized Linear models using `@formula` syntax.
https://turinglang.org/TuringGLM.jl/dev
MIT License
70 stars 7 forks source link

Roadmap for Hierarchical Models #21

Open storopoli opened 2 years ago

storopoli commented 2 years ago
jfhawkin commented 1 year ago

When trying to improve a Turing hierarchical intercept logistic model by reviewing turing_model.jl, I noticed that function _model(μ_X, σ_X, prior, intercept_ranef, idx, ::Type{Bernoulli}) includes a normalization on the dependent variable, which here is 0/1. It gives me an error because mad(y) in my case is 0, which messes with the hyperparameter $\tau$ for SD. I thought I'd bring it to the developers' attention.

mad_y=mad(y; normalize=true) (ln 266)

storopoli commented 1 year ago

Thanks for pointing this out!

For brms, the default prior for group-level parameters standard deviation is:

"restricted to be non-negative and, by default, have a half student-t prior with 3 degrees of freedom and a scale parameter that depends on the standard deviation of the response after applying the link function. Minimally, the scale parameter is 2.5"

where as for rstanarm it is much more complicated with LKJs and product of simplexes.

I don't think we have a problem with how the prior currently is defined (based on MAD). This is the formula for MAD:

$$\operatorname{MAD} = \operatorname{median} (|X_{i}-{\tilde {X}}|)$$

Now if your $y$ has MAD 0, then it has no variability.

jfhawkin commented 1 year ago

Thanks! I must be having a different issue then. I'm porting an rstanarm example for multi-level regression with poststratification into Turing (posted to Discourse here.

I run

fm = @formula(abortion ~ (1 | state) + male)
model = turing_model(fm, cces_df; model=Bernoulli);
chn = sample(model, NUTS(), 2_000);

and get the error DomainError with 0.0: AffineDistribution: the condition !(iszero(σ)) is not satisfied.

After getting this error, I had looked at the TuringGLM code and saw the mad(y) call. I calculated it for my dataset using mad from StatsBase. It gives me 0 to 5 decimal places. I also calculated it manually to confirm and get the same result. The data has variation because about 42% of observations are 1. Given MAD is the median of the absolute differences from the median, this would be the case for many datasets with 0/1 binary data. I ran a very simple experiment. If your data are [0,0,0,1] the median is 0 and the median variation is 0. If your data are [0,1,1,1] the same thing is true because the variations are 1-1=0. If you have [0,0,1,1], then you get a MAD = 0.5 (i.e, non-zero). I think this is true unless you have a 50/50 split (did a quick test in Excel repeatedly calculating it on vectors of 0/1 and never got a non-zero value). The mean varies but the median difference is a function of the fact that differences will always be 0 or 1 and you are taking the middle value.

storopoli commented 1 year ago

Oh I see, it might be more interesting to use the standard deviation std instead of MAD?

jfhawkin commented 1 year ago

Makes sense to me.