bambinos / formulae

Formulas for mixed-effects models in Python
https://bambinos.github.io/formulae/
MIT License

Version 0.3.3 crashes kernel with large dataset #75

Open alexjonesphd opened 2 years ago

alexjonesphd commented 2 years ago

Hi all,

Thank you for the work on this awesome package!

I am using bambi to fit a fairly complex model on a large dataset with a single group effect, which has many individuals. There are around 183,000 rows and 30,542 unique groups. Version 0.3.3 crashes Jupyter reliably when instantiating this design matrix. Interestingly, version 0.2.0 will instantiate it, with the caveat that within bambi I can't include an interaction term that 0.3.3 supports (see bambi issue 495).

I've tried a few different approaches and have found it will instantiate with about 25-50% of the data (using DataFrame.sample(frac=.25)), so it seems to be an issue of sheer scale more than anything else. I've also tried in Spyder and hit the same issue.

The code below will grab a modified dataset of the same size and structure I am using and set up the model design, which kills my kernel after a minute or two.

import pandas as pd
from formulae import design_matrices

trouble = pd.read_feather('https://osf.io/kw2xh/download')
md = design_matrices('response ~ bs(time, degree=2, df=10) * state + state * level * trait + (1|pid)', data=trouble)

Version 0.2.0 will return the design matrices after a little while, however. Any help is greatly appreciated; hopefully this issue isn't localised to my machine!

tomicapretto commented 2 years ago

Do you have this error?

Unexpected error while trying to evaluate a Variable. <class 'numpy.core._exceptions._ArrayMemoryError'>
MemoryError: Unable to allocate 41.7 GiB for an array with shape (183182, 30542) and data type int64
tomicapretto commented 2 years ago

The problem arises because of the (1|pid) in the model formula. There you have 30542 unique pid values, which results in a matrix of 183182 rows and 30542 columns. If formulae was smarter, we could use a sparse matrix here.

However... I'm not sure how well this is going to work in a model. You'll have 30542 parameters for this term and that seems like a lot. Have you ever worked with such sizes in other GLM frameworks?
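The memory figure in the traceback checks out: a dense int64 one-hot matrix of shape (183182, 30542) needs roughly 41.7 GiB, whereas a sparse representation with one nonzero per row would take only a few megabytes. A quick back-of-the-envelope check (the sparse estimate assumes CSR storage with int32 indices; these are illustrative numbers, not formulae internals):

```python
# Dense one-hot (dummy) matrix for the (1|pid) term: one column per unique pid.
n_rows, n_groups = 183_182, 30_542

# Dense int64 storage: every cell takes 8 bytes.
dense_bytes = n_rows * n_groups * 8
print(f"dense:  {dense_bytes / 2**30:.1f} GiB")  # matches the 41.7 GiB in the error

# CSR sparse storage: one nonzero per row (8-byte value + 4-byte column index)
# plus an int32 row-pointer array of length n_rows + 1.
sparse_bytes = n_rows * (8 + 4) + (n_rows + 1) * 4
print(f"sparse: {sparse_bytes / 2**20:.1f} MiB")
```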

alexjonesphd commented 2 years ago

Hey @tomicapretto I did not get that error, just a kernel crash! But I can definitely see why that's happening, I hadn't realised the explosion behind the scenes for the group specific matrix 🤯

I've worked a lot with similar-sized datasets using GLMs, but never with group-specific terms. Each pid provides several state measures, so my intuition was to incorporate the group effects to properly address those repeated measures. But perhaps this isn't the way to go at this scale. Incidentally, I tried to build a model for this using pymc3 and estimated it, but noticed the HDIs for the state variable absorbed into the intercept were non-existent; I also noticed bambi does something smart with categorical predictors. Any other advice is greatly appreciated, but I realise this may be out of scope of the code issue at hand.

tomicapretto commented 2 years ago

Alex,

I guess you're getting a similar error, but somehow Jupyter does not show it to you (unless you have more than 41 GiB of RAM). I'm not sure what model you built with PyMC, but the fitting certainly depends on whether you include this pid predictor or not. Whether you use a hierarchical or non-hierarchical model, I think it's tricky simply because of the dimensionality of the posterior induced by the ~30k parameters for pid.

Maybe you could try with a simpler model, or

alexjonesphd commented 2 years ago

Thanks @tomicapretto! With PyMC3 I used patsy to build the design matrix for fixed effects, and then just indexed the non-centred group effect using codes as is usually done in PyMC for hierarchical models. It sampled fine with ADVI, and the predictions looked sensible.
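The index-based trick described above sidesteps the dense dummy matrix entirely: instead of a (183182, 30542) one-hot matrix, you keep a length-183182 vector of integer group codes and index into a length-30542 parameter vector. A minimal NumPy sketch of just the encoding step (the data and the `pid_effect` values are toy illustrations, not the actual model):

```python
import numpy as np

# Toy stand-in for the pid column: a few repeated group labels.
pids = np.array(["a", "b", "a", "c", "b", "c", "c"])

# Map each pid to an integer code; `groups` holds the unique labels.
groups, codes = np.unique(pids, return_inverse=True)

# In the PyMC model one parameter per group is drawn; fancy-indexing with
# `codes` broadcasts it back to one value per row -- no dummy matrix needed.
pid_effect = np.array([0.5, -1.0, 2.0])  # hypothetical per-group intercepts
row_effect = pid_effect[codes]           # shape (n_rows,)

print(codes)       # [0 1 0 2 1 2 2]
print(row_effect)  # [ 0.5 -1.   0.5  2.  -1.   2.   2. ]
```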

What's strange about my issue is that using the conda install of bambi (so 0.2.0 for formulae), the model will build, sample with ADVI, and predict just fine, albeit with the modification of the design to the below:

'response ~ bs(time, degree=2, df=10) + state * level * trait + (1|pid)'

Unfortunately, I need that interaction term, but I can make do with PyMC3 here. I am still noticing that even when fitting the model with ADVI via bambi that the predictions for the state variable that is absorbed into the intercept are very narrow, whatever level I set as the reference category. Perhaps that is normal, but it struck me as odd, and sent me on this path of figuring all of this out!

tomicapretto commented 2 years ago

Hmmm, to be honest I don't know why it works with formulae==0.2.0 and not with formulae>0.3.0. I do think there are a couple of things we could do to improve it, namely adding support for sparse matrices and using them appropriately in Bambi.
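For reference, the sparse-matrix support being discussed could build the group design matrix directly in CSR form. A sketch with scipy.sparse (this is an illustration of the idea on toy data, not formulae's actual implementation):

```python
import numpy as np
from scipy import sparse

# Toy pid column and its integer codes.
pids = np.array(["a", "b", "a", "c", "b", "c", "c"])
groups, codes = np.unique(pids, return_inverse=True)
n_rows, n_groups = len(pids), len(groups)

# One-hot group matrix with a single 1 per row, stored sparsely:
# memory grows with the number of rows, not rows * groups.
Z = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), codes)),
    shape=(n_rows, n_groups),
)

print(Z.shape)  # (7, 3)
print(Z.nnz)    # 7 nonzeros instead of 21 dense cells
```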

In the meantime, I would recommend using the PyMC approach if that works for you :+1:

alexjonesphd commented 2 years ago

I will stick with the PyMC/patsy hybrid I have up and running for now! I'm relieved to see bambi gives me very similar results when it does build, which is a good sanity check. I'd be happy to try and contribute something with sparse matrices for formulae but will need to study the docs first!

Thanks for all your help @tomicapretto