bambinos / bambi

BAyesian Model-Building Interface (Bambi) in Python.
https://bambinos.github.io/bambi/
MIT License

Add informative message when prior scaling fails due to non-identifiable fixed effects #30

Closed. tyarkoni closed this issue 8 years ago.

tyarkoni commented 8 years ago

When a model with default priors and non-identifiable fixed effects is created (e.g., by calling something like add_term('condition', drop_first=False) when an intercept is already in the model), the prior scaling approach currently implemented in PriorScaler sets the random factor hyperprior variances to inf, and the sampler consequently refuses to start. We should either check for this scenario and substitute some large but sane value, or raise an exception with a more informative message rather than waiting for Theano to complain.
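For concreteness, a minimal sketch of the failing scenario, using the add_y/add_term/build calls that appear later in this thread (the data frame data and the 'y'/'condition' column names are placeholders):

model = Model(data, intercept=True)                               # explicitly include an intercept
model.add_y('y')                                                  # placeholder outcome column
model.add_term('condition', categorical=True, drop_first=False)   # full-rank coding, redundant with the intercept
model.build()                                                     # prior scaling sets variances to inf; sampling never starts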

tyarkoni commented 8 years ago

On further digging, the problem arises even if one excludes the intercept. In the case of full-rank coding of a factor (i.e., drop_first=False), each level of the factor can always be perfectly predicted from the others (so the r2_x values get set to 1, and hence the variances of the random effects get set to infinity), even though the model can still be estimated just fine as long as the intercept is excluded. So we'll need to handle this case silently rather than raising an exception.
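To illustrate the collinearity at issue: with full-rank dummy coding, the indicator columns sum to one, so any column is a perfect linear function of the others and the auxiliary R² is 1. A self-contained check using only numpy and pandas (not bambi internals):

import numpy as np
import pandas as pd

condition = pd.Series(['a', 'b', 'c', 'a', 'b', 'c'])
dummies = pd.get_dummies(condition).astype(float)   # full-rank coding: one column per level
y = dummies['a'].values                             # try to predict level 'a' from the other levels
X = np.column_stack([np.ones(len(y)), dummies['b'].values, dummies['c'].values])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
print(1 - resid.var() / y.var())                    # 1.0: r2_x ends up at 1, so the scaled variance blows up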

jake-westfall commented 8 years ago

Let me work on this for a little bit. I have an idea about how this could be resolved nicely, but we'll see how that goes.

jake-westfall commented 8 years ago

The idea worked :) I also added code that catches perfect collinearity among the fixed effects and prints an informative error (including the full dm_statistics and _diagnostics dictionaries). I have tests running locally right now; I'm pretty sure they'll pass. When they're done, I'll push the changes to the new_model_tests branch and close this issue.

jake-westfall commented 8 years ago

Hmm, did the push not close this issue because it's not in the main branch?

tyarkoni commented 8 years ago

Yeah, it won't close issues until they're merged into master. Will take a look at it right now.

tyarkoni commented 8 years ago

...or whenever you open a PR. :)

jake-westfall commented 8 years ago

I was going to wait until all the tests were passing before doing the PR. Or I could just do it now, either way I guess!

tyarkoni commented 8 years ago

Doesn't really matter; the tests will keep running after you open a PR (and I won't merge until they pass). I just tested the new branch and am getting some errors I haven't seen before, so there might still be some work to do. But new commits will just keep getting applied to the open PR, so that won't matter either.

tyarkoni commented 8 years ago

Looks like the Travis build failed; see https://travis-ci.org/bambinos/bambi/builds/155777052.

tyarkoni commented 8 years ago

Oh, just saw your comment in the PR. I'll try to fix those random effects issues before merging.

tyarkoni commented 8 years ago

There's a bug in the fixed effects handling that shows up given something like this:

model_3v = Model(data, intercept=False)
model_3v.add_term('word_class', categorical=True, drop_first=False)

In this case there are two levels of word_class. The problem is that the scaling code sets the prior SD to an array (i.e., each level of the factor gets its own estimated SD), which breaks PyMC3 (hyperpriors can't have shape > 1). Would it make sense to either pool the estimated SDs or just take the largest one? We could also set it up so that each level of the fixed factor is added to the model separately, but that would get messy when it comes time to summarize and plot the results.

tyarkoni commented 8 years ago

I just confirmed that taking the max of slope_constant in _scale_fixed() solves the problem. Not sure if that suffices, or if you want to do something more principled.
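In other words, the workaround collapses the per-level array to a single scalar before it's used as the hyperprior SD; a tiny sketch with made-up values (slope_constant is assumed to be a 1-d array with one entry per factor level):

import numpy as np
slope_constant = np.array([6.19, 5.49, 6.09])   # hypothetical per-level scales
sd = float(np.max(slope_constant))              # single scalar SD that PyMC3 accepts for a hyperprior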

jake-westfall commented 8 years ago

I can't reproduce the error in the version from the PR... specifically, this code (using the crossed_data dataset from /tests/data) works for me:

model_3v = Model(crossed_data, intercept=False)
model_3v.add_y('Y')
model_3v.add_term('threecats', categorical=True, drop_first=False)
model_3v.build()
fitted = model_3v.fit()

tyarkoni commented 8 years ago

Hmm, odd. Let me try it with your sample data.

jake-westfall commented 8 years ago

FYI, the priors (as shown by {x.name:x.prior.args for x in model_3v.terms.values()}) are:

{'threecats': {'mu': array([ 3.28978105]), 'sd': array([ 6.78490202])}}

tyarkoni commented 8 years ago

Sorry, I had it wrong: you have to add a random effect to the model to get the problematic behavior. I get the same prior as you do for threecats in the above model. But if I add this:

model.add_term('subj', categorical=True, random=True, drop_first=False)

...I now get these SDs for threecats:

{'mu': 0, 'sd': array([ 6.19197967, 5.48812616, 6.08593685])}

Note that it's the fixed effect prior that's problematic in the presence of the random factor, not the random factor prior.

jake-westfall commented 8 years ago

Oh, I see. The issue seems to be that in _scale_fixed() in priors.py, this case "should" end up in the 'else' clause (which is what happens when there are no random effects), but adding the random term to the model makes self.stats not None, so it ends up in the 'if' branch and gets treated like the intercept-included case.

jake-westfall commented 8 years ago

It might work to change if self.stats is not None: on line 187 to if len(self.model.fixed_terms) > 1:, although I'm not sure why adding the random term to the model causes the dm_statistics object to get created in the first place...

jake-westfall commented 8 years ago

Oh, it's because dm_statistics gets created during building if len(self.terms) > 1, but I'm pretty sure it should be if len(self.fixed_terms) > 1.

jake-westfall commented 8 years ago

I think making that change should resolve this whole issue... if you're already editing that code, it might be easier for you to make the change instead of me. Just thinking about how to avoid merge conflicts.

tyarkoni commented 8 years ago

I think it would happen even with multiple fixed effects, without the random effects... it's the len(self.terms) > 1 check that seems problematic.

Probably a better way to go is to always compute dm_statistics, and then check the properties of the Term instance passed to the _scale_{term} methods in PriorScaler to determine what to do. E.g., in this case, I think you can use term.categorical as your check, because slopes will always be continuous and intercepts will be categorical.

I'll take a crack at this unless you want to do it.
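A toy sketch of the proposed dispatch (not the actual bambi source; only term.categorical and the dm_statistics field names come from this thread): always compute the statistics, then branch on the term's own attributes rather than on whether self.stats exists:

from collections import namedtuple

Term = namedtuple('Term', ['name', 'categorical'])   # stand-in for bambi's Term class

def scale_fixed(term, stats):
    # Branch on the term itself, not on whether dm_statistics happens to exist.
    if term.categorical:
        return 'cell-means / full-rank case: scale the levels directly'
    return 'continuous slope: use sd_x, sd_y, and r2_x from dm_statistics'

print(scale_fixed(Term('threecats', True), stats={}))
print(scale_fixed(Term('x', False), stats={'r2_x': 0.3}))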

jake-westfall commented 8 years ago

Well, unless len(self.fixed_terms) > 1, there's not much to compute for dm_statistics (for example, r2_x can't be filled in), the only exception being the cell-means case. I can try to fix this, unless you've already started. Multiple fixed effects seem to work, although the current test only includes one continuous predictor, one dummy (which looks numeric), and one three-level categorical; I haven't tried a model with multiple categoricals that each have >2 levels. I should add that test.

tyarkoni commented 8 years ago

Okay, that problem is solved by modifying the _scale_fixed() conditional to check term.categorical. There was also a separate problem with models involving more than one fixed effect term having to do with pandas indexing in the concat call, but I think I fixed that too. Let's see how this works now.

jake-westfall commented 8 years ago

Cool. I hadn't run across the concat error, do you have an example of a model that was tripped up by this? I haven't had any issues with models with multiple fixed effects, including some models I just tested with multiple categorical predictors each with >2 levels (i.e., fixed-effect ANOVA models).

tyarkoni commented 8 years ago

Haven't gone back to the concat issue yet, but I'm running into a different problem. It looks like if you only have a single random effect in the model, the dm_statistics don't get computed, even though _scale_random() makes reference to sd_x and sd_y. This seems like it will need to be added as a special case. Do you want to add that to _scale_random()?

tyarkoni commented 8 years ago

Actually, let me take a shot at this. Seems like an easy fix is to always compute sd_y, and to default to a value of 0 for r2_x if it doesn't exist.

jake-westfall commented 8 years ago

Default values of 0 for both r2_x and r2_y seem like they should work. It also makes statistical sense, because then it's just like we're setting the prior on the simple correlation rather than the partial correlation.

jake-westfall commented 8 years ago

I think I'm fairly close to having the build passing, even with two new tests :)

tyarkoni commented 8 years ago

Great, though I noticed some issues that will require unrelated changes to the code to get the tests to pass (e.g., term labeling conventions differ between the formula and add_term interfaces, so testing for equivalence of term names is likely to fail in cases involving | relationships unless you explicitly set labels when using add_term).

Also, I've decided to change the add_term semantics to be more consistent with the formula interface. I'm going to get rid of split_by and replace it with an over argument that behaves essentially like |. Shouldn't interfere with anything you're doing in terms of the prior scaling, but we will definitely need to edit all of the tests after the branches are merged.

jake-westfall commented 8 years ago

Yeah, I noticed that the term labeling sometimes differed, so instead I've been focusing on testing that the design matrices are equivalent, even if the names, column order, etc. differ. I do check some term names in cases where I'm fairly sure they're supposed to be the same across add_formula and add_term.