ejolly / pymer4

All the convenience of lme4 in python
http://eshinjolly.com/pymer4
MIT License
193 stars 26 forks source link

Bug in likelihood-ratio test p-value calculation for Lmer models #134

Open paxtonfitzpatrick opened 9 months ago

paxtonfitzpatrick commented 9 months ago

Hi, and thanks for maintaining such a fantastic package!

I removed most of the issue template since I think this is a pretty straightforward bug and not related to my environment/system/specific data/etc., but if you'd like a more complete report with that info please let me know!

When computing a likelihood-ratio test between 2 (or more) Lmer models, pymer4.stats.lrt() computes the degrees of freedom shown in the output table based on the difference in the number of parameters between the models: https://github.com/ejolly/pymer4/blob/4605b460e244621b56ef7786253de7f27e4fe77f/pymer4/stats.py#L592

But then to compute the p-value for the test, pymer4.utils._lrt() re-computes the degrees of freedom based on the number of rows in the two models' .coefs attributes (pandas DataFrames): https://github.com/ejolly/pymer4/blob/4605b460e244621b56ef7786253de7f27e4fe77f/pymer4/utils.py#L635 which, for an Lmer model, includes only the fixed effects. So if the random effects differ between the two models, the p-value will be wrong.

I think this can pretty easily be fixed by using pymer4.utils._get_params(mod) instead of mod.coefs.shape[0] in pymer4.utils._lrt(). E.g., to ensure the function still works with Lm and Lm2 model classes, something like:

def _lrt(tup):
    """Likelihood ratio test between 2 models. Used by stats.lrt"""
    from .models import Lmer    # local import to avoid circular dependency
    d = np.abs(2 * (tup[0].logLike - tup[1].logLike))
    n_params_mod1 = _get_params(tup[0]) if isinstance(tup[0], Lmer) else tup[0].coefs.shape[0]
    n_params_mod2 = _get_params(tup[1]) if isinstance(tup[1], Lmer) else tup[1].coefs.shape[0]
    return chi2.sf(d, np.abs(n_params_mod1 - n_params_mod2))

Or since the test statistic and df are both already computed in the outer pymer4.stats.lrt() function, it might be simpler to just move the p-value calculation there and remove this helper function entirely.

I've confirmed that the p-value computed this way (i.e., using _get_params()) matches the p-value given by anova(mod1, mod2) in R, as do all other values in the DataFrame returned by pymer4.stats.lrt().

Thanks again!

ejolly commented 3 months ago

Thanks for bringing this up and for the suggested fix @paxtonfitzpatrick !