Quantco / glum

High performance Python GLMs with all the features!
https://glum.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Change base levels for categorical fields? #826

Closed · enriquecardenas24 closed this issue 1 month ago

enriquecardenas24 commented 2 months ago

When modeling with glum on a dataset containing both categorical and numeric features, I want to manually set the base levels for the categorical fields. In statsmodels this can be done through the formula interface. An example can be seen in a previous issue I opened, #777, where jtilly set the base levels in the statsmodels formula to 1.0 in order to align that model's coefficients with a glum model.

import statsmodels.api as sm
import statsmodels.formula.api as smf

# References are base levels for categorical features.
formula = "Response ~ C(Year, Treatment(reference=1.0))"
formula += " + C(Field16952, Treatment(reference=1.0))"
formula += " + Field16995 + Field17024 + Field17041"  # all numeric here
formula += " + Field17045"
sm_fam = sm.families.Binomial()
sm_model = smf.glm(formula, train_data, family=sm_fam).fit()

Originally posted by @jtilly in https://github.com/Quantco/glum/issues/777#issuecomment-1979470033

Here, Year and Field16952 are categorical features with base level references.

Is there a way to modify the base levels of categorical features for a glum model?

stanmart commented 2 months ago

For the moment, the best approach is to rely on the fact that glum drops the first level of a categorical: reorder the levels with pandas.Series.cat.reorder_categories so that the desired reference comes first. You can even wrap this in a helper function like

import pandas as pd

def C_with_ref(var: pd.Series, reference: str):
    # Move `reference` to the front so that it becomes the dropped base level.
    if isinstance(var.dtype, pd.CategoricalDtype):
        if reference not in var.cat.categories:
            raise ValueError(f"{reference} does not appear in the series")
        new_levels = [reference] + [c for c in var.cat.categories if c != reference]
        return var.cat.reorder_categories(new_levels)
    else:
        # Not categorical yet: build a categorical dtype with `reference` first.
        levels = var.unique()
        if reference not in levels:
            raise ValueError(f"{reference} does not appear in the series")
        new_levels = [reference] + [c for c in levels if c != reference]
        return var.astype(pd.CategoricalDtype(categories=new_levels))
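
Called directly on a column, the helper simply returns the same series with the reference level moved to the front; a quick sanity check (toy data, just for illustration):

s = pd.Series(["a", "b", "c", "b"], dtype="category")
print(C_with_ref(s, "b").cat.categories)
# Index(['b', 'a', 'c'], dtype='object')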

Inside a formula, you can then use it like so:

import glum

formula = "Response ~ C_with_ref(Year, reference=1.0)"
formula += " + C_with_ref(Field16952, reference=1.0)"
formula += " + Field16995 + Field17024 + Field17041"  # all numeric here
formula += " + Field17045"
model = glum.GeneralizedLinearRegressor(family="binomial", formula=formula)
model.fit(train_data, context={"C_with_ref": C_with_ref})

It would be a nice QoL improvement, though, if glum supported reference as an argument to C.

enriquecardenas24 commented 2 months ago

@stanmart thank you for the reply. I understand the thought process, but I am having trouble applying this to my situation. I will try to rephrase this in a glum context for clarity. I am unfamiliar with how the context argument works, but suppose I have:

import glum
import tabmat

# Initialize Tweedie model.
model = glum.GeneralizedLinearRegressor(
    fit_intercept=True, family=glum.TweedieDistribution(1.6), link='log',
    gradient_tol=1e-7,
)

# Set references for categorical features.
references = {
    'Year': 7.0,
    'Field16952': 21.0,
    'Field17041_e4': 1.0,  # think 1.0 is the default
}
for f in references:
    X_train[f] = C_with_ref(X_train[f], references[f])  # using the fn provided

# Set modeling input and fit the model.
X = tabmat.from_pandas(df=X_train, drop_first=is_GLM)  # is_GLM is True here
model.fit(
    X=X, y=y_train,
    store_covariance_matrix=True,
    # context={'C_with_ref': C_with_ref},
)

I don't believe this is having the desired effect: my model's predictions on the training set are identical to the predictions without setting base levels. I have also tried a few other things, including passing the references dictionary as the context argument.

Can you please clarify how to use the context parameter in this example, given the references dictionary?

stanmart commented 2 months ago

The context parameter is only used when the model is specified via a formula. In that case, all it does is make the variables (including functions) in the context dictionary available for use within the formula. Without it, C_with_ref could not have been used in the formula in my example. Your solution of preprocessing the data outside of a formula works just as well.
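
For instance, a minimal sketch of the formula route (the toy data frame is invented for illustration):

import glum
import pandas as pd

df = pd.DataFrame({
    "Response": [0, 1, 1, 0, 1, 0],
    "Year": [1.0, 2.0, 1.0, 2.0, 2.0, 1.0],
})

# Names used inside the formula are resolved via `context`; without it,
# C_with_ref would be undefined within the formula.
model = glum.GeneralizedLinearRegressor(
    family="binomial", formula="Response ~ C_with_ref(Year, reference=2.0)"
)
model.fit(df, context={"C_with_ref": C_with_ref})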

The fact that the predictions do not change is not a bug. The model should produce the same predictions regardless of the chosen references for the categorical variables.* The reason is the following: for any given categorical, take the set of all its dummy variables (i.e. without dropping any of them) plus the constant. If you drop any one variable from this set, the linear span of the remaining variables is the same as that of the original set. Therefore both sets can generate exactly the same fitted values, and the fitted values minimizing the loss function will, again, be the same.

*At least for an unpenalized fit; with regularization, the choice of reference level can affect the fitted values.

The only things that change are the intercept and the coefficients of the dummies for the categorical whose reference you modify. Take a look at model.coef_table() to see those changes (and note that the omitted level differs depending on which reference level you choose).
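
To see both effects at once, here is a small check: fit the same unpenalized model with two different reference levels and compare (the data is simulated for illustration, and the snippet assumes glum's drop_first option and feature_names_ attribute):

import glum
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"g": pd.Categorical(rng.choice(["a", "b", "c"], size=100))})
y = rng.poisson(2.0, size=100)

def fit_with_ref(ref):
    # Same data, different reference level for the categorical.
    X = df.assign(g=C_with_ref(df["g"], ref))
    model = glum.GeneralizedLinearRegressor(family="poisson", alpha=0, drop_first=True)
    model.fit(X, y)
    return model, model.predict(X)

(m_a, pred_a), (m_b, pred_b) = fit_with_ref("a"), fit_with_ref("b")
assert np.allclose(pred_a, pred_b)  # identical fitted values
print(m_a.intercept_, dict(zip(m_a.feature_names_, m_a.coef_)))  # omits "a"
print(m_b.intercept_, dict(zip(m_b.feature_names_, m_b.coef_)))  # omits "b"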