Closed · enriquecardenas24 closed this 1 month ago
For the moment, the best approach is to rely on the fact that glum drops the first level, so you can reorder the levels accordingly using `pandas.Series.cat.reorder_categories`. You can even make it into a helper function like
```python
import pandas as pd


def C_with_ref(var: pd.Series, reference: str) -> pd.Series:
    """Return `var` as a categorical whose first level is `reference`."""
    if isinstance(var.dtype, pd.CategoricalDtype):
        if reference not in var.cat.categories:
            raise ValueError(f"{reference} does not appear in the series")
        new_levels = [reference] + [c for c in var.cat.categories if c != reference]
        return var.cat.reorder_categories(new_levels)
    else:
        levels = var.unique()
        if reference not in levels:
            raise ValueError(f"{reference} does not appear in the series")
        new_levels = [reference] + [c for c in levels if c != reference]
        return var.astype(pd.CategoricalDtype(categories=new_levels))
```
and use it like so:
```python
import glum

formula = "Response ~ C_with_ref(Year, reference=1.0)"
formula += " + C_with_ref(Field16952, reference=1.0)"
formula += " + Field16995 + Field17024 + Field17041"  # all numeric here
formula += " + Field17045"

# glum's formula interface; `context` makes C_with_ref visible to the formula
model = glum.GeneralizedLinearRegressor(formula=formula, family="binomial")
model.fit(train_data, context={"C_with_ref": C_with_ref})
```
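To see why the reordering trick works, here is a quick self-contained sanity check. It uses `pandas.get_dummies(drop_first=True)` as a stand-in for glum's drop-the-first-level encoding (glum's actual encoding lives in tabmat, but the level-order logic is the same):

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "b"], dtype="category")

# Default order: "a" is the first level, so drop_first omits the "a" dummy.
print(pd.get_dummies(s, drop_first=True).columns.tolist())   # ['b', 'c']

# After reordering with "b" first, the "b" dummy is the one omitted instead.
s2 = s.cat.reorder_categories(["b", "a", "c"])
print(pd.get_dummies(s2, drop_first=True).columns.tolist())  # ['a', 'c']

# The values themselves are untouched; only the category order changed.
assert (s.astype(str) == s2.astype(str)).all()
```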
It would be a nice QoL improvement, though, if we supported `reference` as an argument to `C` in glum.
@stanmart thank you for the reply. I understand the thought process, but I am having trouble applying it to my situation, so let me rephrase it in a glum context for clarity. I am unfamiliar with how the `context` argument works, but suppose I have:
```python
import glum
import tabmat

# Initialize Tweedie model.
model = glum.GeneralizedLinearRegressor(
    fit_intercept=True,
    family=glum.TweedieDistribution(1.6),
    link="log",
    gradient_tol=1e-7,
)

# Set references for categorical features.
references = {
    "Year": 7.0,
    "Field16952": 21.0,
    "Field17041_e4": 1.0,  # think 1.0 is the default
}
for f in references:
    X_train[f] = C_with_ref(X_train[f], references[f])  # using fn provided

# Set modeling input and fit the model.
X = tabmat.from_pandas(df=X_train, drop_first=is_GLM)
model.fit(
    X=X,
    y=y_train,
    store_covariance_matrix=True,
    # context={"C_with_ref": C_with_ref},
)
```
I don't believe this is giving the desired effect, as my model predictions on the training set are the same as the predictions without setting base levels. I have also tried a few other things, including substituting the `references` dictionary in the `context` argument.

Can you please clarify how to use the `context` parameter in this example, given the `references` dictionary?
The `context` parameter is only used when the model is based on a formula. In that case, all it does is make the variables (including functions) in the context dictionary available for use within the formula. Without it, in my example, `C_with_ref` could not have been used within the formula. Your solution of preprocessing the data outside of the formula works just as well.
The fact that the predictions do not change is not a bug. The model should produce the same predictions regardless of the chosen references for the categorical variables*. The reason is the following. For any given categorical, take the set of all its dummy variables (i.e. without dropping any of them) plus the constant. If you drop any one variable from this set, the linear span of the remaining variables is the same as that of the original set. Therefore, both can generate exactly the same set of fitted values (of which the one minimizing the loss function will, again, be the same).
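The linear-span argument above can be verified numerically. This is a minimal NumPy sketch (toy least-squares data, not glum itself): the same categorical is encoded three times, each time dropping a different level, and the fitted values come out identical:

```python
import numpy as np

# Toy data: one categorical with 3 levels and a level-dependent mean.
rng = np.random.default_rng(0)
cat = rng.integers(0, 3, size=30)                     # levels 0, 1, 2
y = 1.0 + np.array([0.0, 2.0, -1.0])[cat] + rng.normal(size=30)


def design(drop_level):
    # Intercept plus dummies for every level except the dropped (reference) one.
    cols = [np.ones_like(y)]
    cols += [(cat == lvl).astype(float) for lvl in range(3) if lvl != drop_level]
    return np.column_stack(cols)


# Least-squares fit with each possible reference level.
fits = []
for ref in range(3):
    X = design(ref)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # coefficients differ by ref
    fits.append(X @ beta)                             # ... but fitted values do not

assert np.allclose(fits[0], fits[1]) and np.allclose(fits[0], fits[2])
```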
The only things that change are the intercept term and the coefficients of the dummies for the category you modify. Take a look at `model.coef_table()` to see those changes (and also that the omitted level differs depending on what you choose as the reference level).
When modeling with glum using a dataset containing both categorical and numeric features, I want to manually set base levels for the categorical fields. This can be done in statsmodels models with the "formula" input. An example can be seen in a previous issue I opened, #777. In that issue, the base levels in the statsmodels model formula were set to 1.0 by jtilly in order to align the coefficients of the model to a glum model.

Originally posted by @jtilly in https://github.com/Quantco/glum/issues/777#issuecomment-1979470033
Here, `Year` and `Field16952` are categorical features with base level references. Is there a way to modify the base levels of categorical features for a glum model?