Quantco / glum

High performance Python GLMs with all the features!
https://glum.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Elastic net parameter selection question #722

Closed enriquecardenas24 closed 6 months ago

enriquecardenas24 commented 11 months ago

Hello all,

I'm interested in the elastic net parameter selection method used by GeneralizedLinearRegressorCV. Thanks to @jtilly, I know it selects the alpha and l1_ratio parameters that give the lowest model deviance. However, I am curious about what the term "deviance" means and about the reasoning behind choosing the parameters based on this metric as opposed to something like MSE. I also want to ask why MSE is often significantly higher in the built-in parametrization cases.

Below are snippets of some basic code I have.

First, this bit is to select the best alpha and l1_ratio parameters according to the built-in selection method:

strength = [1e-12, 1e-3, 0.1, 1, 5, 10]
mixing = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

reg_model = glum.GeneralizedLinearRegressorCV(alphas = strength,  l1_ratio = mixing, 
                                              P1 = Incl_Reg, P2 = Incl_Reg, # values of 1 always
                                              family = family, link = 'log', # family is Tweedie
                                              )
reg_model.fit(X_Train_Reg_Params, Data_All_Train[data_response], Data_All_Train[data_weight])
strength1 = reg_model.alpha_
mixing1 = reg_model.l1_ratio_

I can then save this model by rerunning with GeneralizedLinearRegressor, passing the selected single values (strength1, mixing1) for those parameters. I have also done this with no strength input (alphas = None) so the class can select a more accurate alpha itself, which often gives different results.
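For reference, here is a minimal sketch of that refit step (it reuses the names from the snippet above; `final_model` is just an illustrative name):

```python
# Refit on the full training data with the single alpha / l1_ratio selected above.
final_model = glum.GeneralizedLinearRegressor(
    alpha=strength1, l1_ratio=mixing1,
    P1=Incl_Reg, P2=Incl_Reg,
    family=family, link='log',
)
final_model.fit(X_Train_Reg_Params, Data_All_Train[data_response], Data_All_Train[data_weight])
```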

Second, using the same parameters from the above snippet, I also have a basic setup to record MSE values for different combinations of possible parameters:

# Function to get MSE of a model's predictions
def Fn_Get_MSE(model, MSE_type):

    if MSE_type == 'train':
        input_dataset = Data_All_Train
        input_predictors = X_Train
        weights = Data_All_Train[data_weight]
    else:
        input_dataset = Data_All_Test
        input_predictors = X_Test
        weights = Data_All_Test[data_weight]

    predictions = pd.DataFrame()
    predictions[data_obsv] = input_dataset[data_obsv] # observed (actual) values
    X_Model = tabmat.from_pandas(df = input_predictors, drop_first = drop_first)
    predictions['Pred'] = model.predict(X_Model, weights) # / weights

    predictions['Error'] = predictions['Pred'] - predictions[data_obsv]
    predictions['Sq_error'] = predictions['Error'] ** 2
    MSE = np.mean(predictions['Sq_error'])

    return MSE

combinations = np.array(np.meshgrid(strength, mixing)).T.reshape(-1, 2)
train_MSE = []
test_MSE = []
for combo in combinations:
    strength1 = combo[0]
    mixing1 = round(combo[1], 1)

    glum_model_manual_EN = glum.GeneralizedLinearRegressor(alpha = strength1, l1_ratio = mixing1,
                                                           P1 = Incl_Reg, P2 = Incl_Reg, family = family, link = link_func,
                                                           )

    glum_model_manual_EN.fit(X_Train_Model, Data_All_Train[data_response], Data_All_Train[data_weight])
    train_MSE_val = Fn_Get_MSE(glum_model_manual_EN, MSE_type = 'train')
    test_MSE_val = Fn_Get_MSE(glum_model_manual_EN, MSE_type = 'test')            
    train_MSE.append(train_MSE_val)
    test_MSE.append(test_MSE_val)

# Get best parameters based on train MSE
min_idx = train_MSE.index(min(train_MSE))
best_params = combinations[min_idx]
strength1 = best_params[0]
mixing1 = round(best_params[1], 1)
# Fit model again using these parameters
glum_model_manual_EN = Fn_Run_Glum_Model(strength1, mixing1, prev_time = prev_time)

I have tried these methods on 5 different datasets (see attached: pre_banded_index.csv), and I have noticed that the model with built-in parametrization does not always select the model with the lowest MSE. For example, in the case of this dataset, using fold 2 as the test and 1, 3, 4 as the train, the models give MSE values:

| Modeling method | Train MSE | Test MSE |
| --- | --- | --- |
| Built-in search | 2.04515e8 | 2.33169e8 |
| Built-in search, no alpha input | 2.04530e8 | 2.33229e8 |
| Manual search | 2.04507e8 | 2.33114e8 |

On the other hand, manual iteration through a grid of strength and mixing values eventually gives a model with a lower MSE, as shown. This may be expected, since I am selecting parameters based on MSE in the manual case, but my questions still remain.

To summarize, my questions are:

  1. What is meant by selecting parameters with the lowest "deviance"?
  2. Why use the deviance metric for parametrization?
  3. Why isn't built-in parametrization more accurate than manual parameter selection?

I'll be happy to provide more detail if necessary.

lbittarello commented 11 months ago

Hi @enriquecardenas24,

The deviance is equal to the log likelihood (LL), up to constants. To be precise, it's twice the difference between the LL of a saturated model (one which fits the data exactly) and the LL of the fitted model. We use the deviance because it's faster to calculate than the LL, but the resulting coefficients are the exact same.

By minimising the deviance, therefore, we're actually maximising the LL – a method known as "maximum likelihood estimation" (MLE). MLE has many nice properties, including asymptotic normality, robustness to misspecification and efficiency. It's the statistical framework of choice for GLMs and also underlies many estimators in machine learning (e.g., LightGBM).

Importantly, the deviance measures the quality of fit under the model's distributional assumptions. Suppose, for example, that you have exponentially distributed data. If you try to minimise the MSE, you'll be treating deviations from the mean symmetrically and focusing on outliers. You won't take into account that exponential data are bounded below at zero and have a long tail, so your predictions will end up too high for the majority of the data points, which lie below the mean. The deviance pays heed to the distribution of the data and will give more weight to the left tail, where most data points are.
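As a small, hand-rolled illustration (using glum's GammaDistribution for concreteness, since the exponential distribution is a gamma with unit shape; the printed values are approximate), two predictions with the same squared error receive quite different deviances:

```python
import numpy as np
import glum

y = np.array([1.0])

# Same squared error (0.25) for both predictions...
mu_low = np.array([0.5])   # under-prediction
mu_high = np.array([1.5])  # over-prediction

gamma = glum.GammaDistribution()

# ...but the gamma deviance weighs them by relative rather than absolute error,
# which is what matters for positive, long-tailed data.
print(gamma.deviance(y, mu_low))   # ~0.614
print(gamma.deviance(y, mu_high))  # ~0.144
```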

Your benchmarks are tricky to interpret. The inbuilt parametrisation is more accurate in terms of the metric under consideration (the deviance / LL). It isn't surprising that it isn't the best under other metrics. You should always be careful about your choice of metric and whether it's appropriate for your data and for your use case – the MSE isn't always the best!

Fun fact: the deviance is equal to the MSE if the data are normally distributed.
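You can check this numerically, for example (a quick sketch using glum's NormalDistribution; strictly speaking the deviance is the sum of squared errors, i.e. n times the MSE, so the two rank models identically):

```python
import glum
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1_000)
mu = y + rng.normal(scale=0.1, size=1_000)  # some arbitrary predictions

normal = glum.NormalDistribution()
print(normal.deviance(y, mu))  # normal-family deviance...
print(np.sum((y - mu) ** 2))   # ...matches the sum of squared errors (n * MSE)
```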

enriquecardenas24 commented 11 months ago

@lbittarello,

Thank you for your response, it has been very helpful. Distributional assumptions would be key here, since I'm modeling on Tweedie distributed response data.

In that case, I have one further question related to deviance calculation:

For my manual grid search, I have changed the evaluation metric to model deviance:

X_Train_Model = tabmat.from_pandas(df = X_Train, drop_first = True)
tweedie_p = 1.6
family = glum.TweedieDistribution(power = tweedie_p)

# Search through all combinations of strength and mixing
combinations = np.array(np.meshgrid(strength, mixing)).T.reshape(-1, 2)
metric_vals = []

for combo in combinations:
    strength1 = combo[0]
    mixing1 = round(combo[1], 1)

    glum_manual_EN = glum.GeneralizedLinearRegressor(alpha = strength1, l1_ratio = mixing1,
                                                     P1 = Incl_Reg, P2 = Incl_Reg, family = family, link = link_func,
                                                     )

    glum_manual_EN.fit(X_Train_Model, Data_All_Train[data_response], Data_All_Train[data_weight])
    mu = glum_manual_EN.predict(X_Train_Model, Data_All_Train[data_weight])
    deviance = glum_manual_EN.family_instance.deviance(Y_Train, mu, Data_All_Train[data_weight])
    metric_vals.append(deviance)

# Get best parameters based on deviance
min_idx = metric_vals.index(min(metric_vals))
best_params = combinations[min_idx]
strength1 = best_params[0]
mixing1 = round(best_params[1], 1)

# Fit model again using these parameters
glum_manual_EN = glum.GeneralizedLinearRegressor(alpha = strength1, l1_ratio = mixing1,
                                                 P1 = Incl_Reg, P2 = Incl_Reg, family = family, link = link_func)
glum_manual_EN.fit(X_Train_Model, Data_All_Train[data_response], Data_All_Train[data_weight])

When looking at each of the values, I get the following:

[image: deviance for each strength/mixing combination]

We can see that choosing a strength of near 0 would be optimal.

I have also reused the code from the first chunk in my initial post to use the built-in parametrization.

However, using the manual grid search method still appears to be more effective than the built-in parameter selection in terms of selecting a model with lower deviance. Below are results from 5 different models using the same dataset from my initial post:

[image: deviance results for the five models]

The first model, "glum_EN," equates to the first code chunk in my initial post; this makes use of built-in optimization. The second model, "glum_EN_no_alpha," is similar to the first model, except that it does not give strength input parameters, allowing glum to select a more accurate alpha. The third model is my manual grid search. The fourth model consistently uses parameters of 0 for each input. The fifth model is a statsmodels implementation, which does not use elastic net.

Speaking to the results, I am a bit confused. It appears that the built-in parametrization is not always selecting the parameters that give the lowest deviance either. Also, glum's deviance calculation reports a higher deviance than statsmodels' GLM implementation; this issue may be out of the scope of this thread, however.

Could there be a reason why glum isn't selecting the parameters that give the lowest model deviance? Furthermore, how could statsmodels be beating glum in terms of deviance, considering that no elastic net regularization has been placed on it?

lbittarello commented 11 months ago

Thanks for your results, @enriquecardenas24!

As the name suggests, GeneralizedLinearRegressorCV uses cross validation to tune the regularisation parameters. Here's the idea: (1) we first split the data into folds; (2) we then train several versions of the model for each parameter set, leaving one fold out at a time; (3) we predict from each model for the validation fold, which was not used for training; (4) finally, we score predictions.

Cross validation attempts to select the model that performs best for unseen data, replicating the conditions under which the model will be used. If you just choose the model that performs best on the train data, your model is likely to overfit and predictions may not extrapolate to new data. Scikit-learn has a nice discussion of cross validation here.

Here's some code for you to play with:

```python
# %%
from collections import defaultdict
from itertools import product

import glum
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# %%
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

grid_l1_ratio = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
grid_alpha = [0.05, 0.01, 0.005, 0.001]

family = glum.BinomialDistribution()
splitter = KFold()

# %%
deviances = defaultdict(list)

for alpha, l1_ratio in product(grid_alpha, grid_l1_ratio):
    mu = np.zeros(len(y))
    for train_ix, test_ix in splitter.split(X, y):
        mu[test_ix] = (
            glum.GeneralizedLinearRegressor(
                alpha=alpha, l1_ratio=l1_ratio, family="binomial"
            )
            .fit(X.iloc[train_ix], y.iloc[train_ix])
            .predict(X.iloc[test_ix])
        )
    deviances["alpha"].append(alpha)
    deviances["l1_ratio"].append(l1_ratio)
    deviances["deviance"].append(family.deviance(y, mu))

pd.DataFrame(deviances).sort_values("deviance")

# %%
cv_glm = glum.GeneralizedLinearRegressorCV(
    alphas=grid_alpha, l1_ratio=grid_l1_ratio, family="binomial", cv=splitter
)
cv_glm.fit(X, y)

print(cv_glm.alpha_)
print(cv_glm.l1_ratio_)

# %%
```

how could statsmodels be beating glum in terms of deviance, considering that no elastic net regularization has been placed on it?

To be honest, I'm puzzled. We actually have tests to check that glum produces the same results as statsmodels (here, for example). It would be helpful to have a minimal reproducible example.

enriquecardenas24 commented 11 months ago

@lbittarello,

Thank you for your expeditious reply.

To the first part of your reply, I think I understand. It seems like you are saying that a manual grid search won't generalize well past training data, so the built-in parametrization will be better than a manual search even if the deviance results returned are a bit higher for the built-in parameter selection. Thank you for the code, I will take a look.

To the second part regarding statsmodels, I have prepared a brief example in a .py file in this zipped file: GitHub Example.zip

Here is a brief explanation of each of the files:

  1. glum_EN_github_ex_722.py: The code file to get model results for each model using the input data.
  2. Model_2_Fields.csv: Used to get data types of each field. Specifically, it's used to convert data types of categorical fields in the input predictors.
  3. Model_2_Transforms.csv: Used for value encoding for categorical features, as well as for features where we want different levels to be treated as the same value. For example, where Field68652 would have 21 levels without encoding, we instead map several different levels to the same level.
  4. pre_banded_index.csv: The input dataset.

Hopefully this helps. Any clarity you could provide would be greatly appreciated.

lbittarello commented 11 months ago

I dug into your code a bit. The discrepancy arises because statsmodels doesn't multiply predictions by the sample weights when it computes the deviance, whereas glum does (because you pass the weights to the predict method in L109 of your script).

Whether you should multiply predictions by the weights depends on what the weights mean. I don't know the details of the data / application, so I can't say. But the mechanics are correct at least and glum and statsmodels arrive at the same model. :)
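To compare like with like, you could either not pass the weights to predict on the glum side or scale the statsmodels predictions in the same way. A rough sketch, reusing the names from your script (illustrative only, not a prescription):

```python
# glum scales predictions by sample_weight when it is passed to predict:
mu_glum_scaled = glum_EN_model.predict(X_Train_Model, Weights_Train)
mu_glum_plain = glum_EN_model.predict(X_Train_Model)

# statsmodels returns unscaled predictions, so these two should line up:
dev_glum = family.deviance(Y_Train, mu_glum_plain, Weights_Train)
dev_sm = family.deviance(Y_Train, sm_model.predict(X_Train), Weights_Train)
```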

enriquecardenas24 commented 11 months ago

@lbittarello,

This makes a lot of sense looking back. Giving weights of 1 for all records proves to give equal deviance results. Alternatively, using a manual family.deviance calculation with mu inputs of weighted predictions gives a deviance similar to that of the glum models. Thank you for pointing this out!

I have one more question then regarding the difference between glum elastic net models and statsmodels models. In your second most recent reply, you noted that glum should produce the same results as statsmodels. What did you mean by the term "results"? I don't see it being the predictions, because the predictions between the elastic net and statsmodels models differ slightly. Did you simply mean the deviance results?

lbittarello commented 11 months ago

I meant the same coefficients and predictions. The largest difference in predictions between glum and statsmodels is 0.000425, i.e. 0.00000002% – the predictions are essentially identical.

enriquecardenas24 commented 11 months ago

@lbittarello,

Can it be true that the predictions are the same? If so, what would differentiate glum's elastic net from statsmodels without elastic net?

Using the predict functions for each does return different predictions:

sm_mu = sm_model.predict(X_Train) * Weights_Train
glum_EN_mu = glum_EN_model.predict(X_Train_Model, Weights_Train)
pd.concat([sm_mu.reset_index(drop = True), pd.Series(glum_EN_mu)], axis = 1)

[image: side-by-side statsmodels and glum predictions]

lbittarello commented 11 months ago

Sorry, I misunderstood your question! I thought that you'd found statsmodels to beat glum with the same parametrisation.

To answer the original question: why do we end up with a higher deviance when we cross-validate the penalty? As it happens, the regularised model is actually better than the unregularised one in the context of cross-validation, but it's slightly worse when we train the model on the full train sample. This can happen because, when we cross-validate, we fit the candidate models on a subset of the train sample, so the optimal parameters may not be the best for the full train sample. Unfortunately, there's no easy fix. However, the risk of overfitting tends to be more serious than the risk of choosing slightly suboptimal parameters, so cross validation generally pays off.
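An illustrative sketch, continuing from my earlier snippet (it reuses X, y, grid_alpha, family and cv_glm from there): on the full train sample, the smallest penalty in the grid typically has the lowest in-sample deviance, while cross-validation may prefer a larger one.

```python
import glum

# In-sample deviance per alpha when fitting on the full train sample.
in_sample = {}
for alpha in grid_alpha:
    fitted = glum.GeneralizedLinearRegressor(alpha=alpha, family="binomial").fit(X, y)
    in_sample[alpha] = family.deviance(y, fitted.predict(X))

print(min(in_sample, key=in_sample.get))  # in-sample winner: usually the smallest alpha
print(cv_glm.alpha_)                      # the cross-validated choice can differ
```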

enriquecardenas24 commented 11 months ago

@lbittarello,

Got it. So in exchange for the chance of a bump in deviance on the train dataset, the elastic net models are expected to generalize better to unseen data.

So, since MSE won't be sufficient in evaluating glum's elastic net performance on non-normally distributed data, do you have a suggestion of how to evaluate a glum model's performance on unseen data? For example, would there be one similar to this example of calculating residual deviance in scikit-learn, perhaps using test data?

EDIT: I believe my question boils down to this:

How do I know that glum will generalize well past the training dataset if deviance isn't indicative enough on the training dataset, and MSE isn't the appropriate evaluation metric for the data distribution?

lbittarello commented 11 months ago

How do I know that glum will generalize well past the training dataset if deviance isn't indicative enough on the training dataset, and MSE isn't the appropriate evaluation metric for the data distribution?

I don't think that there's anything wrong with the deviance per se. If you're only working with the train sample, no metric will tell you much about out-of-sample performance.

There are a few strategies to test how well your model generalises. For example, you could cross-validate the metric of interest (the deviance, in this case) to have an idea of what it looks like on unseen data. You could also set a fraction of the train sample aside to use as a validation sample. Of course, these strategies are all subject to sampling noise, so they might not be able to tell very similar models apart perfectly, but they're still helpful.

It might also be a good idea to look at a range of metrics instead of a single one. For instance, you could compute the deviance, the Gini index and the RMSE. A trustworthy model should perform well across the board (though it may not be the best under any single metric).
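As a rough sketch (X_val / y_val stand in for a held-out validation sample, `model` for a fitted glum regressor, and the Tweedie power of 1.6 is taken from earlier in the thread):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_tweedie_deviance

mu_val = model.predict(X_val)  # predictions for the held-out sample

val_deviance = mean_tweedie_deviance(y_val, mu_val, power=1.6)
val_rmse = np.sqrt(mean_squared_error(y_val, mu_val))
```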

enriquecardenas24 commented 11 months ago

For example, you could cross-validate the metric of interest (the deviance, in this case) to have an idea of what it looks like on unseen data.

What do you mean by this? I do have data split into train/test, but I'm unsure how I would apply it to this advice. For example, I have been using fold 2 as the test and 1, 3, 4 as the train, and therefore the model deviance has previously been calculated on folds 1, 3 and 4. Do you mean calculating deviance for each of the test folds using the model trained on the other folds, then comparing those?

For example:

# Get train deviance
train_mu = model.predict(X_Train_Model, Data_All_Train[data_weight])
train_deviance = model.family_instance.deviance(Y_Train, train_mu, Data_All_Train[data_weight])
# Get test deviance
test_mu = model.predict(X_Test_Model, Data_All_Test[data_weight])
test_deviance = model.family_instance.deviance(Y_Test, test_mu, Data_All_Test[data_weight])

lbittarello commented 11 months ago

Do you mean calculating deviance for each of the test folds using the model trained on the other folds, then comparing those?

Yes – for example, you could use scikit-learn's cross_val_score.
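Something along these lines, for instance (a sketch only; it reuses strength1, mixing1, X_Train and the response column from your snippets, assumes the Tweedie power of 1.6 and leaves the sample weights out for brevity):

```python
import glum
from sklearn.metrics import make_scorer, mean_tweedie_deviance
from sklearn.model_selection import cross_val_score

# Negated so that higher scores are better, as scikit-learn scorers expect.
neg_tweedie_deviance = make_scorer(
    mean_tweedie_deviance, greater_is_better=False, power=1.6
)

model = glum.GeneralizedLinearRegressor(
    alpha=strength1, l1_ratio=mixing1,
    family=glum.TweedieDistribution(power=1.6), link="log",
)

# Mean out-of-fold deviance across 5 folds (negate the scores to get it back).
scores = cross_val_score(
    model, X_Train, Data_All_Train[data_response],
    scoring=neg_tweedie_deviance, cv=5,
)
print(-scores.mean())
```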