Closed: markpearl closed this issue 2 years ago.
@markpearl Thanks for your interest in the project.
I'm looking into the code for the xgboostlss class and it seems like the validation metric is hardcoded to use negative log-likelihood. Is there going to be flexibility to define the validation metric chosen? (i.e. MAE, etc.)
For any model to learn all the distributional parameters, it is essential to use a pre-specified distributional loss, which in most cases is the negative log-likelihood (nll). As such, since we want to learn the full distribution, a point-loss metric such as MAE/MSE is not currently planned to be included. However, you can derive any point measure, such as the MAE, from the forecasted distribution.
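To illustrate that last point, here is a minimal sketch (my own, not part of xgboostlss) of deriving the MAE from a forecasted distribution, assuming pred_samples is a hypothetical (n_obs, n_samples) array of draws from the per-observation predictive distributions and y_true holds the observed responses:
import numpy as np
# pred_samples: hypothetical (n_obs, n_samples) array of predictive draws
# y_true: observed responses, shape (n_obs,)
point_forecast = np.median(pred_samples, axis=1)   # the median is the MAE-optimal point forecast
mae = np.mean(np.abs(y_true - point_forecast))
print(f"MAE derived from the forecasted distribution: {mae:.4f}")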
As of now my tuning process is returning inf for each trial.
Hard to judge. Can you please provide a minimal working example to reproduce? What distribution are you using, and can the data be properly approximated with it?
@StatMixedML My confusion with that is that nll is typically used for binary or multi-class classification, whereas my problem is a regression one. I agree that MAE can be derived from the test dataset during inference, but I think it would be confusing to interpret the log-likelihood as part of the training process: how do I know what value of nll forms a baseline that I can iterate from? In section 5.2.1 of the article, which goes over the results for the Munich rent dataset, all of the validation metrics used are regression metrics.
For the tuning process I tried to use the scipy.stats.kstest method to determine the best-suited distribution from the list of available distributions.
D, p-value for norm = 0.38085369493961546, 0.0
D, p-value for beta = 0.161471440177816, 0.0
D, p-value for gamma = 0.9992046266400899, 0.0
D, p-value for t = 0.10798722651271923, 0.0
I'm not sure if I can rely on these results, though, as the p-value is strictly 0. Taking the minimum value of D yields the t distribution. When I use Student-T, I now get values for the tuning process:
[I 2022-06-27 15:24:40,516] Trial 0 finished with value: 14913.8016573 and parameters: {'eta': 0.030363087280807707, 'max_depth': 3, 'gamma': 1.5074442732613943e-06, 'subsample': 0.4857160809399238, 'colsample_bytree': 0.6366741608938994, 'min_child_weight': 211}. Best is trial 0 with value: 14913.8016573.
[I 2022-06-27 15:25:31,873] Trial 1 finished with value: 14740.788098300001 and parameters: {'eta': 0.8013476947482925, 'max_depth': 7, 'gamma': 0.0004148938237251032, 'subsample': 0.3759312069680463, 'colsample_bytree': 0.3474570556327573, 'min_child_weight': 365}. Best is trial 1 with value: 14740.788098300001.
[I 2022-06-27 15:26:27,362] Trial 2 finished with value: 15196.0383344 and parameters: {'eta': 0.0015590542353378792, 'max_depth': 1, 'gamma': 6.638115717563432e-05, 'subsample': 0.6559438243964698, 'colsample_bytree': 0.26827872970098154, 'min_child_weight': 87}. Best is trial 1 with value: 14740.788098300001.
[I 2022-06-27 15:27:31,126] Trial 3 finished with value: 15177.3821631 and parameters: {'eta': 0.004547334157724475, 'max_depth': 6, 'gamma': 0.012346846888834317, 'subsample': 0.7847970112617345, 'colsample_bytree': 0.6418041740033543, 'min_child_weight': 306}. Best is trial 1 with value: 14740.788098300001.
[I 2022-06-27 15:28:28,738] Trial 4 finished with value: 14963.0078844 and parameters: {'eta': 0.040946512023655214, 'max_depth': 4, 'gamma': 2.9779429635527073e-05, 'subsample': 0.2887880027234064, 'colsample_bytree': 0.3208686624698766, 'min_child_weight': 316}. Best is trial 1 with value: 14740.788098300001.
[I 2022-06-27 15:29:28,653] Trial 5 finished with value: 15334.980780399997 and parameters: {'eta': 2.887517996963759e-05, 'max_depth': 5, 'gamma': 0.00013714188090138904, 'subsample': 0.4426913850409225, 'colsample_bytree': 0.39689222886296116, 'min_child_weight': 156}. Best is trial 1 with value: 14740.788098300001.
[I 2022-06-27 15:30:32,658] Trial 6 finished with value: 15145.0560953 and parameters: {'eta': 0.001354429907013438, 'max_depth': 9, 'gamma': 11.638074587008026, 'subsample': 0.44853752176416084, 'colsample_bytree': 0.5459519142554762, 'min_child_weight': 57}. Best is trial 1 with value: 14740.788098300001.
Do you have recommendations for Python users to select the best available distribution? There's a significant amount of skewness and kurtosis in the histogram, as you can see here:
So a Student-T distribution with properly tuned values for skewness and kurtosis would make sense to me, as I don't believe this can be represented by a normal distribution due to the heavy tail.
@markpearl Let me address some of your concerns
My confusion with that is that nll is typically used for binary or multi-class classification, whereas my problem is a regression one.
Using the nll is a standard way of evaluating probabilistic regression models; it is not only used for classification tasks. In fact, the MSE loss can be derived by assuming a normal distribution with a homoscedastic variance of 1. Hence, whenever you train and evaluate using the MSE, you implicitly assume a Normal distribution and neglect the variance. In principle, you can override the custom evaluation metric in xgboostlss with an MSE / MAE function and use that for evaluation. However, if your main interest is in the MSE / MAE, then you'd preferably use XGBoost directly, since XGBoostLSS models all distributional parameters instead of the mean only.
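To make that connection explicit (my own sketch of the standard derivation, added for clarity): for a Gaussian likelihood with means $\mu_i$ and fixed variance $\sigma^2 = 1$,
$$ -\log L(\mu \mid y) = \sum_{i=1}^{n}\left[\tfrac{1}{2}\log(2\pi) + \tfrac{1}{2}\,(y_i - \mu_i)^2\right], $$
so, up to an additive constant and a factor of $n/2$, minimizing the nll is equivalent to minimizing the MSE $\tfrac{1}{n}\sum_{i}(y_i - \mu_i)^2$.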
how do I know what value of nll forms a baseline that I can iterate from?
You would usually compare the nll across several candidate distributions. Since it is the sum of per-observation log-likelihood contributions, it also depends on the sample size. Hence, there is no absolute baseline as such, only values relative to other distributions fitted to the same data.
For the tuning process I tried to use the scipy.stats.kstest method to determine the best-suited distribution from the list of available distributions. Do you have recommendations for Python users to select the best available distribution?
I wouldn't use the KS-test for this. Rather, I suggest you fit an unconditional version of each of the distributions to the data and compare their nll values, for example along the lines of the sketch below. Choose the distribution with the lowest nll.
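A minimal sketch of this idea using scipy.stats (the candidate list and the variable names, including y_train, are assumptions for illustration and not part of xgboostlss):
import numpy as np
from scipy import stats
y = np.asarray(y_train)  # hypothetical response vector
# Unconditional maximum-likelihood fit of each candidate and its nll
candidates = {"norm": stats.norm, "t": stats.t, "gamma": stats.gamma}
nll = {}
for name, dist in candidates.items():
    params = dist.fit(y)
    nll[name] = -np.sum(dist.logpdf(y, *params))
best = min(nll, key=nll.get)  # distribution with the lowest nll
print(nll, "->", best)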
So a Student-T distribution with properly tuned values for skewness and kurtosis would make sense to me, as I don't believe this can be represented by a normal distribution due to the heavy tail.
While I agree on the kurtosis part, the StudentT, as currently available, does not have a skewness parameter. If needed, one would have to implement a skewed StudentT distribution. Also, given that your data seems to be massively skewed, maybe you want to log-transform the data first and back-transform the predictions afterwards. I am afraid that any parametric distribution would have a hard time modelling such heavy skewness.
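As a minimal illustration of the log-transform idea (my own sketch; y_train and pred_log are hypothetical names):
import numpy as np
y_log = np.log1p(y_train)           # train on log1p-transformed responses
# ... fit XGBoostLSS (or any model) on y_log instead of y_train ...
pred_original = np.expm1(pred_log)  # back-transform predictions or predictive samples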
Thanks Alex, I was able to produce the above and iterate through each distribution to determine the lowest NLL. However, it seems like you're hardcoding the return of only opt_params.params rather than returning the full dictionary. Without this I'm losing out on what the nll value should be.
Are you able to change this so it also returns the full opt_params?
@markpearl At the end of each hyper-parameter tuning run, you receive a short summary.
This also includes the minimum average nll value across all k-folds. Is this what you mean? You can easily add the information, as shown in line 292 ff of your screenshot.
Alex,
What I'm saying is that the method shouldn't return opt_param.params; it should return the full opt_param, because I want to be able to access the nll for each distribution programmatically. I can't rely on the summary output to see the value; that's not scalable, especially since I have multiple response variables to train.
@markpearl You can use the internal evaluation function of xgboostlss. After hyper-parameter optimization, you would use the optimal parameters and re-train the model on the entire training data. In addition, you can pass an evaluation dataset on which the model additionally evaluates the nll. The code below is adapted from the example section; you would need to adjust the evaluation set.
###
# Imports
###
import numpy as np
import pandas as pd
import pkg_resources
import itertools
import shap
import math
import multiprocessing
from scipy.stats import norm
import xgboost as xgb  # explicit import, in case the star-import below does not expose xgb
from xgboostlss.model import *
from xgboostlss.distributions.Gaussian import Gaussian
from xgboostlss.datasets.data_loader import load_simulated_data
###
# Data
###
train, test = load_simulated_data()
n_cpu = multiprocessing.cpu_count()
X_train, y_train = train.iloc[:,1:],train.iloc[:,0]
X_valid, y_valid = test.iloc[:,1:],test.iloc[:,0]
dtrain = xgb.DMatrix(X_train, label=y_train, nthread=n_cpu)
deval = xgb.DMatrix(X_valid, label=y_valid, nthread=n_cpu)
###
# Distribution
###
distribution = Gaussian
distribution.stabilize = "None"
quant_sel = [0.05, 0.95]  # quantiles for pred_type="quantiles" (not used further in this snippet)
###
# Hyper-Parameter Optimization
###
np.random.seed(123)
params = {"eta": [1e-5, 1],
"max_depth": [1, 10],
"gamma": [1e-8, 40],
"subsample": [0.2, 1.0],
"colsample_bytree": [0.2, 1.0],
"min_child_weight": [0, 500]
}
opt_params = xgboostlss.hyper_opt(params,
dtrain=dtrain,
dist=distribution,
num_boost_round=10,
max_minutes=120,
n_trials=2,
silence=True)
###
# Model Training using evaluation set
###
n_rounds = opt_params["opt_rounds"]
del opt_params["opt_rounds"]
# Add evaluation set
eval_set = [(dtrain,"train"), (deval,"eval")]
eval_result = {}
xgboostlss_model = xgboostlss.train(opt_params,
dtrain,
dist=distribution,
num_boost_round=n_rounds,
evals=eval_set,
evals_result=eval_result)
# Extract nll-value from evaluation set
nll_eval = eval_result["eval"]["NegLogLikelihood"][-1]
print(f" \n\n\n NLL of evaluation set of optimal boosting round: {nll_eval}")
The output would look like this:
Awesome thanks for this!!
@markpearl Glad it helped.
Can we close the issue?
Yes! One last question I had, Alex, was around the use of expectile regression. I seem to be getting NLL values of 14-17k for Student-T and BCT, so I'm assuming it's safe to say that a parametric distribution would not be effective in my case.
Would expectiles give me good results for creating prediction intervals in my case, knowing that my distribution is so skewed?
On top of this question, if I decided to use a parametric distribution such as BCT or Student-T, how would I be able to get prediction-interval results on a per-observation basis for inference? I tried using pred_type as 'quantiles' and specified the quantiles as quant_sel = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.95], but it gave me constant results. Would I be able to use pred_type as 'response' and get samples from the distribution aligned with the timestamp column as well?
Thanks,
Mark
Yes! One last question I had, Alex, was around the use of expectile regression. I seem to be getting NLL values of 14-17k for Student-T and BCT, so I'm assuming it's safe to say that a parametric distribution would not be effective in my case.
I would agree. Given the heavy skewness, modelling the data without transformations would give any parametric model a hard time, especially when it comes to model stability.
Would expectiles give me good results for creating prediction intervals in my case, knowing that my distribution is so skewed?
You can definitely give it a try. Just recall that for tau=0.5, expectile regression coincides with mean regression, i.e., the least-squares loss you would get under a Gaussian assumption. What is known from the literature is that expectile intervals tend to be a little narrower than quantile intervals.
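As a small numeric sketch of the tau=0.5 statement (my own toy example, not from xgboostlss): the tau-expectile of a sample solves an asymmetrically weighted mean equation, and for tau=0.5 it recovers the ordinary mean.
import numpy as np

def empirical_expectile(y, tau, n_iter=100):
    # Fixed-point iteration: the tau-expectile m satisfies m = sum(w * y) / sum(w)
    # with weights w_i = tau if y_i > m, else (1 - tau).
    m = np.mean(y)
    for _ in range(n_iter):
        w = np.where(y > m, tau, 1.0 - tau)
        m = np.sum(w * y) / np.sum(w)
    return m

rng = np.random.default_rng(123)
y = rng.gamma(shape=2.0, scale=3.0, size=10_000)                   # skewed toy sample
print(empirical_expectile(y, 0.5), y.mean())                       # tau=0.5 recovers the mean
print(empirical_expectile(y, 0.05), empirical_expectile(y, 0.95))  # asymmetric interval bounds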
On top of this question, if I decided to use a parametric distribution such as BCT or Student-T, how would I be able to get prediction-interval results on a per-observation basis for inference? I tried using pred_type as 'quantiles' and specified the quantiles as quant_sel = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.95], but it gave me constant results. Would I be able to use pred_type as 'response' and get samples from the distribution aligned with the timestamp column as well?
Both ways you describe are valid ones, even though it sounds odd that the quantiles are constant. Maybe this has to do with the amount of skewness and the assumed distribution being inadequate. Let me know how things go.
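If you go the pred_type = 'response' route, one way to get per-observation intervals is to compute empirical quantiles over the sampled draws. A minimal sketch with numpy (assuming pred_samples is a hypothetical (n_obs, n_samples) array of draws from the predictive distributions):
import numpy as np
# pred_samples: hypothetical (n_obs, n_samples) array of predictive draws
lower = np.quantile(pred_samples, 0.05, axis=1)  # per-observation 5% bound
upper = np.quantile(pred_samples, 0.95, axis=1)  # per-observation 95% bound
intervals = np.column_stack([lower, upper])      # one (lower, upper) pair per observation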