elephaint / pgbm

Probabilistic Gradient Boosting Machines
Apache License 2.0

Could sample_weights and monotone_constraints be added to PGBM? #4

Closed flippercy closed 3 years ago

flippercy commented 3 years ago

Hi:

Is it possible to add sample_weights and monotone_constraints to the fit function, as LightGBM has? That would enable the algorithm to handle weighted datasets and to incorporate domain knowledge.

Thank you.

elephaint commented 3 years ago

Hi,

I need to investigate how much effort it takes, but sample_weights should be relatively easy (I think sample weights are mostly captured in the loss function, so this should be an easy addition). Monotone constraints: I don't know yet, I need to look into that!

May I ask what your problem / setup is? (Out of curiosity - I'm just interested in learning what users are using the package for. Don't feel obliged to share, though.)

flippercy commented 3 years ago

Hi @elephaint:

Thank you for the reply. I'd be happy to share some background information on the business problems I face and how I plan to use PGBM to tackle them.

On the one hand, executives sometimes need probabilistic predictions rather than point predictions to make decisions. For example (pardon me, this is not the real problem I am working on, but the logic is identical), assume we want to build a model to predict the market value of houses and compare the result with the owners' listing prices to determine whether a house is overpriced. In this case, instead of the point prediction, business leaders usually prefer to base strategies on information such as "whether the listing price of this house is greater than the market value of 90% of houses in similar condition (the 90th percentile)" or "whether the listing price is greater than the mean of the market value plus two standard deviations". In this setting, PGBM will be a perfect tool as long as it can return the mean and standard deviation of the predicted distribution for every data point.

On the other hand, our industry, financial services, is highly regulated by the US government. For example, if a consumer applies for a credit card but gets rejected, we are obliged to send him/her a letter explaining the main reasons for the rejection. In this case, monotonicity is a must because certain domain knowledge or common sense has to be respected. For example, it is well accepted that the higher someone's debt-to-income ratio (DTI), the higher the risk. Therefore, for a model predicting applicant risk that uses DTI as a variable, the relationship between the value of DTI and the final predicted risk must be monotonically increasing (the higher the DTI, the higher the predicted risk); otherwise, if the relationship is U-shaped, we would have to send letters to some applicants stating "You were rejected because your DTI was too high" while telling others "You were rejected because your DTI was too low". This would cause great confusion and would not be approved by the regulators.

Hope the information above helps. Let me know if you have more questions.

Best Regards

elephaint commented 3 years ago

Thanks for the explanations! Sorry for taking a bit more time than expected. The explanation makes sense - I've worked on pricing in business too, and this is also one of the reasons I created this package (I think more business executives should require probabilistic predictions instead of only point forecasts, but in my experience this is not yet very common).

Re. the mean and std of the distribution, see the other issue. The learned mean and variance are available as an output now. Note that these should be (more or less) equal to the empirical mean and variance of the output distribution (i.e. if you sample many points, the sample mean and variance will be approximately equal to those output by the function).

Re. sample weights: these can be incorporated in the loss function using the levels specifier. Example below for the MSE loss function (this is similar to how sample weights are implemented in LightGBM, for example).

import torch

def mseloss_objective(yhat, y, levels):
    # Weight each sample's gradient and hessian by its sample weight (passed via levels)
    gradient = (yhat - y) * levels
    hessian = torch.ones_like(yhat) * levels

    return gradient, hessian

and calling the regressor as follows:

# Torch tensors of sample weights, with the same shape as the corresponding targets
weights_train = [weights of shape y_train]
weights_valid = [weights of shape y_val]
model = PGBM()
model.train(train_set, objective=mseloss_objective, metric=[some_metric], valid_set=val_set, params=params, levels_train=weights_train, levels_valid=weights_valid)

Note that the levels specifier only works with the Torch backend. Hence, the weights should be Torch tensors on the same device you are training on.
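
For illustration, a minimal sketch of constructing such a weights tensor is shown below; the uniform weights and the y_train / y_val names are placeholders for this example, not part of the package API.

import torch

# Place the weights on the same device used for training (GPU if available).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Illustrative uniform weights; in practice, substitute your actual sample weights.
weights_train = torch.ones(len(y_train), device=device)
weights_valid = torch.ones(len(y_val), device=device)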

Re. the monotone constraints: this is a bit trickier; I haven't found the time yet to work on it. I expect it to take a bit longer (probably somewhere in October).

Again, hope this helps,

Olivier

flippercy commented 3 years ago

Thank you Olivier! The upgrade really helps!

elephaint commented 3 years ago

Hi,

I have added the option to include monotone_constraints. For an example, see here (Torch version) or here (Numba version). It's still in beta - I'm seeing some situations where the constraints are not fully met, so make sure to double-check.

Basically, it works similarly to LightGBM or XGBoost: specify a list of length n_features, where an entry of 1 corresponds to a positive monotonic constraint, 0 denotes no constraint, and -1 denotes a negative monotonic constraint. There should be a negligible impact on training speed; to improve how strictly the constraints are enforced, you can set the parameter monotone_iterations to a higher value (the default is 1), but this comes at the expense of slower training.
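
For instance, a three-feature specification might look like the sketch below. The parameter names follow this thread; treat the exact keys as an assumption and verify against the linked examples.

# Hypothetical three-feature setup: feature 0 increasing, feature 1 unconstrained,
# feature 2 decreasing.
params = {
    'monotone_constraints': [1, 0, -1],   # one entry per feature
    'monotone_iterations': 1,             # higher = stricter enforcement, slower training
}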

Hope this works for you!

Edit: make sure to upgrade to version 1.1 in order to get this functionality.

flippercy commented 3 years ago

Thank you! I will check it later next week.

Just curious: does the monotone constraint work with the point prediction only, or also with the probabilistic prediction? If it works with the probabilistic prediction, how is that achieved, and how does it impact the predicted mean and variance of each leaf? For example, if the direction of a variable is positive (+1), will it hold that the greater the value of this variable, the greater the mean of the fitted distribution?

Thanks a lot.

elephaint commented 3 years ago

The monotone constraints work with regard to the mean of the distribution. Hence, the algorithm guarantees, for example, that the mean of the prediction for sample A will be less than the mean for sample B (given a positive monotone constraint and a feature value for sample A that is less than for sample B). So yes, the last part you stated is correct.

I've released a new version (1.2) that fixes a few bugs (specifically for the Torch-GPU version and a bug relating to the calculation of monotone constraints - the latter now works as expected in all cases), but perhaps more importantly I've also included a scikit-learn wrapper. So you can now simply do, e.g.:

from pgbm import PGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
model = PGBMRegressor().fit(X_train, y_train)
yhat_point = model.predict(X_test)      # point predictions
yhat_dist = model.predict_dist(X_test)  # samples from the predictive distribution

For the Numba version, just replace pgbm with pgbm_nb. This wrapper uses the standard MSE loss and RMSE evaluation metric, but you can supply your own loss function as a parameter. See also here for the PyTorch version and here for the Numba version for more details about the parameters.
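
Tying this back to the earlier house-pricing example, the sampled forecasts can be turned into percentile estimates with plain NumPy. A sketch below, assuming the first axis of yhat_dist indexes the sampled forecasts and that listing_prices is a hypothetical array of listing prices:

import numpy as np

# 90th percentile of the predicted market value per test house
# (assumes yhat_dist has shape (n_forecasts, n_test_samples); convert to NumPy if needed).
p90 = np.quantile(np.asarray(yhat_dist), 0.9, axis=0)

# Flag houses whose (hypothetical) listing price exceeds the 90th percentile.
overpriced = listing_prices > p90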

flippercy commented 3 years ago

@elephaint:

Thank you for the reply! Does monotone_constraints impact the predicted variance, too?

elephaint commented 3 years ago

Yes, in the sense that due to the monotone constraints, some splits are no longer allowed. Hence, this will result in different means/variances learned per tree.

To be complete: because we use continuous distributions (which are defined on the interval -inf to +inf), such as the Normal or Student's t, there is always a non-zero probability that some of the probabilistic forecasts for sample B are less than those for sample A, despite a monotone constraint saying it should be the other way around. On average, however (i.e. if you take the mean of the forecast, or equivalently the point forecast), it will obey the monotone constraint (because the mean is forced to obey it).

The picture below explains it better - suppose the distribution of forecasts for sample A is the left distribution, and the distribution of forecasts for sample B is the right distribution. Due to the monotone constraint, we have forced the mean of the distribution for sample B to always lie to the right of the mean of the left distribution (sample A). However, there is always some non-zero overlap between these distributions (no matter how far apart the means are!).

[Figure: two overlapping distributions, with the mean of sample B's distribution (right) above that of sample A's (left)]

What does this mean? It means that if you generate n forecasts for sample A and sample B, some forecasts for sample B will be less than those for sample A. How many? That depends on how far apart the means are and how large the variance of each distribution is.
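
To get a rough sense of that overlap, here is a small illustrative sketch (it does not use PGBM itself, just two Normal forecast distributions whose means obey a monotone ordering):

import torch

# Two illustrative Normal forecast distributions: mean_B > mean_A, equal variance.
torch.manual_seed(0)
n = 100_000
samples_A = torch.distributions.Normal(10.0, 2.0).sample((n,))
samples_B = torch.distributions.Normal(12.0, 2.0).sample((n,))

# Fraction of paired draws where B ends up below A, despite mean_B > mean_A.
overlap = (samples_B < samples_A).float().mean()
print(f"P(B < A) = {overlap:.3f}")  # roughly 0.24 for these parameters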

Note that this is the correct and desired behaviour (for any probabilistic forecast of continuous variables).

flippercy commented 3 years ago

@elephaint:

Makes sense. Thank you very much for the reply!