Vinnie-Palazeti commented 2 years ago

Describe the bug I am returning predictions intervals that do not contain the point predictions. The data I am working with is lognormally distributed and scaled-down, as shown below, by a factor of 10. Am I misunderstanding something fundamental about this type of data distribution? or should the prediction intervals contain the point predictions?

To Reproduce

import torch
import pgbm
import numpy as np
from scipy.stats import halfnorm
import matplotlib.pyplot as plt

scale = True

def objective(yhat, y, sample_weight=None):
    gradient = (yhat - y)
    hessian = torch.ones_like(yhat)
    return gradient, hessian

def rmseloss_metric(yhat, y, sample_weight=None):
    loss = (yhat - y).pow(2).mean().sqrt()
    return loss

params = {
      'min_split_gain':0,
      'min_data_in_leaf':2,
      'max_leaves':6,
      'max_bin':64,
      'learning_rate':0.1,
      'n_estimators':60,
      'verbose':2,
      'feature_fraction':1,
      'bagging_fraction':1,
      'seed':2008,
      'reg_lambda':1,
      'derivatives':'exact', 
      'distribution':'lognormal' 
}

if scale:
    s = np.random.lognormal(0.011, 0.6, 1000) / 10
else: 
    s = np.random.lognormal(0.011, 0.6, 1000)

eps = halfnorm.rvs(loc=0.0013, scale=0.001, size=1000, random_state=2008)
x_tmp = ((s*2 + 100)/ 2) + eps
eps = halfnorm.rvs(loc=0.0013, scale=0.001, size=1000, random_state=2007)
x_tmp_2 = (s + 5) + eps

train_data = (
    np.vstack([x_tmp, x_tmp_2]).T, 
    s 
)

model = PGBM()
model.train(
    train_data, 
    objective = objective, 
    metric = rmseloss_metric, 
    params = params
)

# test data
np.random.seed(20)
if scale:
    s_test = np.random.lognormal(0.011, 0.6, 1000) / 10
else: 
    s_test = np.random.lognormal(0.011, 0.6, 1000)

eps = halfnorm.rvs(loc=0.0013, scale=0.001, size=1000, random_state=2009)
x_tmp_test = ((s*2 + 100)/ 2) + eps
eps = halfnorm.rvs(loc=0.0013, scale=0.001, size=1000, random_state=2006)
x_tmp_test_2 = (s + 5.6) + eps

x_tmp_t = np.vstack([x_tmp_test, x_tmp_test_2]).T

yhat_dist_pgbm = model.predict_dist(x_tmp_t)
yhat_point_pgbm = model.predict(x_tmp_t)

plt.figure(figsize=(20,10))
plt.rcParams.update({'font.size': 10})
plt.plot(s_test, 'o', label='Actual')
plt.plot(yhat_point_pgbm.numpy(), 'ko', label='Point prediction PGBM')
plt.fill_between(np.arange(len(s_test)), 
                 yhat_dist_pgbm.min(0)[0].numpy(), 
                 yhat_dist_pgbm.max(0)[0].numpy(), 
                 color="#b9cfe7", label='Uncertainty')

plt.title('Probabilistic Predictions - PGBM')
plt.xlabel('Sample')
plt.ylabel('Prediction')
plt.legend(ncol=3);

with scale = True

with scale = False

Screen Shot 2022-04-13 at 9 41 42 AM

elephaint commented 2 years ago

Hi, thanks for reporting! The prediction intervals should contain the point predictions, so there might be a bug here. Let me check your code and the source code.

Vinnie-Palazeti commented 2 years ago

@elephaint forgot to add the scale = False image.

I haven't looked through the pgbm.py file closely, so sorry for not being helpful. Though, even in the second image there are point predictions outside of the plotted bounds. I did not noticed this before.

elephaint commented 2 years ago

Hi,

There was an error in how the lognormal distribution was fitted against the obtained empirical mean and variance from the algorithm, which would only manifest itself if the underlying location of the normal distribution was less than zero. This is the case when scale=True in your example, but not when scale=False. Anyways, I fixed the bug and there is a new version 1.7 that does display the correct behaviour in my offline test:

githubissue

Note that it can be the case of course that the observed values are outside the prediction intervals. However, the means will now be included in the prediction intervals. Hope that helps and that your error is fixed.

Vinnie-Palazeti commented 2 years ago

@elephaint I really appreciate the quick response. Thank you & be well!

elephaint / pgbm

Prediction Intervals for Scaled Lognormal Distribution #13

with scale = True

with scale = False