koaning commented 4 days ago

What happened + What you expected to happen

During the probabl livestream last week (YT link here, notebook here), I may have stumbled on a bug. Figured that I should report it.

The short story is that while the input dataset has no negative values, some of the predicted values are negative. For a linear model this could make sense, but for a boosted tree model it does not. Tree models, after all, can only interpolate the training data. It is something that became a talking point during this segment of the livestream.

Possible cause

After diving a bit deeper I may have found a good lead on the cause too. My dataset has hourly data, but a few timeslots are missing. I am predicting the number of people that leave a subway station, and these stations can be closed for a few hours of the day. The rows for those hours do not show up in my original dataset. When I was using mlforecast this didn't give me any warnings, but when I gave the dataset to TimeGPT I was prompted to use fill_gaps to make sure that there are no missing rows.
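To illustrate why missing rows could matter, here is a minimal sketch that compares a lag taken by row position with a lag taken by clock time. The data and both helper functions are hypothetical, and the sketch assumes lag features are built by shifting rows; it does not reproduce mlforecast internals.

```python
from datetime import datetime, timedelta

# Hypothetical hourly counts; 02:00 and 03:00 are missing (station closed).
rows = [
    (datetime(2024, 1, 1, 0), 12),
    (datetime(2024, 1, 1, 1), 5),
    (datetime(2024, 1, 1, 4), 30),
    (datetime(2024, 1, 1, 5), 80),
]


def row_lag(rows, k):
    """Lag by position: the value k rows earlier, whatever its timestamp."""
    return [rows[i - k][1] if i >= k else None for i in range(len(rows))]


def time_lag(rows, delta):
    """Lag by time: the value observed exactly `delta` earlier, if any."""
    by_time = dict(rows)
    return [by_time.get(ts - delta) for ts, _ in rows]


positional = row_lag(rows, 1)
by_clock = time_lag(rows, timedelta(hours=1))

for (ts, _), p, c in zip(rows, positional, by_clock):
    print(ts.strftime("%H:%M"), "row lag:", p, "time lag:", c)
```

For the 04:00 row the positional lag returns the 01:00 value, which is three hours old, while the time-based lag has no value at all. A "lag 1" feature built by position would therefore not mean "one hour ago" around the gaps.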

When I apply fill_gaps to my data before passing it to MLForecast, the boosted tree model no longer predicts negative numbers. This suggests to me that it might be good to show a similar warning here. I am not completely aware of the Nixtla internals, so I might be missing an important detail, but since silent failures can be painful I figured I should at least write up this report.
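For readers unfamiliar with the idea, the following is a rough pure-Python sketch of what filling gaps means for a single hourly series. The function `fill_hourly_gaps` and the sample data are hypothetical and are not the fill_gaps utility itself; filling the missing hours with 0 is an assumption that only makes sense if a closed station really means zero exits.

```python
from datetime import datetime, timedelta


def fill_hourly_gaps(rows, fill_value=None):
    """Return rows on a complete hourly grid between the first and last timestamp."""
    by_time = dict(rows)
    start, end = min(by_time), max(by_time)
    filled = []
    ts = start
    while ts <= end:
        # Keep the observed value if present, otherwise insert the fill value.
        filled.append((ts, by_time.get(ts, fill_value)))
        ts += timedelta(hours=1)
    return filled


# Hypothetical series with 02:00 and 03:00 missing.
rows = [
    (datetime(2024, 1, 1, 0), 12),
    (datetime(2024, 1, 1, 1), 5),
    (datetime(2024, 1, 1, 4), 30),
]

filled = fill_hourly_gaps(rows, fill_value=0)
for ts, value in filled:
    print(ts.strftime("%H:%M"), value)
```

After filling, every row is exactly one hour after the previous one, so a positional shift and a shift by one hour agree.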

Versions / Dependencies

mlforecast version 0.15.0

Reproduction script

I added a notebook link in the above description, as well as a YT link that shows the error. While reproduction could be useful, my current impression is that the main issue here is the fact that an error message is missing.

I figured I would set the severity to medium on this one. Silent failures can make the whole stack crumble, but I have technically found a workaround.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

jmoralez commented 4 days ago

Hey @koaning, thanks for raising this. Is there a place where I can download the data?

koaning commented 3 days ago

The notebook links to this repository. It was originally found on Kaggle.

jmoralez commented 3 days ago

Thanks, sorry I missed that. I re-read the issue, and the statement that boosting can't produce predictions outside the original target range isn't true. It holds for regular decision trees and random forests, but boosting is an additive algorithm, so it can definitely produce values outside the original range. Here's an example:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

# Synthetic data: random features, target drawn from {0, 1, 2} (mostly zeros)
rng = np.random.default_rng(seed=0)
X = rng.random((10_000, 4))
y = rng.choice([0, 1, 2], size=10_000, replace=True, p=[0.8, 0.1, 0.1])
model = HistGradientBoostingRegressor().fit(X, y)
preds = model.predict(X)
# The training target is never negative, yet some in-sample predictions are
assert y.min() == 0
assert preds.min() < 0
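The additive effect can also be traced by hand on a toy problem. The sketch below is a hand-rolled illustration, not scikit-learn's implementation: it assumes squared-error boosting with depth-1 stumps, a learning rate of 1, and a split feature that simply alternates between stages instead of being chosen by a split search.

```python
# Toy data: y is the logical AND of two binary features, so min(y) == 0.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0.0, 0.0, 0.0, 1.0]


def fit_stump(X, residuals, feature):
    """Depth-1 tree on one binary feature: each leaf predicts its mean residual."""
    leaves = {}
    for value in (0, 1):
        members = [r for x, r in zip(X, residuals) if x[feature] == value]
        leaves[value] = sum(members) / len(members)
    return leaves


def boost(X, y, n_stages=2, learning_rate=1.0):
    """Squared-error boosting: start at the mean, then add stumps fit to residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    for stage in range(n_stages):
        feature = stage % 2  # simplification: alternate features, no split search
        residuals = [t - p for t, p in zip(y, preds)]
        leaves = fit_stump(X, residuals, feature)
        preds = [p + learning_rate * leaves[x[feature]] for x, p in zip(X, preds)]
    return preds


preds = boost(X, y)
print(preds)  # [-0.25, 0.25, 0.25, 0.75]
```

Each stump only ever predicts a mean of residuals, but the corrections from the two stages add up: the point (0, 0) starts at the mean 0.25 and receives -0.25 from each stage, ending at -0.25 even though no training target is below 0. A single tree or an average of trees cannot do this, because every leaf value is then an average of observed targets.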

koaning commented 2 days ago

d0h! @jmoralez yeah, you're right. Thanks for the example!

It might still be a good idea to warn folks about the fill_gaps utility. But I will leave it up to you to make a new issue for that or to rename this one.

jmoralez commented 2 days ago

I'll open a new issue for that. Thanks!

Nixtla / mlforecast

MLForecast and negative boosted tree predictions #457
