koaning closed this issue 2 days ago
Hey @koaning, thanks for raising this. Is there a place where I can download the data?
The notebook links to this repository. It was originally found on Kaggle.
Thanks, sorry I missed that. I re-read the issue, and the statement that boosting can't produce predictions outside the range of the original target isn't true. It holds for regular decision trees and random forests, but boosting is an additive algorithm, so it can definitely produce values outside the original range. Here's an example:
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
rng = np.random.default_rng(seed=0)
X = rng.random((10_000, 4))
y = rng.choice([0, 1, 2], size=10_000, replace=True, p=[0.8, 0.1, 0.1])
model = HistGradientBoostingRegressor().fit(X, y)
preds = model.predict(X)
assert y.min() == 0
assert preds.min() < 0
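For contrast, here is a minimal sketch of the random forest case mentioned above. It is an illustration rather than part of the original report: the smaller sample size, n_estimators=20 and random_state=0 are arbitrary choices made only to keep it quick. Each tree predicts the mean of the training targets in a leaf and the forest averages its trees, so the predictions should stay inside the range of y.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=0)
X = rng.random((2_000, 4))
y = rng.choice([0, 1, 2], size=2_000, replace=True, p=[0.8, 0.1, 0.1])

# Each tree outputs a leaf mean of training targets; the forest averages the trees.
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
forest_preds = forest.predict(X)

# Averages of values in [y.min(), y.max()] cannot leave that interval.
tol = 1e-9  # guard against floating-point round-off only
forest_in_range = bool(
    forest_preds.min() >= y.min() - tol and forest_preds.max() <= y.max() + tol
)
print(forest_in_range)
```

Boosting behaves differently because each new tree is fit to residuals and added to the running prediction, so the sum is not bounded by the training targets.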
d0h! @jmoralez yeah, you're right. Thanks for the example!
It might still be a good idea to warn folks about the fill_gaps utility. But I will leave it up to you to make a new issue for that or to rename this one.
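To make the idea of such a warning concrete, here is a rough pandas sketch of a check for missing timestamps. It is not how mlforecast or fill_gaps work internally. The helper find_missing_timestamps is hypothetical, the column names unique_id, ds and y are assumed to follow the usual Nixtla convention, and the toy station data is made up.

```python
import pandas as pd


def find_missing_timestamps(df, freq, id_col="unique_id", time_col="ds"):
    """Hypothetical helper: list the timestamps absent from each series."""
    missing = []
    for uid, group in df.groupby(id_col):
        # Full grid between the first and last observation of this series.
        expected = pd.date_range(
            group[time_col].min(), group[time_col].max(), freq=freq
        )
        absent = expected.difference(pd.DatetimeIndex(group[time_col]))
        missing.append(pd.DataFrame({id_col: uid, time_col: absent}))
    return pd.concat(missing, ignore_index=True)


one_hour = pd.Timedelta(hours=1)
hours = pd.date_range("2024-01-01 00:00", periods=6, freq=one_hour)
df = pd.DataFrame({"unique_id": "station_a", "ds": hours, "y": [5, 3, 0, 0, 4, 6]})
# Simulate a station that is closed at 02:00 and 03:00: the rows are simply absent.
df = df.drop(index=[2, 3])

gaps = find_missing_timestamps(df, freq=one_hour)
if not gaps.empty:
    print(f"Found {len(gaps)} missing timestamps; consider filling the gaps first.")
```

A check along these lines could back a warning, since lag features computed on a series with gaps silently refer to the wrong hours.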
I'll open a new issue for that. Thanks!
What happened + What you expected to happen
During the probabl livestream last week (YT link here, notebook here), I may have stumbled on a bug. Figured that I should report it.
The short story is that while the input dataset has no negative values, some of the predicted values are negative. For a linear model this could make sense, but for a boosted tree model it does not. Tree models, after all, can only interpolate the training data. It is something that became a talking point during this segment of the livestream.
Possible cause
After diving a bit deeper I may have found a good lead on the cause too. My dataset has hourly data but there are a few timeslots missing. I am predicting the number of people that leave a subway station, and these stations can be closed during a few hours in the day. These rows do not show up in my original dataset. When I was using mlforecast this didn't give me any warnings, but when I gave the dataset to TimeGPT I was prompted to use fill_gaps to make sure that there are no missing rows.
When I apply fill_gaps to my data before passing it to MLForecast, the results do not show negative numbers for the boosted tree model anymore. This suggests to me that it might be good to throw a similar warning message here? I am not completely aware of the Nixtla internals, so I might be missing an important detail here, but since silent warnings can be painful I figured I should at least write up this report here.
Versions / Dependencies
mlforecast version 0.15.0
Reproduction script
I added a notebook link in the above description, as well as a YT link that shows the error. While reproduction could be useful, my current impression is that the main issue here is the fact that an error message is missing.
I figured I'd set the severity to medium on this one. Silent failures can make the whole stack crumble, but I have technically found a workaround.
Issue Severity
Medium: It is a significant difficulty but I can work around it.