Closed — BeitianMa closed this issue 6 months ago
Hello @BeitianMa, either there's a bit of confusion or I misunderstood your message.
The `delta` parameter affects the loss function, i.e. given $y$, $X$, and a coefficient $\beta$, it determines how the loss $L_\delta(\beta; X, y)$ is computed. You can see exactly how that loss is computed by considering https://github.com/JuliaAI/MLJLinearModels.jl/blob/27b377257a3de81b899f9fc616c36046fc3942be/src/loss-penalty/robust.jl#L9 and https://github.com/JuliaAI/MLJLinearModels.jl/blob/27b377257a3de81b899f9fc616c36046fc3942be/src/loss-penalty/robust.jl#L39-L45
When an individual residual is below $\delta$ in absolute value, the loss is quadratic; outside of that range, it is linear.
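That piecewise definition can be sketched as follows (a minimal illustration of the standard Huber loss, not the library's actual code):

```python
import numpy as np

def huber(r, delta):
    # quadratic inside [-delta, delta]; linear outside, with matching
    # value and slope at the breakpoint |r| = delta
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

huber(np.array([0.5, 3.0]), 1.0)  # -> [0.125, 2.5]
```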
If I understand your question correctly, you would like to determine $\delta$ dynamically:
this is not supported by this library, but it should be reasonably easy to do with a for loop, even if it's a bit ugly. You could use a bracketing method to find $\delta^\star$: start with a pair of very small and very large deltas, fit, split the interval in two, figure out which half to keep exploring, and so on until convergence.
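A rough sketch of that bracketing loop (names like `fit_huber` and `find_delta` are illustrative, and the fit here uses a generic `scipy.optimize.minimize` call rather than this library):

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(r, delta):
    # sum of Huber losses: quadratic for |r| <= delta, linear beyond
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta)).sum()

def fit_huber(X, y, delta):
    # illustrative Huber fit via generic optimisation (not MLJLinearModels)
    res = minimize(lambda b: huber_loss(y - X @ b, delta), np.zeros(X.shape[1]))
    return res.x

def find_delta(X, y, target=0.001, lo=1e-6, hi=1e6, tol=1e-3):
    # bisect on delta until roughly a `target` fraction of |residuals|
    # exceed it (larger delta -> fewer points flagged as outliers)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        r = y - X @ fit_huber(X, y, mid)
        if np.mean(np.abs(r) > mid) > target:
            lo = mid  # too many points flagged -> raise delta
        else:
            hi = mid
    return 0.5 * (lo + hi)

# tiny demo: gaussian noise, aim to flag ~5% of points
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=300)
delta = find_delta(X, y, target=0.05)
```

Each bisection step refits the model, so this is a fit-in-a-loop exactly as described above; it converges in roughly $\log_2((\text{hi}-\text{lo})/\text{tol})$ fits.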
Closing as resolved, as far as I can tell. Feel free to re-open.
The parameter `delta` seems to be an absolute value, i.e., the model will treat the error terms which exceed `delta` as outliers. My question is: how can I keep the 0.1% most extreme values as outliers? Relative breakpoint values may be more common and intuitive than absolute values.

I have also considered `HuberRegressor` from `sklearn` in Python. In `sklearn`, the corresponding parameter is `epsilon`, which means the errors which exceed `epsilon` times the standard deviation will be treated as outliers. It cannot solve my problem because when the error term is not from a normal distribution, the Z-value corresponding to the 99.9% quantile is uncertain. An alternative method from Stack Overflow is to iteratively search for the corresponding `epsilon`, which is obviously time-consuming for a large dataset.

So far, the only approach I have found that meets the requirements of my program is `sklearn`'s `GradientBoostingRegressor`: one can set the `loss` parameter to `"huber"` and specify the breakpoint as a quantile; it uses the algorithm from Friedman (2001). Tests indicate that it is rather fast.

Unfortunately, after looking through the `sklearn` source code, I discovered that the corresponding `HuberLoss` is a private class and cannot be integrated with any model other than the GBRT model.

Thanks in advance, and apologies for my poor English :)
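For reference, the `GradientBoostingRegressor` setup described above looks roughly like this (synthetic data; in sklearn, `alpha` is the quantile at which the huber loss switches from quadratic to linear):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# huber loss with the breakpoint set at the 99.9% residual quantile
model = GradientBoostingRegressor(loss="huber", alpha=0.999)
model.fit(X, y)
preds = model.predict(X)
```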