Closed — BeitianMa closed this issue 6 months ago
Hello @BeitianMa, either there's a bit of confusion or I misunderstood your message.
The `delta` parameter affects the loss function, i.e. given $y$, $X$, and a coefficient $\beta$, it determines how the loss $L_\delta(\beta; X, y)$ is computed. You can see exactly how that loss is computed by considering https://github.com/JuliaAI/MLJLinearModels.jl/blob/27b377257a3de81b899f9fc616c36046fc3942be/src/loss-penalty/robust.jl#L9 and https://github.com/JuliaAI/MLJLinearModels.jl/blob/27b377257a3de81b899f9fc616c36046fc3942be/src/loss-penalty/robust.jl#L39-L45
When an individual residual is below $\delta$ in absolute value, the loss is quadratic; outside of that range, it is linear.
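That piecewise definition can be sketched as follows (a minimal illustration of the standard Huber loss, not the library's actual code):

```python
import numpy as np

def huber(r, delta):
    # quadratic inside [-delta, delta]; linear outside, with matching
    # value and slope at the breakpoint |r| = delta
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

huber(np.array([0.5, 3.0]), 1.0)  # -> [0.125, 2.5]
```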
If I understand your question correctly, you would like to determine $\delta$ dynamically:
this is not supported by this library, but it should be reasonably easy to do with a for loop, even if it's a bit ugly. You could use a bracketing method to find $\delta^\star$: start with a pair of very small and very large deltas, fit, split the interval in two, figure out which half to keep exploring, and so on until convergence.
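A rough sketch of that bracketing loop (names like `fit_huber` and `find_delta` are illustrative, and the fit here uses a generic `scipy.optimize.minimize` call rather than this library):

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(r, delta):
    # sum of Huber losses: quadratic for |r| <= delta, linear beyond
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta)).sum()

def fit_huber(X, y, delta):
    # illustrative Huber fit via generic optimisation (not MLJLinearModels)
    res = minimize(lambda b: huber_loss(y - X @ b, delta), np.zeros(X.shape[1]))
    return res.x

def find_delta(X, y, target=0.001, lo=1e-6, hi=1e6, tol=1e-3):
    # bisect on delta until roughly a `target` fraction of |residuals|
    # exceed it (larger delta -> fewer points flagged as outliers)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        r = y - X @ fit_huber(X, y, mid)
        if np.mean(np.abs(r) > mid) > target:
            lo = mid  # too many points flagged -> raise delta
        else:
            hi = mid
    return 0.5 * (lo + hi)

# tiny demo: gaussian noise, aim to flag ~5% of points
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=300)
delta = find_delta(X, y, target=0.05)
```

Each bisection step refits the model, so this is a fit-in-a-loop exactly as described above; it converges in roughly $\log_2((\text{hi}-\text{lo})/\text{tol})$ fits.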
Closing as resolved, as far as I can tell. Feel free to re-open.
The parameter `delta` seems to be an absolute value, i.e., the model will treat the error terms which exceed `delta` as outliers. My question is: how can I keep the 0.1% most extreme values as outliers? Relative breakpoint values may be more common and intuitive than absolute values.

I have also considered `HuberRegressor` from `sklearn` in Python. In `sklearn`, the corresponding parameter is `epsilon`, which means the errors which exceed `epsilon` times the standard deviation will be treated as outliers. It cannot solve my problem because when the error term is not from a normal distribution, the Z-value corresponding to the 99.9% quantile is uncertain. An alternative method from Stack Overflow is to iteratively search for the corresponding `epsilon`, which is obviously time-consuming for a large dataset.

So far, the only approach I have found that meets the requirements of my program is `sklearn`'s `GradientBoostingRegressor`: one can set the `loss` parameter to `"huber"` and specify the breakpoint as a quantile; it uses the algorithm from Friedman (2001). Tests indicate that it is rather fast.

Unfortunately, after looking through the `sklearn` source code, I discovered that the corresponding `HuberLoss` is a private class and cannot be integrated with any model other than the GBRT model.

Thanks in advance, and apologies for my poor English :)
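For reference, the `GradientBoostingRegressor` setup described above looks roughly like this (synthetic data; in sklearn, `alpha` is the quantile at which the huber loss switches from quadratic to linear):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# huber loss with the breakpoint set at the 99.9% residual quantile
model = GradientBoostingRegressor(loss="huber", alpha=0.999)
model.fit(X, y)
preds = model.predict(X)
```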