JuliaAI / MLJLinearModels.jl

Generalized Linear Regressions Models (penalized regressions, robust regressions, ...)
MIT License

Testing quantreg vs MLJLinearModels #147

Closed luboshanus closed 1 year ago

luboshanus commented 1 year ago

Hi,

I was somewhat forced to reimplement quantile regression as a linear programming (LP) model, because the results obtained with this package did not match those from MATLAB code I had from a paper. The MATLAB code used LP, and at first I did not want to code the LP in Julia, so I used MLJLinearModels with its option of an L1 penalty for quantile regression. Nonetheless, the differing results pushed me to rewrite quantile regression as an LP, which I also wanted to compare against MLJLinearModels' simple NoPenalty QuantileRegression.

Here is an example from Stack Exchange using R, followed by the same fit using MLJLinearModels: https://stats.stackexchange.com/questions/384909/formulating-quantile-regression-as-linear-programming-problem
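For reference, the LP formulation discussed at that link can be sketched as follows. This is a minimal illustration in Python with scipy (not the MATLAB or JuMP code from the thread; `quantile_lp` is a made-up helper name): the residuals y − Xβ are split into a positive part u and a negative part v, and the pinball loss τ·Σuᵢ + (1−τ)·Σvᵢ is minimized subject to Xβ + u − v = y, u ≥ 0, v ≥ 0.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_lp(X, y, tau):
    """Quantile regression as a linear program: split the residuals
    y - X @ beta into a positive part u and a negative part v, then
    minimize the pinball loss tau*sum(u) + (1 - tau)*sum(v)."""
    n, p = X.shape
    # decision variables: [beta (free), u >= 0, v >= 0]
    c = np.concatenate([np.zeros(p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])  # X @ beta + u - v = y
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]
```

With an intercept column included in `X`, this reproduces the usual LP solution (for τ = 0.5 and an intercept-only model it recovers the sample median).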

This is a simple quantile regression on these data, not the LP model. In R, using quantreg:

> library(quantreg)
> base <- read.table("http://freakonometrics.free.fr/rent98_00.txt", header = TRUE)
> attach(base)
The following objects are masked from base (pos = 3):

    area, eboden, glocation, invarea, nkitchen, pkitchen, rent_euro,
    rentsqm_euro, tlocation, year01, yearc, yearc2, yearc3

The following objects are masked from base (pos = 7):

    area, eboden, glocation, invarea, nkitchen, pkitchen, rent_euro,
    rentsqm_euro, tlocation, year01, yearc, yearc2, yearc3

> tau <- 0.3
> rq(rent_euro ~ area + yearc, tau = tau, data = base)
Call:
rq(formula = rent_euro ~ area + yearc, tau = tau, data = base)

Coefficients:
 (Intercept)         area        yearc 
-5542.503252     3.978135     2.887234 

Degrees of freedom: 4571 total; 4568 residual

In Julia:

julia> using CSV, DataFrames

julia> dataset = CSV.read(download("http://freakonometrics.free.fr/rent98_00.txt"), DataFrame)
4571×13 DataFrame
  Row │ rent_euro  rentsqm_euro  area   yearc    glocation  tlocation  nkitchen  pkitchen  eboden  year01  yearc2   yearc3   invarea   
      │ Float64    Float64       Int64  Float64  Int64      Int64      Int64     Int64     Int64   Int64   Int64    Float64  Float64   
──────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    1 │   120.614       3.44611     35   1939.0          0          0         0         0       1       0  3759721   7.29e9  0.0285714
    2 │   435.672       4.18915    104   1939.0          0          0         0         0       0       0  3759721   7.29e9  0.0096154
    3 │   354.683      12.2304      29   1971.0          1          0         0         0       1       0  3884841   7.66e9  0.0344828
    4 │   282.08        7.23281     39   1972.0          1          0         0         0       1       0  3888784   7.67e9  0.025641
    5 │   804.824       8.29716     97   1985.0          0          0         0         0       0       0  3940225   7.82e9  0.0103093
  ⋮   │     ⋮           ⋮          ⋮       ⋮         ⋮          ⋮         ⋮         ⋮        ⋮       ⋮        ⋮        ⋮         ⋮
 4567 │   430.487       7.97104     54   1966.0          0          0         0         0       0       1  3865156   7.6e9   0.0185185
 4568 │   396.384       7.07628     56   1971.0          1          0         0         0       0       1  3884841   7.66e9  0.0178571
 4569 │   286.722       4.5505      63   1948.0          1          0         0         0       0       1  3794704   7.39e9  0.015873
 4570 │   358.416      11.2024      32   1966.0          0          0         0         0       1       1  3865156   7.6e9   0.03125
 4571 │   673.842       9.90884     68   1960.0          1          0         0         0       0       1  3841600   7.53e9  0.0147059
                                                                                                                      4561 rows omitted

julia> using MLJLinearModels

julia> tau = 0.3
0.3

julia> mlj_qreg = MLJLinearModels.QuantileRegression(tau; fit_intercept=true, scale_penalty_with_samples=false, penalty=:none)
GeneralizedLinearRegression{RobustLoss{QuantileRho{0.3}}, NoPenalty}
  loss: RobustLoss{QuantileRho{0.3}}
  penalty: NoPenalty NoPenalty()
  fit_intercept: Bool true
  penalize_intercept: Bool false
  scale_penalty_with_samples: Bool false

julia> mlj_beta = MLJLinearModels.fit(mlj_qreg, Matrix(dataset[!,[:area, :yearc]]), dataset[!,:rent_euro])
3-element Vector{Float64}:
  5.566200326965212
  0.07844799525061612
 -0.0005985624309847961

Perhaps this helps as an idea for tests or a comparison against quantreg. Could you please verify, or help figure out, what the problem with the results is? The coefficients are on a different scale, and that is not the only issue. :) Thanks

luboshanus commented 1 year ago

I just tried it via MLJ, but that uses the same setup.

using CSV, DataFrames
dataset = CSV.read(download("http://freakonometrics.free.fr/rent98_00.txt"), DataFrame)

x0 = [ones(size(dataset,1)) Matrix(dataset[!,[:area, :yearc]])]
y0 = dataset[!,:rent_euro]
df_mlj = DataFrames.DataFrame([y0 x0],[:rent_euro, :intc, :area, :yearc])
ym, Xm = unpack(df_mlj, ==(:rent_euro), in([:intc, :area, :yearc])) #; rng=123

using MLJ
QuantileRegressor = @load QuantileRegressor pkg=MLJLinearModels
mach = fit!(machine(QuantileRegressor(;delta=0.3, fit_intercept=false, penalty=:none), Xm, ym))
fitted_params(mach)

Fitted parameters:

julia> fitted_params(mach)
(coefs = [:intc => -0.000598562430985105, :area => 5.566200326965857, :yearc => 0.07844799525059157],
 intercept = nothing,)

This is perhaps no surprise, since these two MLJ interfaces should wrap the same model.

However, using QuantileRegressions, the result is the same as from the LP model and from R:

# _ test with QuantileRegressions
using QuantileRegressions
ResultQR = qreg(@formula(rent_euro ~ area + yearc), df_mlj, .3)

output:

julia> ResultQR = qreg(@formula(rent_euro ~ area + yearc), df_mlj, .3)
StatsModels.TableRegressionModel{QuantileRegressions.QRegModel, Matrix{Float64}}

rent_euro ~ 1 + area + yearc

Coefficients:
─────────────────────────────────────────────────────────
             Quantile     Estimate    Std.Error   t value
─────────────────────────────────────────────────────────
(Intercept)       0.3  -5542.5      169.111      -32.7743
area              0.3      3.97813    0.0838414   47.4483
yearc             0.3      2.88723    0.085967    33.5854
─────────────────────────────────────────────────────────

So, could you help with the MLJ quantile regression estimation? Thank you.

tlienart commented 1 year ago

Hello!

Thanks for the report. I investigated this a little, and a core part of the problem seems to be around tau. I'm running out of time to fully work through this, so I'm just jotting down notes here and will think about a fix later. It seems tau is on the "wrong side", i.e. setting 0.3 actually amounts to 0.7 and vice versa.

To debug this kind of thing (explaining here as it might be of interest to you), a good approach is to check the objective function associated with the regression object:

y = Vector(dataset[!,:rent_euro])
X = Matrix(dataset[!,[:area, :yearc]])

qr = QuantileRegression(0.3; penalty=:none)
qr2 = QuantileRegression(0.7; penalty=:none)
obj_qr  = objective(qr, X, y)
obj_qr2 = objective(qr2, X, y)

mlj_beta = [
    5.566200326965212,
    0.07844799525061612,
   -0.0005985624309847961
]

mlj_beta2 = fit(qr2, X, y)

obj_qr(mlj_beta)   # 381_972
obj_qr(mlj_beta2)  # 226_639
obj_qr2(mlj_beta)  # 245_533
obj_qr2(mlj_beta2) # 380_221

This shows that mlj_beta2 is a better fit for obj_qr than mlj_beta, and vice versa, which of course doesn't make sense. So there's a swap somewhere that needs to be fixed.
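One way to see why a tau swap shows up exactly like this: the quantile (pinball) loss satisfies the mirror identity ρ_τ(r) = ρ_{1−τ}(−r), so a fit scored under τ behaves like its sign-flipped counterpart scored under 1 − τ. A minimal Python sketch of the check (the `pinball` helper is illustrative, not the package's internal loss):

```python
import numpy as np

def pinball(residuals, tau):
    # rho_tau(r) = r * (tau - 1[r < 0]):
    # positive residuals are weighted by tau, negative ones by (1 - tau)
    r = np.asarray(residuals, dtype=float)
    return float(np.sum(r * (tau - (r < 0))))
```

The identity `pinball(r, tau) == pinball(-r, 1 - tau)` is consistent with the symptom above: if the sign convention inside the loss is flipped, the fit obtained for 0.7 scores better under the 0.3 objective and vice versa.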

As a note, if you compute the same objective functions with the coefficients obtained via R, you get:

obj_qr(r_beta)    # 207_551
obj_qr2(r_beta)  # 354_518

which makes sense. Note that the objective obtained with MLJLinearModels is still about 10% worse than what you get with R; this may decrease further once I've fixed the symmetry issue. I'm somewhat less worried about 10%, since the solver used is completely different (certainly less worried than about the 84% gap we currently have), though I'd prefer it to be just a few percent. We might get there with standardisation, solver parameter tuning, etc.

luboshanus commented 1 year ago

Thanks for taking the time to look at it. I understand that one should look at the objective; still, these data are quite trivial to solve, and MLJ's QuantileRegression should produce the same coefficients even for non-standardised data, right?

I have switched the tau and provide the output here. The coefficients are still on a different scale. I don't know exactly what is happening where in the routine. Nonetheless, the objective function is one thing; the benchmark data/output is another.

Hope you will find it easily in the solver/setup of Quantile regression.

tau2 = 0.7
mlj_qreg2 = MLJLinearModels.QuantileRegression(tau2; fit_intercept=true, scale_penalty_with_samples=false, penalty=:none)
mlj_beta2 = MLJLinearModels.fit(mlj_qreg2, Matrix(dataset[!,[:area, :yearc]]), dataset[!,:rent_euro])

Coefs:

julia> mlj_beta2 = MLJLinearModels.fit(mlj_qreg2, Matrix(dataset[!,[:area, :yearc]]), dataset[!,:rent_euro])
3-element Vector{Float64}:
  3.2583091596534843
  0.07749778937937656
 -0.0004959079331218055

tlienart commented 1 year ago

To be clear (maybe I wasn't in my previous message): what I reported does show that there is an issue in the package which should be fixed (the tau swap). After fixing it you may still get fairly different results, and that is not necessarily a bug (especially if the objective function at the two points is roughly the same).

The point about standardising etc. has to do with how similar you can expect the objective values to be when two different solvers are interrupted by their own stopping rules. If you have a nice convex function and two perfectly tuned solvers, they will/should converge to the same spot; what I'm saying is that this almost certainly isn't the case here (even though the function is obviously convex).

My point is just that even if everything is set up properly with good defaults, you may end up stopping in a region which is not a minimum but looks like one to the solver (e.g. a very flat valley). The parameters at that point may be completely different (including in magnitude) from the ones you'd get with some other method (e.g. the LP here). You see this all the time with neural nets, for instance (though there the objective is of course non-convex). Standardising helps make the objective function "nicer" and can reduce the gap between close-to-optimal points.
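As an illustration of the standardisation point (a generic sketch, not MLJLinearModels code; `ols_fit` and `standardize_fit` are made-up helper names): you can fit on z-scored covariates and map the coefficients back to the original scale afterwards, which leaves the fitted model unchanged but gives the solver a much better-conditioned objective.

```python
import numpy as np

def ols_fit(X, y):
    """Any fitter with signature (X, y) -> (coefs, intercept); plain
    least squares is used here only to keep the example self-contained."""
    A = np.column_stack([X, np.ones(len(y))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w[:-1], w[-1]

def standardize_fit(X, y, fit=ols_fit):
    # z-score each covariate, fit, then undo the scaling on the coefficients
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    coefs_s, intercept_s = fit((X - mu) / sigma, y)
    coefs = coefs_s / sigma
    intercept = intercept_s - coefs @ mu
    return coefs, intercept
```

The back-transformation works because y ≈ X_s · c_s + b_s with X_s = (X − μ)/σ is algebraically the same model as y ≈ X · (c_s/σ) + (b_s − (c_s/σ)·μ).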

luboshanus commented 1 year ago

Thanks for your comment. I do understand what you write about the optimization. I wish this package's algorithm would converge to results close to the LP ones, since the computation here is much faster than JuMP's LP; maybe that is the source of the different results. I look forward to seeing how the optimization behaves once the tau issue is changed. Thanks anyway!

tlienart commented 1 year ago

OK, so the issue with tau is fixed in #148, which is good. I'll merge and do a patch release.

For the comparison of QR, here are a couple of notes:

This just confirms our earlier discussion: in this instance, it looks like the method stops in a non-optimal zone. In fact, passing solver=LBFGS(optim_options=Optim.Options(show_trace=true)) shows that the iterates get stuck. Scaling the covariates by n (which is probably a good idea anyway) gets you closer (~3.8% worse compared to the IP solution).

What could be interesting is to check how well the results from either method generalise (e.g. split the data 80/20, train on the 80%, predict on the 20%, and see how well each does).
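That check could be sketched roughly as follows (an illustrative Python sketch, not package code; `holdout_pinball` and the baseline "fitter" are made-up names, and any real comparison would plug in the two actual estimators):

```python
import numpy as np

def holdout_pinball(X, y, tau, fit, seed=0):
    """Fit on a random 80% split and return the mean pinball loss
    of the fitted model on the held-out 20%."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(0.8 * len(y))
    tr, te = idx[:cut], idx[cut:]
    coefs, intercept = fit(X[tr], y[tr], tau)
    resid = y[te] - (X[te] @ coefs + intercept)
    return float(np.mean(resid * (tau - (resid < 0))))

# trivial baseline "fitter": ignore X, predict the empirical tau-quantile of y
quantile_baseline = lambda X, y, tau: (np.zeros(X.shape[1]), np.quantile(y, tau))
```

Running this once per fitter (LP-based vs. gradient-based) on the same split would show whether the gap in training objective actually translates into a gap in out-of-sample pinball loss.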

I hope this clarifies what's going on and is helpful! I'll close this now, as I don't think there's anything left to fix in the package, but feel free to reopen if you still have questions.

ablaom commented 1 year ago

Just want to add that I appreciate the time spent by @luboshanus and @tlienart on this issue.