ParallelGSReg / GlobalSearchRegression.jl

Julia's HPC command for automatic feature/model selection using all-subset-regression approaches
Other
18 stars 3 forks source link

JuliaCon 2019 - Review #4

Closed Nosferican closed 4 years ago

Nosferican commented 4 years ago

I am opening this issue to track some comments and issues part of the the review process going on in https://github.com/JuliaCon/proceedings-review/issues/53.

Issues

data = RDatasets.dataset("Ecdat", "Crime") categorical!(data, findall([!(typeof(col) <: AbstractVector{<:Number}) for col in eachcol(data)])) @test_throws(MethodError, gsreg(join(names(data)[3:end], ' '), data) )


- [x] While I enjoy the flexible inputs for formula, using `StatsModels.@formula` would be nice for `poly(x, 2)`, interactions, and handling categorical variables.
- [x] Document how linearly dependent combinations are handled
- [x] Rather than `print` to the `io`, implement a `show` method for the result. (See [example](https://github.com/Nosferican/Econometrics.jl/blob/master/src/structs.jl#L89))
dpanigo commented 4 years ago

Dear @Nosferican Thank you very much for your help. For the first issue please set using GlobalSearchRegression after addprocs(#), i.e.:

using Distributed addprocs(2) using Test, RDatasets, GlobalSearchRegression data = RDatasets.dataset("Ecdat", "Crime")

model = gsreg("CRMRTE ~ Prb*", data, criteria = [:aicc, :bic], modelavg = true)

For the second one.... You are right, we cannot handle categorical variables directly yet.... Your solution is very flexible, we will try to include some functionality about this ASAP.

As for documentation, we are improving our io page right now. Thanks again

dpanigo commented 4 years ago

Dear @Nosferican Regarding the two last issues 1) Perfect collinearity returns an error.... But this will be modified. Instead of discard some random variable (as Stata does) we will provide the user with a list of perfectly correlated variables ASAP. 2) Thank you very much, we will change print by show for the result.

vmari commented 4 years ago

Hey @Nosferican I'm working with the show vs print fix. Can you check this line? There we're using the Base.show function with io as a param. Is this enough or we're missing something?

dpanigo commented 4 years ago

@Nosferican: One more clarification about your advice "While I enjoy the flexible inputs for formula, using StatsModels.@formula would be nice for poly(x, 2), interactions, and handling categorical variables." In order to deal with combinatorial issues and parallelism we need to implemente our own interaction function. This is included in our next package (https://github.com/ParallelGSReg/ModelSelection.jl)

Nosferican commented 4 years ago

Try, Base.show(io::IO, result::GSRegResult) = print(io, to_string(result))

I haven't had much experience with Distributed. It seems I ended up hitting https://github.com/JuliaLang/julia/issues/28781 and a couple issues, but seems fine now with the clarification. Might be good to er on the side of indicating a line or two on the Distributed framework for Julia. Maybe Andreas can shine some light on it (especially as it relates to the performance).

As for the handling of categorical variables, using StatsModels.modelcols should give you the matrix easy to-plug in. For interactions, you might want to check how to handle those if for example overloading the default.

For the multicollinearity, I thought it was currently silently dropping the linearly dependent features,

data = DataFrame(y = rand(100), x1 = rand(100))
data[!,:x2] .= 2 * data.x1
data[!,:x3] .= 0.5 * data.x1
    gsreg("y x1 x2 x3", data)
Nosferican commented 4 years ago

In the following case using the data from the main example in the issue,

model = gsreg("CRMRTE ~ Prb*",
              data,
              criteria = [:aicc, :bic],
              modelavg = true)
propertynames(model)
model.nobs # 630
model.time # nothing
model.parallel # nothing
model.criteria # [:aicc, :bic, :r2adj]

I am not sure if the model criteria is leaking the r2adj.

nicomzn commented 4 years ago

Dear @Nosferican , Thank you very much for all your suggestions. GlobalSearchRegression.jl has been updated. There is a new release (v1.0.4). We have:

  1. fixed the leaked r2adj bug;
  2. Included and error message for perfect multicollinearity;
  3. Write a new documentation with some additional details about using Distributed before GlobalSearchRegression.
  4. Included clear statements about limitations (i.e. categorical variables, string variables and multicollinearity) - some of these issues are going to be solved in our new package ModelSelection.jl -;
  5. Added the coverage code;
  6. Improved the runtest.jl file;
  7. Finally, fixed the show-print issue.

Thank you very much for all your help. Nicolas

Nosferican commented 4 years ago

Great! I will take a look at it when I get a chance in the next couple days. Happy to be of help.

Nosferican commented 4 years ago

There seems to still be an issue with this,

using RDatasets, GlobalSearchRegression
data = RDatasets.dataset("Ecdat", "Crime")
model = gsreg("CRMRTE ~ Prb*", data, criteria = [:aicc, :bic], modelavg = true)
KeyError: key :r2adj not found

The docs are good. I only noticed that gsreg is still without docstring. Would be good to add examples to it through jldoctest. I believe the v1.0.4 didn't make it to the registry. Does the repo have Registrator.jl / TagBot enabled?

dpanigo commented 4 years ago

@Nosferican, We have introduced all your advices in a new relase (v1.0.6).

We are only waiting for human interaction to register our v1.0.4 realease first.... but the v1.0.6 version is alrady available using "add https://github.com/ParallelGSReg/GlobalSearchRegression.jl" We hope it will also be registered in a few days! Thanks again José

Nosferican commented 4 years ago

Thanks for the update. Will check the changes early this week.

dpanigo commented 4 years ago

Thanks to you José

El lun., 21 oct. 2019 a las 1:00, José Bayoán Santiago Calderón (< notifications@github.com>) escribió:

Thanks for the update. Will check the changes early this week.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ParallelGSReg/GlobalSearchRegression.jl/issues/4?email_source=notifications&email_token=AI3QGRVPZ6MKAWBMB3YZWN3QPUSPRA5CNFSM4I37G66KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEBY7O2I#issuecomment-544339817, or unsubscribe https://github.com/notifications/unsubscribe-auth/AI3QGRRXAJDUSXUZ54IRWQ3QPUSPRANCNFSM4I37G66A .

-- Demian T. Panigo https://www.researchgate.net/profile/Demian_Panigo Lic. en Economía, UNLP, Master en Cs Sociales, UBA, Doctor en Economía, EHESS-ENS (Paris) Investigador Independiente del CONICET Docente investigador de la UNM, la UNQ, la UNDAV y la UNLP.

Nosferican commented 4 years ago

Seems good. Hope the feedback was useful.

nicomzn commented 4 years ago

@Nosferican Jose the last release is already registered. We can use the updated package just with: pkg > add GlobalSearchRegression