JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

LogisticClassifier pkg = MLJLinearModels computes a number of coefficients but not the same number of mean_and_std_given_feature #492

Closed drcxcruz closed 4 years ago

drcxcruz commented 4 years ago

Describe the bug

I am using LogisticClassifier pkg = MLJLinearModels. I would expect each coefficient to have a corresponding mean_and_std_given_feature. However, mean_and_std_given_feature is missing for the coefficients of one-hot encoded features.

To Reproduce

Let us use the same data set as in https://github.com/alan-turing-institute/MLJ.jl/issues/489#issuecomment-612882956.

# Y is already a vector of 0s and 1s. Only using two columns of X for testing: :ethnicity and :score.

yc = categorical(y[:, 1])

@pipeline LogisticRegPipe(
    std = Standardizer(),
    hot = OneHotEncoder(),
    reg = LogisticClassifier(),
)

LogisticModel = machine(LogisticRegPipe(), X, yc)
fit!(LogisticModel)
fp = fitted_params(LogisticModel).fitted_params

ŷ = MLJ.predict(LogisticModel, X)
yhatResponse = [pdf(p, maximum(y)) for p in ŷ]
residuals = y .- yhatResponse

coefs = fp[1].coefs
mean_and_std_given_feature = fp[3].mean_and_std_given_feature

println(
    "fp[3].mean_and_std_given_feature ",
    fp[3].mean_and_std_given_feature,
    "   ",
    typeof(fp[3].mean_and_std_given_feature),
)
println("names X ", names(X))
println(coefs)

#fp[3].mean_and_std_given_feature Dict(:score => (53.443939401645856, 8.05828188259464))   Dict{Symbol,Tuple{Float64,Float64}}
#names X [:ethnicity, :score]
#coefs [-0.43851241719212836, -0.12750153569721762, 0.5660139626794835, 0.9966932091483689] typeof Array{Float64,1}

Expected behavior

I would expect every coefficient to have a corresponding mean_and_std_given_feature, even for one-hot encoded variables. It should also be clear which coefficient goes with which mean_and_std_given_feature for one-hot encoded features.

To get around this issue, I one-hot encoded X myself before calling MLJ.

julia> schema(Xhot)
┌────────────────────┬─────────┬────────────┐
│ _.names            │ _.types │ _.scitypes │
├────────────────────┼─────────┼────────────┤
│ gender             │ Float64 │ Continuous │
│ score              │ Float64 │ Continuous │
│ fcollege           │ Float64 │ Continuous │
│ mcollege           │ Float64 │ Continuous │
│ home               │ Float64 │ Continuous │
│ urban              │ Float64 │ Continuous │
│ unemp              │ Float64 │ Continuous │
│ wage               │ Float64 │ Continuous │
│ tuition            │ Float64 │ Continuous │
│ income             │ Float64 │ Continuous │
│ region             │ Float64 │ Continuous │
│ ethnicity_afam     │ Float64 │ Continuous │
│ ethnicity_hispanic │ Float64 │ Continuous │
└────────────────────┴─────────┴────────────┘

99×13 DataFrame
│ Row │ gender  │ score   │ fcollege │ mcollege │ home    │ urban   │ unemp   │ wage    │ tuition │ income  │ region  │ ethnicity_afam │ ethnicity_hispanic │
│     │ Float64 │ Float64 │ Float64  │ Float64  │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64        │ Float64            │
├─────┼─────────┼─────────┼──────────┼──────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────────────┼────────────────────┤
│ 1   │ 0.0     │ 39.15   │ 0.0      │ 0.0      │ 0.0     │ 0.0     │ 6.2     │ 8.09    │ 0.88915 │ 0.0     │ 0.0     │ 0.0            │ 0.0                │
│ 2   │ 1.0     │ 48.87   │ 1.0      │ 0.0      │ 0.0     │ 0.0     │ 6.2     │ 8.09    │ 0.88915 │ 1.0     │ 0.0     │ 0.0            │ 0.0                │
│ 3   │ 0.0     │ 48.74   │ 1.0      │ 0.0      │ 0.0     │ 0.0     │ 6.2     │ 8.09    │ 0.88915 │ 1.0     │ 0.0     │ 0.0            │ 0.0                │
│ 4   │ 0.0     │ 40.4    │ 1.0      │ 0.0      │ 0.0     │ 0.0     │ 6.2     │ 8.09    │ 0.88915 │ 1.0     │ 0.0     │ 1.0            │ 0.0                │

But with my one-hot encoded Xhot, the fit! call generates the error below. It seems to incorrectly convert the float columns that have very few unique values to Int. The score column by itself fits fine, but adding the gender column breaks the fit call.

└ No new code loaded.
┌ Info: A model type "LogisticClassifier" is already loaded.
└ No new code loaded.
[ Info: Training Machine{LogisticRegPipe3} @ 1…87.
train_args = MLJBase.Source[Source{:input} @ 4…34, Source{:target} @ 1…44]
mach.model = LogisticClassifier @ 9…85
[ Info: Training NodalMachine{LogisticClassifier} @ 1…87.
ERROR: LoadError: MethodError: no method matching fit(::MLJLinearModels.GeneralizedLinearRegression{MLJLinearModels.LogisticLoss,MLJLinearModels.ScaledPenalty{MLJLinearModels.LPPenalty{2}}}, ::Array{Any,2}, ::Array{Int64,1}; solver=MLJLinearModels.LBFGS())
Closest candidates are:
  fit(::MLJLinearModels.GeneralizedLinearRegression, ::AbstractArray{#s54,2} where #s54<:Real, ::AbstractArray{#s12,1} where #s12<:Real; solver) at C:\Users\BCP.juliapro\JuliaPro_v1.4.0-1\packages\MLJLinearModels\4VdUV\src\fit\default.jl:38
Stacktrace:
 [1] fit(::LogisticClassifier, ::Int64, ::DataFrame, ::CategoricalArray{Float64,1,UInt32,Float64,CategoricalValue{Float64,UInt32},Union{}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.0-1\packages\MLJLinearModels\4VdUV\src\mlj\interface.jl:57
 [2] fit!(::NodalMachine{LogisticClassifier}; rows::Function, verbosity::Int64, force::Bool) at C:\Users\BCP.juliapro\JuliaPro_v1.4.0-1\packages\MLJBase\ESDzL\src\machines.jl:183
 [3] fit!(::Node{NodalMachine{LogisticClassifier}}; rows::Nothing, verbosity::Int64, force::Bool) at C:\Users\BCP.juliapro\JuliaPro_v1.4.0-1\packages\MLJBase\ESDzL\src\composition\networks.jl:339
 [4] (::MLJBase.var"#_fit#133"{Node{NodalMachine{LogisticClassifier}},Tuple{LogisticClassifier},MLJBase.Source{:input}})(::LogisticRegPipe3, ::Int64, ::DataFrame, ::CategoricalArray{Float64,1,UInt32,Float64,CategoricalValue{Float64,UInt32},Union{}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.0-1\packages\MLJBase\ESDzL\src\composition\composites.jl:223
 [5] fit(::LogisticRegPipe3, ::Int64, ::DataFrame, ::CategoricalArray{Float64,1,UInt32,Float64,CategoricalValue{Float64,UInt32},Union{}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.0-1\packages\MLJBase\ESDzL\src\composition\composites.jl:384
 [6] fit!(::Machine{LogisticRegPipe3}; rows::Nothing, verbosity::Int64, force::Bool) at C:\Users\BCP.juliapro\JuliaPro_v1.4.0-1\packages\MLJBase\ESDzL\src\machines.jl:183
 [7] fit! at C:\Users\BCP.juliapro\JuliaPro_v1.4.0-1\packages\MLJBase\ESDzL\src\machines.jl:146 [inlined]

Additional context

I like MLJ a lot and we are implementing some analytics with MLJ. It is time to move away from R.

Thanks!

Versions

ablaom commented 4 years ago

Thanks for posting and thanks for the feedback!

The mean_and_std_given_feature dictionary to which you refer actually has nothing to do with the classifier. It is the fitted parameters of the Standardizer (only :score was standardized because this was the only Continuous feature before one-hot encoding). The named tuple FP = fitted_params(LogisticModel) has two keys, machines and fitted_params (which is true for any pipeline model). FP.fitted_params[i] is the fitted params for FP.machines[i], for each i. The only learned parameters for the classifier are the coefficients and intercept.
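To make the pairing concrete, here is a minimal sketch (assuming the LogisticModel machine from your snippet has already been fitted):

FP = fitted_params(LogisticModel)

# FP.fitted_params[i] belongs to FP.machines[i], so zipping them shows
# which pipeline component learned which parameters:
for (mach, params) in zip(FP.machines, FP.fitted_params)
    println(mach, " => ", params)
end

# the classifier entry holds only coefs and intercept;
# the Standardizer entry holds mean_and_std_given_feature.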

This has been a source of confusion before. Perhaps we should make the fitted_params of composite models a dictionary, keyed on machines, with the learned params as values?

Additional information that is not strictly part of the learned parameters goes in the machine's report, R = report(LogisticModel). This is a named tuple with machines and reports as keys and follows the same pattern. I see that for LogisticModel nothing is returned in the report. If you have information you would like there (e.g. deviance, standard errors, and so forth, for each coefficient) you should open an issue at MLJLinearModels requesting this.

cc @tlienart

drcxcruz commented 4 years ago

Thanks for your time and expertise. I am still a bit puzzled about it.

I am trying to translate the following R code into MLJ. I am using LinearRegressor pkg = MLJLinearModels for the Gaussian case, which is a linear regression, and LogisticClassifier pkg = MLJLinearModels for the binomial case, which is a binary classifier. The glm function in R computes the standard deviation of the point estimate for each coefficient, and it does so for both float and dummy-variable columns of X. Is there any way to obtain the same information in MLJ as in "coefficientsvar <- summary(linm)$coefficients[,2][-1]" in R? If not, as you suggested, should I open an issue at MLJLinearModels requesting this?

Please let me know of any comments and/or corrections to my understanding. Thank you!!!

linm <- glm( Y ~ X -1, family= if(is.factor(Y)) "binomial" else "gaussian", control=glm.control(maxit=10,epsilon=10^(-6)))
coefficients <- coef(linm)[-1]
coefficientsvar <- summary(linm)$coefficients[,2][-1]
pred <- predict(linm, type="response")
ablaom commented 4 years ago

Thanks for your time and expertise. I am still a bit puzzled about it.

I am trying to translate the following R code into MLJ. I am using LinearRegressor pkg = MLJLinearModels for the Gaussian case, which is a linear regression, and LogisticClassifier pkg = MLJLinearModels for the binomial case, which is a binary classifier. The glm function in R computes the standard deviation of the point estimate for each coefficient, and it does so for both float and dummy-variable columns of X. Is there any way to obtain the same information in MLJ as in "coefficientsvar <- summary(linm)$coefficients[,2][-1]" in R? If not, as you suggested, should I open an issue at MLJLinearModels requesting this?

Yes, precisely. Just copy and paste the above in an issue at MLJLinearModels. Thanks

I have closed the issue, but feel free to continue the discussion.

tlienart commented 4 years ago

So, no, MLJLinearModels does not offer this, partly on purpose. MLJLinearModels makes very few assumptions about the context and just serves as a tool for solving optimization problems (an inverse-problem approach). Returning variances makes sense if you make statistical assumptions about your error model, etc.; that is the realm of statistics, which MLJLinearModels tries to stay clear of. GLM.jl does provide this information, however (and supports binary classification), so you might want to try that instead?

Now on a more positive note, there could be some secondary functions in MLJLM that return these "variance estimates", but I'm against doing this by default 😄

drcxcruz commented 4 years ago

How long would it take to implement and release a secondary function for the coefficient variances? I would wait for it if it is not too long. I would rather use MLJ than GLM.jl, for consistency; my project is already using other MLJ models. I think a few secondary functions would increase the usability of MLJ, perhaps resulting in more MLJ users. Thank you for your time.

tlienart commented 4 years ago

@drcxcruz I apologise for the confusion, you can already use GLM.jl from MLJ and retrieve these elements.

tlienart commented 4 years ago

LogisticClassifier pkg=GLM will do what you want (you'll get the deviation etc in the report)

Edit: oops, nope. It's LinearBinaryClassifier pkg=GLM (you can specify a Logit but also a Probit link).

drcxcruz commented 4 years ago

hi again,

Just double checking: would LinearBinaryClassifier pkg=GLM do both logistic classification and linear regression? If yes, how do we specify which of the two in MLJ? Is it via a pipeline? Or does the scientific type of the target Y determine which of the two (which would be nicer)? Or do we use LinearRegressor pkg = GLM for the linear regression case?

I do not see this case in the MLJ Lab tutorials. Let's cut a deal: if you explain it here in some detail with MLJ code, I will create a Jupyter lab explaining how to use GLM from MLJ. If you like parts of my lab, you can add it to the official MLJ Lab tutorials document. It is a simple but fundamental example that should be in the official MLJ Lab tutorials document.

thank you so much

tlienart commented 4 years ago

or do we use LinearRegressor pkg = GLM for the linear regression case.

yes

If you explain it here in some detail with MLJ code

The code will be the same as for MLJLinearModels with a single difference: each entry of the output of predict will be a distribution object that can be sampled. So if you feed it a test matrix X of size 10 × p, it will return a vector of 10 elements, each of which is a Distribution object.

That Distribution object can then be queried for things like mean and variance.
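For instance (a minimal sketch for the Gaussian case, assuming mach is a fitted LinearRegressor pkg=GLM machine and Xtest a compatible table, so each prediction is a Normal distribution):

using Distributions   # for mean, var, pdf on the predicted distributions

ŷ = MLJ.predict(mach, Xtest)   # vector of Normal distribution objects, one per row
mean(ŷ[1])                     # expected value of the first prediction
var(ŷ[1])                      # its variance
mean.(ŷ)                       # point predictions for every test row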

Now, in terms of fit information, GLM returns more things: https://github.com/alan-turing-institute/MLJModels.jl/blob/8b9f8ed89d5ca374f493ce83d6030dfc9cb8074b/src/GLM.jl#L60-L63

So once you've fitted a machine with

fit!(mach, rows=train)

you can recover the report with r = report(mach) and query r.deviance, r.dof_residual, r.stderror, etc. These you can use for whatever you want (e.g. if you're feeling lucky, use them for so-called "significance").

Edit: see also https://juliastats.org/GLM.jl/dev/examples/#Linear-regression-1 and https://juliastats.org/GLM.jl/dev/examples/#Probit-regression-1; MLJ simply wraps this and puts the auxiliary information in the report.
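Concretely, something like this (a minimal sketch, assuming mach is a fitted machine wrapping LinearBinaryClassifier pkg=GLM or LinearRegressor pkg=GLM; the field names are those listed in the wrapper linked above):

r = report(mach)

r.deviance       # model deviance
r.dof_residual   # residual degrees of freedom
r.stderror       # standard error of each coefficient estimate
r.vcov           # variance-covariance matrix of the coefficient estimates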

drcxcruz commented 4 years ago

hi,

First of all, sorry for all the trouble. You guys are very knowledgeable and I appreciate your time.

I started my Jupyter lab notebook. In the lab, we are using the CollegeDistance data set from the AER package in R. I started the lab with the logistic case. Also, I read the GLM examples you provided and they are helpful, thank you.

How do we indicate in the MLJ pipeline that the distribution should be Binomial()? Is the distribution defined by MLJ by default as part of @load LinearBinaryClassifier pkg=GLM?

I tried a few things but I cannot clear the error below. The discussion https://discourse.julialang.org/t/glm-jl-logisticregression-errors-matrix-is-not-positive-definite-cholesky-factorization-failed/28981/7 talks about the GLM error. It says "Yes, collinearity can be a problem for nonlinear models, including logistic regression." However, I was not expecting any issues using such "clean" data from AER; I do not think the data set has collinearity issues. I was able to run both logistic regression and linear regression on the same data set using LogisticClassifier pkg = MLJLinearModels and LinearRegressor pkg = MLJLinearModels. Thus, I suspect I am missing an MLJ pipeline input argument in the code below.

using Queryverse, MLJ, CategoricalArrays
@load LinearRegressor pkg = GLM
@load LinearBinaryClassifier pkg = GLM

X = copy(dfX)

gender | ethnicity | score | fcollege | mcollege | home | urban | unemp | wage
male   | other     | 39.15 | yes      | no       | yes  | yes   | 6.2   | 8.09
female | other     | 48.87 | no       | no       | yes  | yes   | 6.2   | 8.09
male   | other     | 48.74 | no       | no       | yes  | yes   | 6.2   | 8.09
male   | afam      | 40.4  | no       | no       | yes  | yes   | 6.2   | 8.09
...

y = copy(dfYbinary)

     Int64
  1  0
  2  0
  3  0
  4  0
  5  0
  6  0
  7  0
  8  0
  9  0
 10  0
 11  0
 12  0
 13  0
 14  1
 15  0
 16  0
 17  0
 18  0
 19  1
 20  1
 ...

@pipeline LogisticRegPipe(
    std = Standardizer(),
    hot = OneHotEncoder(),
    reg = LinearBinaryClassifier(),
    ####### Distribution = Binomial()
)

coerce!(X, autotype(X, :string_to_multiclass))
yc = categorical(y[:, 1])
LogisticModel = machine(LogisticRegPipe(), X, yc)
fit!(LogisticModel)

train_args = MLJBase.Source{:input}[Source{:input} @ 8…98]
mach.model = Standardizer @ 1…43
┌ Info: Training Machine{LogisticRegPipe} @ 1…34.
└ @ MLJBase C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\machines.jl:182
train_args = Node{NodalMachine{Standardizer}}[Node{NodalMachine{Standardizer}} @ 7…22]
mach.model = OneHotEncoder @ 8…56
train_args = AbstractNode[Node{NodalMachine{OneHotEncoder}} @ 1…43, Source{:target} @ 1…44]
mach.model = LinearBinaryClassifier{LogitLink} @ 1…24
┌ Info: Training NodalMachine{Standardizer} @ 6…81.
└ @ MLJBase C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\machines.jl:182
┌ Info: Training NodalMachine{OneHotEncoder} @ 2…78.
└ @ MLJBase C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\machines.jl:182
┌ Info: Spawning 2 sub-features to one-hot encode feature :gender.
└ @ MLJModels C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\builtins\Transformers.jl:691
┌ Info: Spawning 3 sub-features to one-hot encode feature :ethnicity.
└ @ MLJModels C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\builtins\Transformers.jl:691
┌ Info: Spawning 2 sub-features to one-hot encode feature :fcollege.
└ @ MLJModels C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\builtins\Transformers.jl:691
┌ Info: Spawning 2 sub-features to one-hot encode feature :mcollege.
└ @ MLJModels C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\builtins\Transformers.jl:691
┌ Info: Spawning 2 sub-features to one-hot encode feature :home.
└ @ MLJModels C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\builtins\Transformers.jl:691
┌ Info: Spawning 2 sub-features to one-hot encode feature :urban.
└ @ MLJModels C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\builtins\Transformers.jl:691
┌ Info: Spawning 2 sub-features to one-hot encode feature :income.
└ @ MLJModels C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\builtins\Transformers.jl:691
┌ Info: Spawning 2 sub-features to one-hot encode feature :region.
└ @ MLJModels C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\builtins\Transformers.jl:691
┌ Info: Training NodalMachine{LinearBinaryClassifier{LogitLink}} @ 1…71.
└ @ MLJBase C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\machines.jl:182
PosDefException: matrix is not positive definite; Cholesky factorization failed.

Stacktrace:
 [1] checkpositivedefinite at C:\Users\julia\AppData\Local\Julia-1.4.1\share\julia\stdlib\v1.4\LinearAlgebra\src\factorization.jl:18 [inlined]
 [2] cholesky!(::LinearAlgebra.Hermitian{Float64,Array{Float64,2}}, ::Val{false}; check::Bool) at C:\Users\julia\AppData\Local\Julia-1.4.1\share\julia\stdlib\v1.4\LinearAlgebra\src\cholesky.jl:226
 [3] cholesky! at C:\Users\julia\AppData\Local\Julia-1.4.1\share\julia\stdlib\v1.4\LinearAlgebra\src\cholesky.jl:225 [inlined] (repeats 2 times)
 [4] GLM.DensePredChol(::Array{Float64,2}, ::Bool) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\GLM\6V3fS\src\linpred.jl:107
 [5] cholpred at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\GLM\6V3fS\src\linpred.jl:117 [inlined] (repeats 2 times)
 [6] fit(::Type{GeneralizedLinearModel}, ::Array{Float64,2}, ::Array{Int64,1}, ::Bernoulli{Float64}, ::LogitLink; dofit::Bool, wts::Array{Int64,1}, offset::Array{Int64,1}, fitargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{,Tuple{}}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\GLM\6V3fS\src\glmfit.jl:462
 [7] fit(::Type{GeneralizedLinearModel}, ::Array{Float64,2}, ::Array{Int64,1}, ::Bernoulli{Float64}, ::LogitLink) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\GLM\6V3fS\src\glmfit.jl:457
 [8] glm(::Array{Float64,2}, ::Array{Int64,1}, ::Bernoulli{Float64}, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{,Tuple{}}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\GLM\6V3fS\src\glmfit.jl:479
 [9] glm(::Array{Float64,2}, ::Array{Int64,1}, ::Bernoulli{Float64}, ::Vararg{Any,N} where N) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\GLM\6V3fS\src\glmfit.jl:479
 [10] fit(::LinearBinaryClassifier{LogitLink}, ::Int64, ::DataFrame, ::CategoricalArray{Int64,1,UInt32,Int64,CategoricalValue{Int64,UInt32},Union{}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\GLM.jl:129
 [11] fit!(::NodalMachine{LinearBinaryClassifier{LogitLink}}; rows::Function, verbosity::Int64, force::Bool) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\machines.jl:183
 [12] fit!(::Node{NodalMachine{LinearBinaryClassifier{LogitLink}}}; rows::Nothing, verbosity::Int64, force::Bool) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\composition\networks.jl:339
 [13] (::MLJBase.var"#_fit#133"{Node{NodalMachine{LinearBinaryClassifier{LogitLink}}},Tuple{Standardizer,OneHotEncoder,LinearBinaryClassifier{LogitLink}},MLJBase.Source{:input}})(::LogisticRegPipe, ::Int64, ::DataFrame, ::CategoricalArray{Int64,1,UInt32,Int64,CategoricalValue{Int64,UInt32},Union{}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\composition\composites.jl:223
 [14] fit(::LogisticRegPipe, ::Int64, ::DataFrame, ::CategoricalArray{Int64,1,UInt32,Int64,CategoricalValue{Int64,UInt32},Union{}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\composition\composites.jl:384
 [15] fit!(::Machine{LogisticRegPipe}; rows::Nothing, verbosity::Int64, force::Bool) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\machines.jl:183
 [16] fit!(::Machine{LogisticRegPipe}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\machines.jl:146

ablaom commented 4 years ago
PosDefException: matrix is not positive definite; Cholesky factorization failed.

This has nothing to do with MLJ. You're using a model that uses Cholesky factorisation but your data is ill-conditioned. You might try setting drop_last=true in OneHotEncoder to remove the dependencies among the indicator columns. Or, use a more robust model.

Adding to a previous comment of @tlienart: if you have a probabilistic predictor but you want point predictions, you can add mean/median/mode at the end of your pipeline.
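Outside of a pipeline, the equivalent operations look like this (a rough sketch, assuming mach is any fitted probabilistic machine and X a compatible table):

ŷ = MLJ.predict(mach, X)   # probabilistic predictions (a vector of distributions)
predict_mode(mach, X)      # point predictions for a classifier (equivalently mode.(ŷ))
predict_mean(mach, X)      # point predictions for a probabilistic regressor (equivalently mean.(ŷ))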

ablaom commented 4 years ago

Sorry, I didn't read all of your post properly.

How do we indicate in the MLJ pipeline that the distribution should be Binomial()? Is the distribution defined by MLJ by default as part of @load LinearBinaryClassifier pkg=GLM?

If you are predicting a binomial target then your target should have Count element scitype (integer machine type). The only model for count that possibly does what you want is "LinearCountRegressor" from GLM. (The other two in the list below only do Poisson.) You control the distribution type with the link hyperparameter. There may be a binomial option, and if there is, you should get Binomial distributions predicted. If not, please raise an issue.

julia> ms = models() do m
           AbstractVector{Count} <: m.target_scitype
       end
3-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = EvoTreeCount, package_name = EvoTrees, ... )   
 (name = LinearCountRegressor, package_name = GLM, ... )
 (name = XGBoostCount, package_name = XGBoost, ... )    

julia> info("LinearCountRegressor")
Linear count regressor with specified link and distribution (e.g. log link and poisson).
→ based on [GLM](https://github.com/JuliaStats/GLM.jl).
→ do `@load LinearCountRegressor pkg="GLM"` to use the model.
→ do `?LinearCountRegressor` for documentation.
(name = "LinearCountRegressor",
 package_name = "GLM",
 is_supervised = true,
 docstring = "Linear count regressor with specified link and distribution (e.g. log link and poisson).\n→ based on [GLM](https://github.com/JuliaStats/GLM.jl).\n→ do `@load LinearCountRegressor pkg=\"GLM\"` to use the model.\n→ do `?LinearCountRegressor` for documentation.",
 hyperparameter_ranges = (nothing, nothing, nothing),
 hyperparameter_types = ("Bool", "Distributions.Distribution", "GLM.Link"),
 hyperparameters = (:fit_intercept, :distribution, :link),
 implemented_methods = Symbol[:predict, :fit],
 is_pure_julia = true,
 is_wrapper = false,
 load_path = "MLJModels.GLM_.LinearCountRegressor",
 package_license = "MIT",
 package_url = "https://github.com/JuliaStats/GLM.jl",
 package_uuid = "38e38edf-8417-5370-95a0-9cbb8c7f171a",
 prediction_type = :probabilistic,
 supports_online = false,
 supports_weights = false,
 input_scitype = Table{_s23} where _s23<:(AbstractArray{_s25,1} where _s25<:Continuous),
 target_scitype = AbstractArray{Count,1},
 output_scitype = Unknown,)
ablaom commented 4 years ago

Sorry, it is possible the tree boosters also do binomial, but they are deterministic predictors, not probabilistic ones.

drcxcruz commented 4 years ago

hi guys,

I am continuing to work on my Jupyter lab notebook. Thank you madly for your teachings :) I am getting another error, and it is strange: the fit!() and predict() calls work fine, but the fitted_params() call produces an error. I am only using the first 3 columns of X. The code, error, and report are below. I upgraded to Julia 1.4.1 today; Julia 1.4.1 is faster than Julia 1.2 or 1.3 from what I have seen so far.

Thank you for your patience and cooperation

using Queryverse, MLJ, CategoricalArrays, PrettyPrinting
@load LinearRegressor pkg = GLM
@load LinearBinaryClassifier pkg = GLM

X = copy(dfX)
y = copy(dfYbinary)

X=X[:,1:3]

@pipeline LinearBinaryClassifierPipe(
    std = Standardizer(),
    hot = OneHotEncoder(drop_last = true),
    reg = LinearBinaryClassifier()
)

coerce!(X, autotype(X, :string_to_multiclass))
yc = CategoricalArray(y[:, 1])
yc = coerce(yc, OrderedFactor)

LogisticModel = machine(LinearBinaryClassifierPipe(), X, yc)
fit!(LogisticModel)
fp = fitted_params(LogisticModel).fitted_params

┌ Info: Training Machine{LinearBinaryClassifierPipe} @ 1…23.
└ @ MLJBase C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\machines.jl:182
┌ Info: Training NodalMachine{Standardizer} @ 1…88.
└ @ MLJBase C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\machines.jl:182
┌ Info: Training NodalMachine{OneHotEncoder} @ 1…70.
└ @ MLJBase C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\machines.jl:182
┌ Info: Spawning 1 sub-features to one-hot encode feature :gender.
└ @ MLJModels C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\builtins\Transformers.jl:691
┌ Info: Spawning 2 sub-features to one-hot encode feature :ethnicity.
└ @ MLJModels C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\builtins\Transformers.jl:691
┌ Info: Training NodalMachine{LinearBinaryClassifier{LogitLink}} @ 8…88.
└ @ MLJBase C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\machines.jl:182
MethodError: no method matching coef(::Tuple{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Bernoulli{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},CategoricalValue{Int64,UInt32}})
Closest candidates are:
  coef(!Matched::Union{StatsModels.TableRegressionModel, StatsModels.TableStatisticalModel}, !Matched::Any...; kwargs...) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\StatsModels\dvYSo\src\statsmodel.jl:28
  coef(!Matched::GLM.LinPredModel) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\GLM\6V3fS\src\linpred.jl:255
  coef(!Matched::StatsBase.StatisticalModel) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\StatsBase\548SN\src\statmodels.jl:10
  ...

Stacktrace:
 [1] fitted_params(::LinearBinaryClassifier{GLM.LogitLink}, ::Tuple{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Bernoulli{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},CategoricalValue{Int64,UInt32}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJModels\gHake\src\GLM.jl:138
 [2] fitted_params(::NodalMachine{LinearBinaryClassifier{GLM.LogitLink}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\operations.jl:45
 [3] iterate at .\generator.jl:47 [inlined]
 [4] collect(::Base.Generator{Array{Any,1},typeof(fitted_params)}) at .\array.jl:665
 [5] fitted_params(::Node{NodalMachine{LinearBinaryClassifier{GLM.LogitLink}}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\composition\composites.jl:51
 [6] fitted_params(::LinearBinaryClassifierPipe, ::Node{NodalMachine{LinearBinaryClassifier{GLM.LogitLink}}}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\composition\composites.jl:55
 [7] fitted_params(::Machine{LinearBinaryClassifierPipe}) at C:\Users\BCP.juliapro\JuliaPro_v1.4.1-1\packages\MLJBase\ESDzL\src\operations.jl:45

ŷ = MLJ.predict(LogisticModel, X)
yhatResponse = [pdf(ŷ[i], y[i, 1]) for i in 1:nrow(y)]
residuals = y .- yhatResponse
r = report(LogisticModel)
println("========================================")
pprint(r)
println("\n========================================")

(machines = [NodalMachine{LinearBinaryClassifier{LogitLink}} @ 9…43,
             NodalMachine{OneHotEncoder} @ 8…08,
             NodalMachine{Standardizer} @ 1…06],
 reports = [(deviance = 4626.948518208599,
             dof_residual = 4735.0,
             stderror = [0.0733822942304518, 0.1170513744786416, 0.10058298368902677, 0.04532077492126873, 0.06692166868965095],
             vcov = [0.005384961106524599 0.0001409150321827271 0.0003200563573691588 0.0003967809618831737 -0.0031846087171349007;
                     0.0001409150321827271 0.013701024267339188 0.0026160533689566032 0.001457575986489906 -0.0028146639042563454;
                     0.0003200563573691588 0.0026160533689566032 0.010116936607787026 0.0010720776375320784 -0.0026800780167311137;
                     0.0003967809618831737 0.001457575986489906 0.0010720776375320784 0.0020539726394643004 -0.0014561432554718726;
                     -0.0031846087171349007 -0.0028146639042563454 -0.0026800780167311137 -0.0014561432554718726 0.004478509740207409]),
            (features_to_be_encoded = [:ethnicity, :gender],
             new_features = [:genderfemale, :ethnicityafam, :ethnicity__hispanic, :score]),
            (features_fit = [:score],)])

ablaom commented 4 years ago

@drcxcruz

Correction

In an earlier post above, I said:

If you are predicting a binomial target then your target should have Count element scitype (integer machine type). The only model for count that possibly does what you want is "LinearCountRegressor" from GLM. (The other two in the list below only do Poisson.) You control the distribution type with the link hyperparameter. There may be a binomial option, and if there is, you should get Binomial distributions predicted. If not, please raise an issue.

This is not quite correct. You control the distribution with the distribution hyperparameter (so set this to Distributions.Binomial(...)). You can independently choose the link function. The canonical link for Binomial is LogitLink(), the default for LinearCountRegressor.
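For illustration, a sketch of setting those hyperparameters (the keyword names are taken from the info("LinearCountRegressor") listing above; whether a Binomial distribution is actually accepted is exactly the thing to check, as noted):

using MLJ, Distributions
import GLM

@load LinearCountRegressor pkg=GLM

# Poisson with its canonical log link (the usual count-regression setup):
model = LinearCountRegressor(distribution = Poisson(), link = GLM.LogLink())

# For a binomial target one would try, e.g.,
#     LinearCountRegressor(distribution = Binomial(n), link = GLM.LogitLink())
# with n the (hypothetical) number of trials; if the wrapper rejects it, please open an issue.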