JuliaAI / MLJLinearModels.jl

Generalized Linear Regression Models (penalized regressions, robust regressions, ...)

Problems when only one class is manifest in the training target #123

Closed · ablaom closed this 1 year ago

ablaom commented 2 years ago

In the training target below we have length(levels(y)) == 2 but y itself only exhibits one class. This crashes fit. Occasionally, especially in smaller data sets, a class may be "hidden" when we restrict to a particular fold, so this is an issue.

julia> X = (x1=rand(10), )
julia> using MLJBase
julia> y = coerce(vcat(fill("a", 10), ["b", ]), Multiclass)[1:10]
10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"
 "a"

julia> levels(y)
2-element Vector{String}:
 "a"
 "b"

julia> mach = machine(MultinomialClassifier(), X, y) |> fit!
[ Info: Training machine(MultinomialClassifier(lambda = 1.0, …), …).
[ Info: Solver: LBFGS()
┌ Error: Problem fitting the machine machine(MultinomialClassifier(lambda = 1.0, …), …). 
└ @ MLJBase ~/MLJ/MLJBase/src/machines.jl:617
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: ArgumentError: invalid Array dimensions
Stacktrace:
  [1] Array
    @ ./boot.jl:459 [inlined]
  [2] Array
    @ ./boot.jl:467 [inlined]
  [3] zeros
    @ ./array.jl:525 [inlined]
  [4] zeros
    @ ./array.jl:522 [inlined]
  [5] zeros
    @ ./array.jl:520 [inlined]
  [6] scratch(n::Int64, p::Int64, c::Int64; i::Bool)
    @ MLJLinearModels ~/MLJ/MLJLinearModels/src/fit/default.jl:49
  [7] fit(glr::GeneralizedLinearRegression{MultinomialLoss{0}, ScaledPenalty{L2Penalty}}, X::Matrix{Float64}, y::Vector{Int64}; solver::LBFGS)                                         
    @ MLJLinearModels ~/MLJ/MLJLinearModels/src/fit/default.jl:42
  [8] fit(m::MultinomialClassifier, verb::Int64, X::NamedTuple{(:x1,), Tuple{Vector{Float64}}}, y::CategoricalArrays.CategoricalVector{String, UInt32, String, CategoricalArrays.CategoricalValue{String, UInt32}, Union{}})      
    @ MLJLinearModels ~/MLJ/MLJLinearModels/src/mlj/interface.jl:78
  [9] fit_only!(mach::Machine{MultinomialClassifier, true}; rows::Nothing, verbosity::Int64, force::Bool)                                                                         
    @ MLJBase ~/MLJ/MLJBase/src/machines.jl:615
 [10] fit_only!
    @ ~/MLJ/MLJBase/src/machines.jl:568 [inlined]
 [11] #fit!#58
    @ ~/MLJ/MLJBase/src/machines.jl:683 [inlined]
 [12] fit!
    @ ~/MLJ/MLJBase/src/machines.jl:681 [inlined]
 [13] |>(x::Machine{MultinomialClassifier, true}, f::typeof(fit!))
    @ Base ./operators.jl:966
 [14] top-level scope
    @ REPL[17]:1
tlienart commented 2 years ago

MLJLM uses base data types (vectors of reals), so it doesn't get the levels information and will just see a vector with a single unique element. It could be made to take an optional levels argument to guarantee that this line:

https://github.com/JuliaAI/MLJLinearModels.jl/blob/55f22b0492152badf66cfac37410341e7fb9a0b6/src/fit/default.jl#L40

gets c = 2 in this case, as opposed to the c = 1 currently obtained via maximum(y).

That would mean calling fit with an additional argument, but I think that would be fine.
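
A minimal sketch of the failure mode (the names here are illustrative, not the package's internals): if the class count c is inferred as maximum(y) from the integer-coded target, a fold containing only the first class yields c = 1 even though the pool has two classes.

y_full = [1, 1, 1, 2]          # integer-coded target, two classes in the pool
y_fold = y_full[1:3]           # a fold that happens to contain only class 1

c_inferred = maximum(y_fold)   # 1 -- wrong: the pool has 2 classes
c_explicit = 2                 # what an optional class-count argument would supply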

tlienart commented 2 years ago

This here

https://github.com/JuliaAI/MLJLinearModels.jl/blob/55f22b0492152badf66cfac37410341e7fb9a0b6/src/mlj/interface.jl#L57-L81

should get the right number of classes via the interface, though. We would need to figure out why nclasses is not passed properly (or not computed properly).

There's definitely a bug here somewhere.
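
For reference, the distinction the interface needs to respect can be seen with CategoricalArrays directly (this is standard CategoricalArrays usage, not this package's code):

using CategoricalArrays

y = categorical(fill("a", 10); levels=["a", "b"])

length(levels(y))   # 2 -- class count from the pool (what the interface should pass on)
length(unique(y))   # 1 -- class count from the observed values only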

ablaom commented 1 year ago

It has something to do with the encoding, which is weird. In the encoding, the classes are numbered starting with -1, not 0 or 1.

ablaom commented 1 year ago

@tlienart The problem is that, for some reason I don't understand at all, the encoding of y is special-cased if the pool of y has only two classes:

https://github.com/JuliaAI/MLJLinearModels.jl/blob/55f22b0492152badf66cfac37410341e7fb9a0b6/src/mlj/interface.jl#L66

If we subsample and only see one of the two classes, then the encoded y looks like [-1, -1, ..., -1]. In view of this, I think the definition of getc at this line

https://github.com/JuliaAI/MLJLinearModels.jl/blob/55f22b0492152badf66cfac37410341e7fb9a0b6/src/loss-penalty/generic.jl#L41

is incorrect. The problem for me is that I really can't figure out what this getc is computing, as I'm not familiar enough with the code. What does the scratch function do, for example? I think you're the only one who can safely make a fix here.

For what it's worth, a better and safer design would probably be to remove all this binary special-casing altogether, if that makes sense here. But maybe you have your reasons...
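
To make the suspected arithmetic concrete (a hedged reconstruction, since the exact definition of getc is unconfirmed here): if the class count is derived from the maximum of a ±1-encoded target that contains only the negative class, the scratch-space allocation receives a nonsensical dimension, which would explain the "invalid Array dimensions" error above.

y_encoded = fill(-1, 10)   # a one-class subsample after the binary -1/+1 recoding

c = maximum(y_encoded)     # -1 -- a nonsensical class count
p = 1                      # number of features

zeros(p * c)               # ERROR: ArgumentError: invalid Array dimensions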

tlienart commented 1 year ago

The binary/multiclass distinction lies in the internal representation of the target vector. For the binary case it's more convenient to use -1/+1, since that allows computations with a single column as opposed to one column per class. I'm not quite ready to say that this is essential enough to warrant the code style, but when initially working on this I was keen to try things like this to squeeze out a bit more performance. The same goes for the scratch space, which initialises a set of arrays in which computations can be done in place, so that you only need to allocate once.
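
A generic sketch of the standard trick being referred to (not the package's exact code): with y ∈ {-1, +1} the logistic loss depends only on the margin y * x'θ, so a single coefficient column suffices, whereas the multinomial formulation carries one column per class.

logistic_loss(m) = log1p(exp(-m))   # per-sample loss in terms of the margin m

X = randn(5, 2)                     # 5 samples, 2 features
θ = randn(2)                        # one coefficient column, not one per class
y = [1, -1, 1, 1, -1]               # ±1-encoded binary target

total_loss = sum(logistic_loss.(y .* (X * θ)))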

Anyway, the problem here is that I was trying to do a bit too much for the user.

I've removed this by only recoding to -1/+1 when the user explicitly specifies the binary case; otherwise the multinomial path, with its redundant computations, is used. A test case with your example has been added. Maybe not the way you'd have preferred, but I can't do much more at the moment.
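
For the record, a hedged sketch of the recoding logic just described (encode_target and its keyword are hypothetical names, not the package's API):

# Recode to -1/+1 only on the explicit binary path; otherwise use the
# 1..c integer coding of the multinomial path, even for a two-level pool.
function encode_target(y, classes; binary::Bool=false)
    binary && length(classes) == 2 &&
        return [v == classes[2] ? 1 : -1 for v in y]
    return [findfirst(==(v), classes) for v in y]
end

encode_target(fill("a", 3), ["a", "b"])               # [1, 1, 1] -- multinomial path
encode_target(fill("a", 3), ["a", "b"]; binary=true)  # [-1, -1, -1]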