cscherrer / SossMLJ.jl

SossMLJ makes it easy to build MLJ machines from user-defined models written in the Soss probabilistic programming language
https://cscherrer.github.io/SossMLJ.jl/stable/
MIT License

Iris example for multinomial logistic regression #63

Closed DilumAluthge closed 4 years ago

DilumAluthge commented 4 years ago

Multi-class classifiers are a pretty popular use case in MLJ, so I think an important next step is adding an example. The Iris example is pretty common.

The Iris dataset is available in RDatasets.jl.

MLJ has an Iris example here:

I wrote a tutorial for Turing here: https://turing.ml/dev/tutorials/8-multinomiallogisticregression/

We'll need to do some preprocessing. In particular, we should standardize each input feature column: subtract the column's mean and divide by the column's standard deviation.
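
For example, a minimal sketch of that standardization (using the Statistics stdlib; X here is just a stand-in feature matrix):

using Statistics

# subtract each column's mean and divide by its standard deviation
standardize(X) = (X .- mean(X; dims=1)) ./ std(X; dims=1)

X = randn(150, 4)     # stand-in for the four Iris feature columns
Xs = standardize(X)   # each column now has mean ≈ 0 and std ≈ 1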

The tricky part is going to be getting MLJModelInterface.predict to return a Vector{UnivariateFinite}.

We can leave predict_joint and predict_particles unmodified.

DilumAluthge commented 4 years ago

@cscherrer I think you actually wrote up some of this code already, but I'm not sure where it is.

DilumAluthge commented 4 years ago

The nice thing about Iris is that we know that the model should be pretty good. So it can serve as a sanity check to make sure everything is working.

cscherrer commented 4 years ago

How about this:

# assumes `using Soss`, plus a `softmax` implementation (e.g. from StatsFuns or NNlib)
m = @model X,num_levels begin
    k = size(X,2)                       # number of features
    β ~ Normal() |> iid(k,num_levels)   # one coefficient per feature per class
    p = softmax.(eachrow(X*β))          # per-observation class probabilities

    y ~ For(p) do pj
            Categorical(pj)
        end
end

Then e.g.,

julia> X = randn(10,4);

julia> rand(m(X=X,num_levels=3)) |> pairs
pairs(::NamedTuple) with 4 entries:
  :k => 4
  :p => [[0.352116, 0.40508, 0.242804], [0.133515, 0.18489, 0.681595], [0.614616, 0.115557, 0.269826…
  :β => [0.0344964 0.280074 0.197168; 0.0608256 -0.324698 -0.338867; -1.23773 0.168543 -0.159374; 0.…
  :y => [2, 3, 1, 2, 1, 1, 3, 3, 3, 2]

DilumAluthge commented 4 years ago

Is this right? If you have three levels, you only need two coefficients per covariate, right?

DilumAluthge commented 4 years ago

E.g. https://tamaspapp.eu/DynamicHMCExamples.jl/latest/example_multinomial_logistic_regression/

DilumAluthge commented 4 years ago

The coefficients for the base class are fixed at zero, so its term in the softmax is always exp(0) = 1
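
Concretely, the base-class trick looks something like this (an illustrative sketch, not code from this repo; with K = 3 classes only two coefficient columns are free):

X = randn(10, 4)
β_free = randn(4, 2)                   # coefficients for classes 2 and 3 only
logits = hcat(zeros(10), X * β_free)   # base class logit pinned at 0, so exp(0) = 1
p = mapslices(r -> exp.(r) ./ sum(exp.(r)), logits; dims=2)   # each row sums to 1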

cscherrer commented 4 years ago

This can be a little tricky...

If you're doing MLE, the "base class" approach is clearly the way to go. The danger outside of that case is in losing symmetry. Like, if you have three classes, and consider the first as the base class but regularize the coefficients for the other two, the estimate will be biased toward the base class.

The way I'm doing it here, there are three parameters for each of the features. That's one more than we need, but then we can use softmax to project onto the 2-simplex.

We could reduce the parameter space if we had a parameterization of the simplex that keeps the symmetry we need. There's probably a nice way to do that; maybe we should check the Stan docs.
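
A quick way to see the non-identifiability of the full-K version (illustrative sketch, with softmax written out by hand):

softmax(x) = exp.(x .- maximum(x)) ./ sum(exp.(x .- maximum(x)))

v = [1.0, 2.0, 3.0]
softmax(v) ≈ softmax(v .+ 10.0)   # true: shifting every logit leaves p unchanged

So without a prior, every constant shift of β fits identically; the prior is what breaks the tie.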

DilumAluthge commented 4 years ago

The danger outside of that case is in losing symmetry. Like, if you have three classes, and consider the first as the base class but regularize the coefficients for the other two, the estimate will be biased toward the base class.

Ooooh, good point.

Yeah let's see what other PPLs do here.

cscherrer commented 4 years ago

https://mc-stan.org/docs/2_19/stan-users-guide/parameterizing-centered-vectors.html

EDIT: Oh, that's a slightly different issue. In the neighborhood though

cscherrer commented 4 years ago

Ok, here: https://mc-stan.org/docs/2_24/stan-users-guide/multi-logit-section.html

DilumAluthge commented 4 years ago

Ok, here: https://mc-stan.org/docs/2_24/stan-users-guide/multi-logit-section.html

Hmm, I don't quite see their solution.

They say the model is only identifiable if you have a "suitable prior"... but they don't tell us what that is?

Then they go on to outline the solution with (K-1) coefficients per covariate, but IIUC that's just the same thing you and I were discussing above, which as you point out is not necessarily what we want.

I like the idea of using the full K coefficients per covariate, because then the model is symmetric and we don't bias things in favor of an arbitrarily chosen base class. But I'm still confused on how we implement that...

cscherrer commented 4 years ago

I think it's just what we have. Gaussian is informative enough to give us identifiability.

DilumAluthge commented 4 years ago

I think it's just what we have. Gaussian is informative enough to give us identifiability.

Huh, I didn't know that.

cscherrer commented 4 years ago

Getting pickier, we could be using log-lengths - maybe interpretable in terms of allometric growth. Also, I think rstanarm does something to account for different scales. Rather than standardize, you can have different priors across the features. But maybe this fine-tuning is best left until the bulk of this is in place.

DilumAluthge commented 4 years ago

Sounds good to me!

DilumAluthge commented 4 years ago

The next main goal is to figure out how to make predict return a Vector{UnivariateFinite}.

After that I think we're pretty much ready to add the example to the docs.

cscherrer commented 4 years ago

Example is almost working here: https://github.com/cscherrer/SossMLJ.jl/blob/cs-dev/examples/example-classification.jl

But I'm getting

julia> fit!(mach)
[ Info: Training Machine{SossMLJModel{Model{,…},…}} @957.
┌ Error: Problem fitting the machine Machine{SossMLJModel{Model{,…},…}} @957, possibly because an upstream node in a learning network is providing data of incompatible scitype. 
└ @ MLJBase ~/.julia/packages/MLJBase/uYW4k/src/machines.jl:425
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: NamedTuple names and field types must have matching lengths
Stacktrace:
 [1] fit_only!(::Machine{SossMLJModel{Model{NamedTuple{(:X, :num_levels),T} where T<:Tuple,TypeEncoding(begin
    k = size(X, 2)
    β ~ Normal() |> iid(k, num_levels)
    p = smax.(eachrow(X * β))
    species ~ For(p) do pj
            Categorical(pj; check_args = false)
        end
end),TypeEncoding(Main)},var"#7#8",typeof(dynamicHMC),NamedTuple{(:num_levels,),Tuple{Int64}},Symbol}}; rows::Nothing, verbosity::Int64, force::Bool) at /home/chad/.julia/packages/MLJBase/uYW4k/src/machines.jl:436
 [2] fit_only! at /home/chad/.julia/packages/MLJBase/uYW4k/src/machines.jl:389 [inlined]
 [3] #fit!#87 at /home/chad/.julia/packages/MLJBase/uYW4k/src/machines.jl:481 [inlined]
 [4] fit!(::Machine{SossMLJModel{Model{NamedTuple{(:X, :num_levels),T} where T<:Tuple,TypeEncoding(begin
    k = size(X, 2)
    β ~ Normal() |> iid(k, num_levels)
    p = smax.(eachrow(X * β))
    species ~ For(p) do pj
            Categorical(pj; check_args = false)
        end
end),TypeEncoding(Main)},var"#7#8",typeof(dynamicHMC),NamedTuple{(:num_levels,),Tuple{Int64}},Symbol}}) at /home/chad/.julia/packages/MLJBase/uYW4k/src/machines.jl:479
 [5] top-level scope at REPL[29]:1
 [6] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1088

cscherrer commented 4 years ago

Also I'm not sure about the Vector{UnivariateFinite} thing. I don't see UnivariateFinite in Distributions.jl. Where does it live?

DilumAluthge commented 4 years ago

It's specific to MLJ.

DilumAluthge commented 4 years ago

I believe the distribution itself is defined in MLJBase and/or MLJModelInterface.

But the documentation is in MLJ: https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Prediction-types-for-probabilistic-responses-1

Use the distribution MMI.UnivariateFinite for Probabilistic models predicting a target with Finite scitype (classifiers). In this case the eltype of the training target y will be a CategoricalValue.

For efficiency, one should not construct UnivariateDistribution instances one at a time. Rather, once a probability vector or matrix is known, construct an instance of UnivariateFiniteVector <: AbstractArray{<:UnivariateFinite,1} to return. Both UnivariateFinite and UnivariateFiniteVector objects are constructed using the single UnivariateFinite function.
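
If I'm reading those docs right, the construction looks something like this (a sketch; the class labels here are just illustrative):

using MLJBase, CategoricalArrays

classes = categorical(["setosa", "versicolor", "virginica"])
probs = [0.8 0.1 0.1;     # one row of class probabilities per observation
         0.2 0.5 0.3]
yhat = UnivariateFinite(classes, probs)   # a whole vector of distributions in one call
pdf.(yhat, "setosa")                      # per-observation probability of "setosa"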

DilumAluthge commented 4 years ago

If we run into trouble, we can probably ping Anthony and Thibaut to help us with the specifics.

cscherrer commented 4 years ago

I had a silly mistake that's obvious now, just needed to sleep on it :) The NamedTuple constructor converts a vector argument into a tuple of field values, so a 3-element vector doesn't match the single name :y; wrapping the vector in tuple(...) makes it a 1-tuple:

julia> using NamedTupleTools

julia> namedtuple(:y)(randn(3))
ERROR: NamedTuple names and field types must have matching lengths
Stacktrace:
 [1] NamedTuple{(:y,),T} where T<:Tuple(::Tuple{Float64,Float64,Float64}) at ./boot.jl:547
 [2] NamedTuple{(:y,),T} where T<:Tuple(::Array{Float64,1}) at ./namedtuple.jl:104
 [3] top-level scope at REPL[33]:1
 [4] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1088

julia> namedtuple(:y)(tuple(randn(3)))
(y = [0.6148708039999163, 0.7939461461720311, -0.013690630507490925],)