JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Should clustering models return a categorical array? #418

Closed juliohm closed 4 years ago

juliohm commented 4 years ago

Describe the bug

Currently some clustering models like KMeans return a vector of Int as a prediction:

yhat = MLJBase.predict(model, theta, X)

but I think it would be more useful to return a categorical array with the available clusters as categories. This level of abstraction would simplify code and make bugs more apparent.

To Reproduce

using MLJ

@load KMeans

X = rand(100,2)
m = KMeans()
theta, _, __ = fit(m, 0, X)
c = predict(m, theta, X)

Now c is a vector of clusters as Int.

Expected behavior

c could instead be categorical(c).
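As a sketch of the proposal (using CategoricalArrays directly, outside of MLJ, with made-up assignments):

```julia
using CategoricalArrays

c = [2, 1, 1, 3, 2]    # raw Int cluster assignments, as KMeans returns today
cc = categorical(c)    # proposed: a CategoricalArray with clusters as levels
levels(cc)             # the distinct clusters, queryable as categories
```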

ablaom commented 4 years ago

Thanks @juliohm for that observation.

For consistency with the rest of the API, I agree this should be a categorical vector. And all the levels should appear in the pool, even if the actual predictions do not cover all classes.
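A sketch of that requirement (hypothetical values; the `levels` keyword of `CategoricalArrays.categorical` is one way to force the full pool):

```julia
using CategoricalArrays

k = 4                              # suppose the model was fit with 4 clusters
raw = [1, 3, 3, 1]                 # predictions happen to miss clusters 2 and 4
c = categorical(raw, levels=1:k)   # pool still lists all k clusters
levels(c)                          # [1, 2, 3, 4], including unseen clusters
```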

@tlienart Any objections?

Just to clarify: as this is an unsupervised model you also have transform (predict is not usually implemented for unsupervised models). The current behaviour is to project the input onto a space whose dimension equals the number of clusters. That is, the output of transform for a given observation is the vector of distances to each cluster. The expected input is a table, and so is the output.

In the example below the algorithm assigns 3 clusters (X is 10-dimensional):

using MLJ

m = @load KMeans

X = MLJ.table(rand(200, 10))
theta, _, report = MLJ.fit(m, 0, X)
julia> c = transform(m, theta, X) |> pretty
┌─────────────────────┬─────────────────────┬────────────────────┐
│ x1                  │ x2                  │ x3                 │
│ Float64             │ Float64             │ Float64            │
│ Continuous          │ Continuous          │ Continuous         │
├─────────────────────┼─────────────────────┼────────────────────┤
│ 1.1687702203063468  │ 0.8226492403729635  │ 1.2441672001399233 │
│ 0.7662137239477751  │ 0.7006824329995309  │ 1.0873398138248938 │
│ 0.7674789159640136  │ 1.392701275047787   │ 0.3883277024655074 │
│ 1.0347328939459484  │ 0.7220430145458749  │ 1.3668321574840325 │
│ 1.5241670643446659  │ 1.137851042320218   │ 0.7583693758138992 │
│ 0.8159035391810656  │ 1.3315006325642909  │ 1.3109975947728154 │
│          ⋮          │          ⋮          │         ⋮          │
└─────────────────────┴─────────────────────┴────────────────────┘
juliohm commented 4 years ago

Awesome @ablaom, nice to learn that most unsupervised models implement transform and that predict is only sometimes implemented. In the case of KMeans, I understand the result of transform is post-processed to find the closest cluster. That is nice to have when the model allows evaluation on new samples.
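That post-processing step can be sketched as follows (toy distance table; assumes column j of transform's output holds the distance to cluster j):

```julia
using Tables

# toy stand-in for transform's output: 2 observations, 3 clusters
dist = (x1 = [1.2, 0.5], x2 = [0.8, 1.1], x3 = [2.0, 0.9])

M = Tables.matrix(dist)                # rows = observations, cols = clusters
assignments = map(argmin, eachrow(M))  # nearest cluster per observation -> [2, 1]
```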

ablaom commented 4 years ago

Resolved on the dev branch of MLJModels.