Closed by juliohm 4 years ago
Thanks @juliohm for that observation.
For consistency with the rest of the API, I agree this should be a categorical vector. And all the levels should appear in the pool, even if the actual predictions do not cover all classes.
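A minimal sketch of what "all the levels should appear in the pool" could look like, assuming CategoricalArrays.jl is used for the output. The values of `k` and `raw` are made up for illustration and are not taken from any model:

```julia
using CategoricalArrays

# Hypothetical raw assignments for k = 3 clusters; cluster 3 happens
# not to be assigned to any observation here.
k = 3
raw = [1, 2, 2, 1, 1]

# Passing `levels` explicitly keeps every cluster in the pool,
# including clusters that never occur in the predictions.
c = categorical(raw; levels=collect(1:k))

levels(c)    # all k cluster labels
unique(raw)  # only the labels actually observed
```

Calling `categorical(raw)` without `levels` would build the pool from the observed values only, so the unobserved cluster would be missing.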
@tlienart Any objections?
Just to clarify, as this is an unsupervised model you also have `transform` (`predict` is not usually implemented for unsupervised models). The current behaviour is to project the input onto a space of the same dimension as the number of clusters. That is, the output of `transform` for a given observation is the distances from each cluster. The input expected is a table and so is the output.
In the example below the algorithm assigns 3 clusters (`X` is 10 dimensional):
using MLJ
m = @load KMeans
X = MLJ.table(rand(200, 10))
theta, _, report = MLJ.fit(m, 0, X)
c = transform(m, theta, X) |> pretty
julia> c = transform(m, theta, X) |> pretty
┌─────────────────────┬─────────────────────┬────────────────────┐
│ x1 │ x2 │ x3 │
│ Float64 │ Float64 │ Float64 │
│ Continuous │ Continuous │ Continuous │
├─────────────────────┼─────────────────────┼────────────────────┤
│ 1.1687702203063468 │ 0.8226492403729635 │ 1.2441672001399233 │
│ 0.7662137239477751 │ 0.7006824329995309 │ 1.0873398138248938 │
│ 0.7674789159640136 │ 1.392701275047787 │ 0.3883277024655074 │
│ 1.0347328939459484 │ 0.7220430145458749 │ 1.3668321574840325 │
│ 1.5241670643446659 │ 1.137851042320218 │ 0.7583693758138992 │
│ 0.8159035391810656 │ 1.3315006325642909 │ 1.3109975947728154 │
│ ⋮ │ ⋮ │ ⋮ │
└─────────────────────┴─────────────────────┴────────────────────┘
Awesome @ablaom, nice to learn about the fact that most unsupervised models implement `transform` and that only sometimes we have an implementation of `predict`. In the case of `KMeans` I understand that the result of `transform` is post-processed to find out the closest cluster. That is nice to have when the model allows evaluation on new samples.
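As a rough illustration of that post-processing step (a sketch under assumptions, not the actual MLJModels implementation), the closest cluster can be read off a distance matrix with a row-wise `argmin`. The matrix `D` below is a hypothetical stand-in holding the first three rows of the table posted earlier in the thread:

```julia
using CategoricalArrays

# Hypothetical distances: rows are observations, columns are clusters.
D = [1.1687702203063468 0.8226492403729635 1.2441672001399233;
     0.7662137239477751 0.7006824329995309 1.0873398138248938;
     0.7674789159640136 1.392701275047787  0.3883277024655074]

# Index of the nearest cluster for each observation.
nearest = [argmin(row) for row in eachrow(D)]

# Wrap as a categorical vector whose pool holds all clusters.
labels = categorical(nearest; levels=collect(1:size(D, 2)))
```

Here cluster 1 is never the nearest, yet it still appears in `levels(labels)` because the levels are passed explicitly.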
resolved on dev branch of MLJModels
Describe the bug
Currently some clustering models like `KMeans` return a vector of `Int` as prediction, but I think it would be more useful to return a categorical array with the available clusters as categories. This level of abstraction will simplify code and make bugs more apparent.
To Reproduce
Now `c` is a vector of clusters as `Int`.
Expected behavior
`c` could instead be `categorical(c)`.