JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/
Other
1.8k stars 157 forks source link

New BetaML models (to be checked) #749

Closed sylvaticus closed 3 years ago

sylvaticus commented 3 years ago

Hello, I have interfaced to MLJ the following BetaML models: KMeans, KMedoids, GMM, MissingImputator. The MLJ documentation is a bit sparse for unsupervised models, compared to supervised ones, so let me check I done it correctly.

KMeans and KMedoids:

GMM:

MissingImputator

Please let me know if this set-up makes sense for MLJ. The full interface for the new models is here.

ablaom commented 3 years ago

@sylvaticus Thanks for this extra work. I have not had a chance to review the interfaces but will already respond to your nice outline:

Hello, I have interfaced to MLJ the following BetaML models: KMeans, KMedoids, GMM, MissingImputator. The MLJ documentation is a bit sparse for unsupervised models, compared to supervised ones, so let me check I done it correctly.

KMeans and KMedoids:

  • implemented as <: MMI.Unsupervised
  • fit trains the model (compute the centroids/assign the elements) --> fitResults
  • transform returns a (nObservation, nClasses) matrix of distances from class centres
  • predict returns a CategoricalArray of the predicted classes given new data (the centroids don't move further)

This is correct. I'm surmising that you found the relevant docs that exist, but FYI there is some here and here and existing implementations here and here

GMM:

  • still implemented as <: MMI.Unsupervised (there is no distinction in unsupervised models between deterministic and probabilistic)
  • fit trains the model (estimate the mixtures/assign the probabilities) --> fitResults
  • predict returns a vector of UniVariateFinite distributions
  • transform is just an alias of predict (not sure what else should it do)

So far we have not implemented any "distribution-fitters" but the design was discussed and the decision made to implement them in different way, namely as supervised models (subtype Probabilistic) where the data to be fit is y and X is nothing. This actually quite natural, from a perfomance evaluation point-of-view, once you wrap your head around the apparent perversion.

There is a sample implementation linked from this section of the docs (recently tweaked). As this will be our first such model, and the API is marked experimental, there could be some teething issues, but I don't anticipate major issues.

It will be great to finally have GMM!

MissingImputator

  • implemented as <: MMI.Static
  • transform return a copy of the X with the missing value imputed

Please let me know if this set-up makes sense for MLJ. The full interface for the new models is here.

I always imagined imputers should learn from the data to prevent data leakage. So, if imputing with mean value, then you learn the mean from the training data and use that for imputing the validation data. That is how the existing MLJ imputer is designed. This kind of imputer cannot be Static.

If however, the BetaML imputer is genuinely static, in the sense that it transforms data without ever seeing training data, then it should indeed be Static.

Hope this helps. When you're ready, let me know and I will try to find time to review the interfaces more carefully.

sylvaticus commented 3 years ago

So far we have not implemented any "distribution-fitters" but the design was discussed and the decision made to implement them in different way, namely as supervised models (subtype Probabilistic) where the data to be fit is y and X is nothing. This actually quite natural, from a perfomance evaluation point-of-view, once you wrap your head around the apparent perversion.

There is a sample implementation linked from this section of the docs (recently tweaked). As this will be our first such model, and the API is marked experimental, there could be some teething issues, but I don't anticipate major issues.

I am trying to implement GMM as a <: MMI.Probabilistic model like you described it, but I have the problem that when I try to create the machine it complains that the number of observations on X(nothing) and Y are not the same:

julia> X,y = Mlj.@load_iris
julia> modelMachine =  Mlj.machine(model, nothing, X)
ERROR: DimensionMismatch("Differing number of observations in input and target. ")
Stacktrace:
 [1] check(::GMM{DiagonalGaussian{Float64}}, ::MLJBase.Source, ::Vararg{MLJBase.Source, N} where N; full::Bool)
   @ MLJBase ~/.julia/packages/MLJBase/4DmTL/src/machines.jl:108
 [2] #machine#87
   @ ~/.julia/packages/MLJBase/4DmTL/src/machines.jl:249 [inlined]
 [3] machine(model::GMM{DiagonalGaussian{Float64}}, raw_arg1::Nothing, raw_args::NamedTuple{(:sepal_length, :sepal_width, :petal_length, :petal_width), NTuple{4, Vector{Float64}}})
   @ MLJBase ~/.julia/packages/MLJBase/4DmTL/src/machines.jl:248
 [4] top-level scope
   @ none:1", "setosa", "setosa", "setosa", "setosa"  …  "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica"])

Let me comment that I think there are several way to evaluate the output of the cluster. In particular, if labels are actually available (but not used in the fitting) the way I implemented accuracy() in BetaML is to loop over all the permutations that maintain the structure (as the clustering algorithm has no clue on the actual name of the classes) and retain the one for which the accuracy is maximal.

Concerning the MissingInputator, it uses GMM in the background, so missing values are computed as the expected values of the mixtures weighted by the probability of the specific record to be part of such mixture. In my workflow I was thinking to something like:

transform(MissingInputator, Xtrain_with_missing) --> Xtrain_full
transform(MissingInputator, Xtest_with_missing)  --> Xtest_full

While I guess you are thinking to:

fit(MissingImputator, Xtrain_with_missing) --> fitresults
predict(MissingImputator,fitresults,Xtrain_with_missing) -> Xtrain_full
predict(MissingImputator,fitresults,Xtest_with_missing)  -> Xtest_full

Right ? How should then implement MissingInputator ? As :< MMI.Deterministic model ?

It will be great to finally have GMM!

Great, but please consider that the model is not computationally efficient (but for a few thousand of data it works fine). However it is very flexible, as it can use any mixture (gaussian ones being the only one implemented) and, contrary to scikit-learn gmm model, it accepts X matrices with missing data (hence I can use it for missing imputation ...).

EDIT: Concerning the MissingInputator, I think I did manage to support both the strategies:

ablaom commented 3 years ago

I am trying to implement GMM as a <: MMI.Probabilistic model like you described it, but I have the problem that when I try to create the machine it complains that the number of observations on X(nothing) and Y are not the same:

You need the latest version of MLJBase where I relaxed this check for just this case.

Will look over your other comments Monday.

sylvaticus commented 3 years ago

Great, thanks...

ablaom commented 3 years ago

Concerning the MissingInputator, it uses GMM in the background, so missing values are computed as the expected values of the mixtures weighted by the probability of the specific record to be part of such mixture. In my workflow I was thinking to something like:

transform(MissingInputator, Xtrain_with_missing) --> Xtrain_full
transform(MissingInputator, Xtest_with_missing)  --> Xtest_full

While I guess you are thinking to:

fit(MissingImputator, Xtrain_with_missing) --> fitresults
predict(MissingImputator,fitresults,Xtrain_with_missing) -> Xtrain_full
predict(MissingImputator,fitresults,Xtest_with_missing)  -> Xtest_full

Right ?

Well, almost. Rather:

 fit(MissingImputator, Xtrain_with_missing) --> fitresults
transform(MissingImputator,fitresults,Xtrain_with_missing) -> Xtrain_full
transform(MissingImputator,fitresults,Xtest_with_missing)  -> Xtest_full

I am not suggesting that MissingImputor be implemented as a Supervised model. The choices I am considering are:

  1. plain Unsupervised, if one separates training from transformation (as in the above workflow); then the signatures must look like MMI.fit(model, verbosity, Xtrain) and MMI.transform(model, fitresult, Xnew).

  2. Static (a subtype of Unsupervised), if there is no training of consequence. Then fit is never implemented (it has a no-op fallback) and we have signature transform(model, fitresult, X...) . I don't think this applies here.

Concerning the MissingInputator, I think I did manage to support both the strategies:

  • with transform(m::MissingInputator, X_with_missing) I obtain a X_full without learning from previous data
  • with fit(m::MissingInputator, X_with_or_without_missing) and then predict(m::MissingInputator, fitresults, X2_with_missing) I obtain a X2_full using the learning information from X

This is not compatible with the API. Your MMI.transform method must always include a fitresult in the second slot.

The workflow you are after, I'm guessing, is a "fit and transform in one hit" option, like sk-learn's fit_transform method. We don't have that in the current API, although there's the possibility to at least offer this on the user side as a shortcut (https://github.com/alan-turing-institute/MLJ.jl/issues/546). Rather, if you want to fit and transform on the same data, then you just have to do it in steps. So, assuming you implement the imputation as a Unsupervised model, the user workflow to fit and transform on the same data (ie, to reproduce your first workflow on the user side): is

mach = machine(MissingImputator(), Xtrain) |> fit!
Xtrain_full = transform(mach, Xtrain)

mach = machine(MissingImputator(), Xtest) |> fit!
Xtest_full = transform(mach, Xtest)

Whereas, a non-leakage workflow would replace the second block with

Xtest_full = transform(mach, Xtest)

(Or write the whole thing as

mach = machine(MissingImputator(), X)
fit!(mach, rows=train)
Xfull  = transform(mach,  X)
...

)

Does that address your question?

sylvaticus commented 3 years ago

Yes, thanks I think I got it this time. I implemented MissingImputator as a unsupervised model as you suggested:

fit(m::MissingImputator, verbosity, X) --> (fitResults, cache, report)
transform(m::MissingImputator, fitResults, X) --> table(X̂_full)

I test them with:

julia> import MLJBase

julia> const Mlj = MLJBase
MLJBase

julia> using BetaML

julia> X = [1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38; missing -2.3; 5.2 -2.4]
9×2 Matrix{Union{Missing, Float64}}:
 1.0       10.5
 1.5         missing
 1.8        8.0
 1.7       15.0
 3.2       40.0
  missing    missing
 3.3       38.0
  missing  -2.3
 5.2       -2.4

julia> X = Mlj.table(X)
Tables.MatrixTable{Matrix{Union{Missing, Float64}}}: (x1 = Union{Missing, Float64}[1.0, 1.5, 1.8, 1.7, 3.2, missing, 3.3, missing, 5.2], x2 = Union{Missing, Float64}[10.5, missing, 8.0, 15.0, 40.0, missing, 38.0, -2.3, -2.4])

julia> model                       =  MissingImputator()
MissingImputator(
    K = 3,
    p₀ = nothing,
    mixtures = DiagonalGaussian{Float64}[DiagonalGaussian{Float64}(nothing, nothing), DiagonalGaussian{Float64}(nothing, nothing), DiagonalGaussian{Float64}(nothing, nothing)],
    tol = 1.0000000000000004e-6,
    minVariance = 0.05,
    minCovariance = 0.0,
    initStrategy = "kmeans",
    rng = Random._GLOBAL_RNG()) @761

julia> modelMachine                =  Mlj.machine(model,X)
Machine{MissingImputator{DiagonalGaussian{Float64}},…} @335 trained 0 times; caches data
  args: 
    1:  Source @274 ⏎ `ScientificTypes.Table{AbstractVector{Union{Missing, ScientificTypes.Continuous}}}`

julia> (fitResults, cache, report) = Mlj.fit(model, 0, X)
((pₖ = [0.49997459736828426; 0.25001270131585945; 0.25001270131585634], mixtures = DiagonalGaussian{Float64}[DiagonalGaussian{Float64}([1.5, 11.166666666666666], [0.05, 0.05]), DiagonalGaussian{Float64}([3.2499999999999782, 39.0], [0.05, 0.05]), DiagonalGaussian{Float64}([5.199999999999994, -2.3499999999999996], [0.05, 0.05])]), nothing, ([2.886751345948105, 0.1814436846505966, 0.020160409405627352, 0.0022400454895143513], -275.7794472196741, 582.3200385220554, 579.5588944393483))

julia> XD                          =  Mlj.transform(model,fitResults,X)
Tables.MatrixTable{Matrix{Float64}}: (x1 = [1.0, 1.5, 1.8, 1.7, 3.2, 2.8625692221714156, 3.3, 5.199999999999994, 5.2], x2 = [10.5, 11.166666666667362, 8.0, 15.0, 40.0, 14.746015173838764, 38.0, -2.3, -2.4])

julia> XDM                         =  Mlj.matrix(XD)
9×2 Matrix{Float64}:
 1.0      10.5
 1.5      11.1667
 1.8       8.0
 1.7      15.0
 3.2      40.0
 2.86257  14.746
 3.3      38.0
 5.2      -2.3
 5.2      -2.4

julia> # Use the previously learned structure to imput missings..
       Xnew_withMissing            = Mlj.table([1.5 missing; missing 38; missing -2.3; 5.1 -2.3])
Tables.MatrixTable{Matrix{Union{Missing, Float64}}}: (x1 = Union{Missing, Float64}[1.5, missing, missing, 5.1], x2 = Union{Missing, Float64}[missing, 38.0, -2.3, -2.3])

julia> XDNew                       = Mlj.transform(model,fitResults,Xnew_withMissing)
Tables.MatrixTable{Matrix{Float64}}: (x1 = [1.5, 3.2499999999999782, 5.199999999999994, 5.1], x2 = [11.166666666667362, 38.0, -2.3, -2.3])

julia> XDMNew                      =  Mlj.matrix(XDNew)
4×2 Matrix{Float64}:
 1.5   11.1667
 3.25  38.0
 5.2   -2.3
 5.1   -2.3

Could you please check and validate the clusters MLJ interface and the Perceptron one ?

ablaom commented 3 years ago

Thanks for the work on these interfaces. I've just had a look at the clustering ones, but not the perceptron.

KMeans and KMedoids

The metadata looks good, except output_scitype is meant to refer to the scitype of the output of the transform method, which here is Table(Continuous). (You can specify a target_scitype for an unsupervised model, which falls back to Unknown and is not really used anywhere that I know of. Still it wouldn't hurt if you want to declare it to be AbstractVector{<:MultiClass}, a supertype of the scitype of the output of predict.)

However, I notice some strange behaviour in the KMeans model noted below:

using Random
Random.seed!(123)
X, _ = MLJBase.make_regression(20, 8);
mach = MLJBase.machine(KMeans(), X) |> MLJBase.fit!;

# It appears we have three clusters:
julia> MLJBase.schema(MLJBase.transform(mach, X))
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1      │ Float64 │ Continuous │
│ x2      │ Float64 │ Continuous │
│ x3      │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 20

# Bit surprised that all training data is assigned a single class:
yhat = MLJBase.predict(mach, X)
20-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2

# And I would say the following is incorrect, as the predicted `yhat` should have
# three levels in its pool, not just the ones that are manifest for
# the particular input `X` in definition of `yhat`:
julia> MLJBase.classes(yhat)
1-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 2

For comparison, here is the behaviour of the Clustering.jl model:

using MLJ
KMeans = @load KMeans pkg=Clustering
model = KMeans()

using Random
Random.seed!(123)

X, _ = make_regression(20, 8);
mach = machine(KMeans(), X) |> fit!;
schema(transform(mach, X))

# again, three clusters:
julia> schema(transform(mach, X))
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1      │ Float64 │ Continuous │
│ x2      │ Float64 │ Continuous │
│ x3      │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 20

yhat = predict(mach, X)
20-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 3
 1
 1
 3
 1
 3
 3
 1
 2
 3
 3
 1
 3
 3
 1
 3
 1
 1
 1

# All clusters are tracked, even if not explicitly manifest:
julia> ysmall = predict(mach, selectrows(X, 1:2))
2-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 3

julia> classes(ysmall)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 3

Finally, can I suggest you overload MMI.fitted_params to make them more user-friendly. So, the fallback is giving:

MLMBase.fitted_params(mach)
(fitresult = ([3, 3, 2, 3, 1, 1, 3, 1, 2, 2  …  2, 3, 3, 3, 3, 3, 2, 3, 1, 1], [-0.09844314369373444 -0.9399263119056788; -0.994364450422251 0.5485269272699657; 1.0271438614771895 0.7414183922931447], BetaML.Clustering.var"#39#41"()),)

Maybe something like

MMI.fitted_params(model::KMeans, fitresult) = (centers=fitesult[2], cluster_labels=CategoricalArrays.categorical(fitresults[1]))

GMM

The metadata looks good. However, I get an error when I try to use this as I expected it to work (cf the example referred to in docs, which is here)

using MLJBase
import BetaML

GMM = BetaML.Clustering.GMM
y, _ = make_regression(1000, 3, rng=123);

julia> schema(y)
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1      │ Float64 │ Continuous │
│ x2      │ Float64 │ Continuous │
│ x3      │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 1000

 mach = machine(GMM(), nothing, y) |> fit!

# expecting a multivariate distrubution here:
julia> d = predict(mach, nothing)
ERROR: ArgumentError: Function `matrix` only supports AbstractMatrix or containers implementing the Tables interface.
Stacktrace:
 [1] matrix(::MLJModelInterface.FullInterface, ::Val{:other}, ::Nothing; kw::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:32
 [2] matrix(::MLJModelInterface.FullInterface, ::Val{:other}, ::Nothing) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:32
 [3] matrix(::Nothing; kw::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:27
 [4] matrix(::Nothing) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:27
 [5] predict(::BetaML.Clustering.GMM{BetaML.Clustering.DiagonalGaussian{Float64}}, ::NamedTuple{(:pₖ, :mixtures),Tuple{Array{Float64,2},Array{BetaML.Clustering.DiagonalGaussian{Float64},1}}}, ::Nothing) at /Users/anthony/.julia/packages/BetaML/1uT9C/src/Clustering_MLJ.jl:149
 [6] predict(::Machine{BetaML.Clustering.GMM{BetaML.Clustering.DiagonalGaussian{Float64}},true}, ::Nothing) at /Users/anthony/.julia/packages/MLJBase/pCCd7/src/operations.jl:83
  [7] top-level scope at REPL[16]:1
sylvaticus commented 3 years ago

Thanks, I'll have a look on this...

sylvaticus commented 3 years ago
  1. Metadata of KMedoids and KMeans

    • Implemented:
      input_scitype    = MMI.Table(MMI.Continuous),         # scitype of the inputs
      output_scitype   = MMI.Table(MMI.Continuous),         # scitype of the output of `transform`
      target_scitype   = AbstractArray{<:MMI.Multiclass},   # scitype of the output of `predict`
  2. KMeans homogeneous predictions

    • Solved switching default initStrategy to "random" (it was due to an unfortunate "grid" init strategy that for random variables and 3 classes it was selecting and accumulating points toward the central class)
  3. Missing factors in KMeans and KMedoids output

    • Solved with passing all the classes in the returned categorical array, i.e.: return CategoricalArray(assignedClasses,levels=1:nCl)
  4. More user-friendly fitted parameters for KMEans/KMedoids

    • Implemented the way suggested (and used named tuples in the code)
  5. Error in predict(GMM)

It seems there could be a problem in the MLJ design/architecture (still a bit voodoo for me...)

This works:

using MLJBase, BetaML
y, _                        =  make_regression(1000, 3, rng=123);
ym                          =  MLJBase.matrix(y)
model                       =  GMM(rng=copy(BetaML.FIXEDRNG))
(fitResults, cache, report) =  MLJBase.fit(model, 0, nothing, y)
yhat_prob                   =  MLJBase.transform(model,fitResults,y) # ok
yhat_prob                   =  MLJBase.predict(model, fitResults, y) # ok

However if we run instead:

modelMachine                =  MLJBase.machine(model, nothing, y)
mach                        =  MLJBase.fit!(modelMachine)
yhat_prob                   =  MLJBase.predict(mach, nothing)

we have the error that you reported. This is due to the fact that in yhat_prob = MLJBase.predict(model, fitResults, y) the third parameter, the "X" in MMI.predict(m::GMM, fitResults, X) is a Tables.MatrixTable{Matrix{Float64}}, while in MLJBase.predict(mach, nothing) it is nothing, and then of course MMI.matrix(X) complains about it.

How can I construct the interface so that the "version" of predict that uses the "machine" understand that the X is embedded in the machine ?

ablaom commented 3 years ago

Thanks for these new changes. I'll look over these and Perceptron shortly.

It seems there could be a problem in the MLJ design/architecture (still a bit voodoo for me...)

Nah. I think you just need to delete this irrelevant line:

https://github.com/sylvaticus/BetaML.jl/blob/863113c91717288f7f9a6391298b8b2b38b86cb7/src/Clustering_MLJ.jl#L151

X is alwaysnothing and you make no use of it in the rest of the method.

sylvaticus commented 3 years ago

Arghh... yes, sure... sorry...

EDIT Actually, no I was using it 4 lines later, in the gmm(x;...) function. My original understood was that fit, for a GMM model, was responsible to compute the mixtures. The posterior probabilities (probabilistic class assignments) were not even included in fitresults. predict than would have used the information of the mixtures provided by fit to compute the posterior probabilities (by running a single E-step of the EM algorithm) of X, either the "original" X or a new matrix. The idea was that this could have allowed to estimate the class probabilities for a new dataset, perhaps a small one, assuming the observations were identically distributed of the original X, i.e. without running the EM algorithm on this second set.

Any how, I have now changed it, fitresults returned by fit includes the class probabilities and predict returns the class probabilities of the original X in the required format. I think this is what you originally meant....

ablaom commented 3 years ago

No, what I had in mind was that predict return a single multivariate mixture distribution, d. Then, for example, one could call pdf(d, y) on a single new observation y (you probably want me to call this x) to get a single probability. A technical issue I have overlooked is that y is going to be the row of a table, but the mixture distributions in Distribution.jl have ordinary Euclidean (Vector{Float64}) support.

In any case, I can see there is some tension here between two competing conceptualisations of GMM. We want either:

  1. to view GMM as a "distribution fitter", in which case I maintain the integration I have suggested is the appropriate one (modulo technical difficulties we should sort out anyway); or

  2. to view GMM as a "probabilistic" clustering algorithm, ie, a clusterer that predicts a categorical distribution instead of a single class. I see that this is how sk-learn conceptualises GMM, for example.

In either case, interface points can be introduced to get the "secondary" objects, but those interface points will be less discoverable to the user. So we should probably choose according to the greater use-case. On these grounds, my push for case 1 may have been mistaken, and I suspect you agree, yes?

If we revert to case 2, then GMM becomes Unsupervised and predict(::GMM, fitresult, Xnew) outputs a vector of UnivariateFinite element type for new data. It would be good to overload fitted_params to return a named tuple which includes the keys mixtures and weights (or whatever you want to call them), and perhaps even the Distribution.jl object I discuss above (with the Euclidean support).

Thanks for you patience. What are your thoughts?

ablaom commented 3 years ago

Please also see this side issue: https://github.com/sylvaticus/BetaML.jl/issues/20

sylvaticus commented 3 years ago

A quick reply: building the interface is a few lines of code. We can have two separate MLJ models that wrap BetaML.gmm for the two interpretations. We have to take care of the names however not to generate confusion. I'll have a look...

ablaom commented 3 years ago

Yes, that's a very good idea, thanks. Maybe GMMFitter and GMMClusterer??

A technical issue I have overlooked is that y is going to be the row of a table, but the mixture distributions in Distribution.jl have ordinary Euclidean (Vector{Float64}) support.

In the "distribution" fitting case we could make the following pragmatic choice: the training "target" y (in training data = (nothing, y)) is be a matrix instead of a table. That is target_scitype(...) = AbstractMatrix{Continuous}. Then we can use Distributions.jl to output an object d with a consistent representation. That is, y is matrix whose rows are observations, and the call pdf(d, selectrows(y, i))=pdf(d, y[i, :]) now makes sense.

sylvaticus commented 3 years ago

Hello, I did implement GMMClusterer, but I have troubles to see for GMMFitter how to return in predict the Distribution object, as the mixturer I use aren't Distributions.jl object, but rather are custom types defined in BetaML.

ablaom commented 3 years ago

Does your object implement any of the Distributions.jl API, such as pdf?

sylvaticus commented 3 years ago

They implement only what is needed for the EM algo, that is the log pdf (lpdf):

{Spherical|Diagonal|Full}Gaussian mixtures for gmm / predictMissing are already provided. User defined mixtures can be used defining a struct as subtype of Mixture and implementing for that mixture the following functions:

ablaom commented 3 years ago

What about converting this to a Distributions.MixtureModel? You already have Distributions as a dependency, right?

ablaom commented 3 years ago

Does Distributions not already provide the atomic distributions you need?

sylvaticus commented 3 years ago

If I remember correctly the "problem" was mosty with missing data. I'll check better tomorrow the mixture model of Distributions.

ablaom commented 3 years ago

Second thoughts, I think it's just fine, maybe even better, if you implement the Distributions.jl interface for your object (the complete component distributions + weights object). The minimum would be

Distribution.logpdf(::YourDist, single_observation) Distribution.pdf(::YourDist, single_observation)

Here single_observation is a going to be a float vector or single row of a table, depending on your choice for target_scitype, as discussed earlier.

I think also StatsBase.params(::YourDist) is normally overloaded, for returning parameters.

And implementing Base.rand would be ideal.

sylvaticus commented 3 years ago

Sorry for the name confusion, I don't have an object for the whole mixture, only for their components.. what I call "mixtures" are actually mixture components. I could easily implement pdf/logpdf(firesults,single_observation), as fitresults includes both the categorical priors and the mixture components, and I could either compute the combined pdf again or run the em algorithm for a single run to retrieve the mixture pdf for single_observation. I would have problems instead to sample, as this would require to invert the CDF. While this is feasible with the current mixture components, it is not required by the EM algorithm, and so it would be an extra information that would need to be provided if one would like to implement other mixture components.

I think for now it is best to stay with the cluster interpretation, that is the one I originally wrote the algorithm for. I hence removed GMMFitter from the list of interfaced models. That's said, if you'd like to use the EM algorithm as a mixture distribution fitter, please open a pull request on BetaML...

ablaom commented 3 years ago

I think for now it is best to stay with the cluster interpretation, that is the one I originally wrote the algorithm for. I hence removed GMMFitter from the list of interfaced models.

Okay, that's fine.

ablaom commented 3 years ago

There is a subtle issue with the scitypes I have just discovered (which is detected by trying to bind the models to test data in a machine).

The input_scitype for the GMMClusterer (https://github.com/sylvaticus/BetaML.jl/blob/58696d0833462d9279727214b069c6429f2e6d9b/src/Clustering_MLJ.jl#L321)is incorrect. It should be the same as the MissingImputer. So

input_scitype    = MMI.Table(Union{MMI.Continuous,MMI.Missing})

which states each column can have a element scitype <:Union{Continuous,Missing}. What is there currently says that each column is either of elscitype Continuous or Missing.

ablaom commented 3 years ago
model = BetaML.Clustering.MissingImputator()

X = [1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38;
     missing -2.3; 5.2 -2.4] |> MLJBase.table

model = BetaML.Clustering.GMMClusterer()
mach = machine(model, X) |> fit!

julia> transform(mach, X)
ERROR: X must me `nothing` in `transform(m::GMMClusterer,firResults,nothing)`. If you want the cluster predictions of new data using already learned structure use `predict(m::GMMClusterer,firResults,Xnew)`

It looks like this error is a left-over from MLJFitter. I'm assuming now that GMMClusterer only has a predict (which works as I would expect), so either don't implement transform(model::GMMClusterer, ...) at all, or implement a version with more appropriate error message, such as "GMMClusterer does not implement transform, only predict. "

ablaom commented 3 years ago

I've looked over the interfaces for the Percetron models and they look good, thanks.

You still have GMMFitter in the name space. This will need to be removed before I can add your new models to the MLJ registry.

sylvaticus commented 3 years ago

I corrected the scitype and removed the GMMFitter.

Concerning the transform function I can surely remove it, but my idea was that transform was somehow different than predict: fit(GMMClusterer,X) returns in fitresults the probabilities _pn,k of each point n to belong to cluster k, as well the data of the mixtures. Then transform(GMMClusterer,fitresults,nothing) "transform" the same data my just returning the _pn,k, while predict(GMMCluster,fitresults,Xnew) predict the clusters in the data of Xnew using the information (mixtures) learned by X, by applying a single E step of the EM algorithm with the mixtures from fitresults as starting point.

Alternatively, I could forget tranform, and just make a check in predict(GMMCluster,fitresults,Xnew): if XNew is nothing, I use the "old" _pn,k I have already learn, otherwise I compute the new ones for Xnew.

In both cases I would like to avoid that, in order to find the cluster probabilities, the user has to pass two times the X data, first in fit and then in predict...

What do you prefer ?

ablaom commented 3 years ago

Thanks for looking into this further. I'm afraid neither option looks great to me.

Then transform(GMMClusterer,fitresults,nothing) "transform" the same data my just returning the p_n,k, while predict(GMMCluster,fitresults,Xnew)

According to the API, the Xnew in predict(model, fr, Xnew) and the Xnew transform(model, fr, Xnew) satisfy the same scitype condition, as articulated by input_scitype, and this is the same scitype that applies to the training data X. Moreover, in every other clustering model, this is the case.

Also, having a transform method for one clustering model having different behaviour from all the others would be a confusing inconsistency, no?

I could forget transform, and just make a check in predict(GMMCluster,fitresults,Xnew): if XNew is nothing, I use the "old" p_n,k I have already learn, otherwise I compute the new ones for Xnew.

I'm not fond of this option either, for the same reasons.

I understand that you want to avoid the user having to re-enter X. In some sense this is already the case. Recall that the user is not using the fit and predict methods that dispatch on models; rather they interact via the machine interface. If you want to predict using the training data you can specify rows=... instead of Xnew in the machine predict, as in this workflow:

using MLJ

model = (@load KMeans pkg=Clustering add=true)()

X, _ = make_moons();

mach = machine(model, X) |> fit!
predict(mach, rows=:)

An alternative solution (that saves one line of code) would be to introduce to MLJ something like a fit_transform or fit_predict method, like you have in sk-learn. There is an open MLJ issue here. Would the existence of such a shortcut address your concerns?

sylvaticus commented 3 years ago

ok, I removed transform(m::GMMClusterer,...) and the _pn,k from fitresults (as these are no longer needed, as they are recomputed in the E-step of predict).

Please feel free to close this issue now, thank you for your patience....

(yes, I still think that a fit_transform looks more natural for clustering tasks, where the objective may be thought to understand the data rather than make predictions)

ablaom commented 3 years ago

Okay. Let me know when you tag a release and I will register all the new models and close this when that's done.

sylvaticus commented 3 years ago

tagged.. thank you again for your support... /Antonello

ablaom commented 3 years ago

Will try to get to this sometime this week.

ablaom commented 3 years ago

@sylvaticus Could you please confirm this is the full list of BetaML models you have intended to add to MLJ?

julia> models("BetaML")
11-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = DecisionTreeClassifier, package_name = BetaML, ... )    
 (name = DecisionTreeRegressor, package_name = BetaML, ... )     
 (name = GMMClusterer, package_name = BetaML, ... )              
 (name = KMeans, package_name = BetaML, ... )                    
 (name = KMedoids, package_name = BetaML, ... )                  
 (name = KernelPerceptronClassifier, package_name = BetaML, ... )
 (name = MissingImputator, package_name = BetaML, ... )          
 (name = PegasosClassifier, package_name = BetaML, ... )         
 (name = PerceptronClassifier, package_name = BetaML, ... )      
 (name = RandomForestClassifier, package_name = BetaML, ... )    
 (name = RandomForestRegressor, package_name = BetaML, ... )     
sylvaticus commented 3 years ago

Yes, I confirm. There is always the possibility to use gmm/em for fitting, but for now it should be fine.

I bit it is late now, but I have also some doubts on the name "MissingImputator". The job is exactly the same ( "matrix completion"), but with that name people may not realise it can be also used for collaborative filtering/recommendation tasks..

ablaom commented 3 years ago

I bit it is late now, but I have also some doubts on the name "MissingImputator". The job is exactly the same ( "matrix completion"), but with that name people may not realise it can be also used for collaborative filtering/recommendation tasks.

Yes this is always tricky. Actually, having two models with different names that happen to do the same thing is not a bad resolution to this problem (so two structs but common methods). For now, I'm going to proceed with the update as is.

ablaom commented 3 years ago

https://github.com/alan-turing-institute/MLJModels.jl/issues/263#issuecomment-823582208