Closed sylvaticus closed 3 years ago
@sylvaticus Thanks for this extra work. I have not had a chance to review the interfaces but will already respond to your nice outline:
Hello, I have interfaced to MLJ the following BetaML models:
KMeans, KMedoids, GMM, MissingImputator
. The MLJ documentation is a bit sparse for unsupervised models, compared to supervised ones, so let me check I done it correctly.KMeans and KMedoids:
- implemented as
<: MMI.Unsupervised
fit
trains the model (compute the centroids/assign the elements) --> fitResultstransform
returns a (nObservation, nClasses) matrix of distances from class centrespredict
returns aCategoricalArray
of the predicted classes given new data (the centroids don't move further)
This is correct. I'm surmising that you found the relevant docs that exist, but FYI there is some here and here and existing implementations here and here
GMM:
- still implemented as
<: MMI.Unsupervised
(there is no distinction in unsupervised models between deterministic and probabilistic)fit
trains the model (estimate the mixtures/assign the probabilities) --> fitResultspredict
returns a vector of UniVariateFinite distributionstransform
is just an alias of predict (not sure what else should it do)
So far we have not implemented any "distribution-fitters" but the design was discussed and the decision made to implement them in different way, namely as supervised models (subtype Probabilistic
) where the data to be fit is y
and X
is nothing
. This actually quite natural, from a perfomance evaluation point-of-view, once you wrap your head around the apparent perversion.
There is a sample implementation linked from this section of the docs (recently tweaked). As this will be our first such model, and the API is marked experimental, there could be some teething issues, but I don't anticipate major issues.
It will be great to finally have GMM!
MissingImputator
- implemented as
<: MMI.Static
transform
return a copy of the X with the missing value imputedPlease let me know if this set-up makes sense for MLJ. The full interface for the new models is here.
I always imagined imputers should learn from the data to prevent data leakage. So, if imputing with mean value, then you learn the mean from the training data and use that for imputing the validation data. That is how the existing MLJ imputer is designed. This kind of imputer cannot be Static
.
If however, the BetaML imputer is genuinely static, in the sense that it transforms data without ever seeing training data, then it should indeed be Static
.
Hope this helps. When you're ready, let me know and I will try to find time to review the interfaces more carefully.
So far we have not implemented any "distribution-fitters" but the design was discussed and the decision made to implement them in different way, namely as supervised models (subtype Probabilistic) where the data to be fit is y and X is nothing. This actually quite natural, from a perfomance evaluation point-of-view, once you wrap your head around the apparent perversion.
There is a sample implementation linked from this section of the docs (recently tweaked). As this will be our first such model, and the API is marked experimental, there could be some teething issues, but I don't anticipate major issues.
I am trying to implement GMM as a <: MMI.Probabilistic
model like you described it, but I have the problem that when I try to create the machine it complains that the number of observations on X(nothing) and Y are not the same:
julia> X,y = Mlj.@load_iris
julia> modelMachine = Mlj.machine(model, nothing, X)
ERROR: DimensionMismatch("Differing number of observations in input and target. ")
Stacktrace:
[1] check(::GMM{DiagonalGaussian{Float64}}, ::MLJBase.Source, ::Vararg{MLJBase.Source, N} where N; full::Bool)
@ MLJBase ~/.julia/packages/MLJBase/4DmTL/src/machines.jl:108
[2] #machine#87
@ ~/.julia/packages/MLJBase/4DmTL/src/machines.jl:249 [inlined]
[3] machine(model::GMM{DiagonalGaussian{Float64}}, raw_arg1::Nothing, raw_args::NamedTuple{(:sepal_length, :sepal_width, :petal_length, :petal_width), NTuple{4, Vector{Float64}}})
@ MLJBase ~/.julia/packages/MLJBase/4DmTL/src/machines.jl:248
[4] top-level scope
@ none:1", "setosa", "setosa", "setosa", "setosa" … "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica"])
Let me comment that I think there are several way to evaluate the output of the cluster. In particular, if labels are actually available (but not used in the fitting) the way I implemented accuracy()
in BetaML is to loop over all the permutations that maintain the structure (as the clustering algorithm has no clue on the actual name of the classes) and retain the one for which the accuracy is maximal.
Concerning the MissingInputator, it uses GMM in the background, so missing values are computed as the expected values of the mixtures weighted by the probability of the specific record to be part of such mixture. In my workflow I was thinking to something like:
transform(MissingInputator, Xtrain_with_missing) --> Xtrain_full
transform(MissingInputator, Xtest_with_missing) --> Xtest_full
While I guess you are thinking to:
fit(MissingImputator, Xtrain_with_missing) --> fitresults
predict(MissingImputator,fitresults,Xtrain_with_missing) -> Xtrain_full
predict(MissingImputator,fitresults,Xtest_with_missing) -> Xtest_full
Right ?
How should then implement MissingInputator ? As :< MMI.Deterministic
model ?
It will be great to finally have GMM!
Great, but please consider that the model is not computationally efficient (but for a few thousand of data it works fine). However it is very flexible, as it can use any mixture (gaussian ones being the only one implemented) and, contrary to scikit-learn gmm model, it accepts X matrices with missing data (hence I can use it for missing imputation ...).
EDIT: Concerning the MissingInputator, I think I did manage to support both the strategies:
transform(m::MissingInputator, X_with_missing)
I obtain a X_full
without learning from previous datafit(m::MissingInputator, X_with_or_without_missing)
and then predict(m::MissingInputator, fitresults, X2_with_missing)
I obtain a X2_full
using the learning information from XI am trying to implement GMM as a <: MMI.Probabilistic model like you described it, but I have the problem that when I try to create the machine it complains that the number of observations on X(nothing) and Y are not the same:
You need the latest version of MLJBase where I relaxed this check for just this case.
Will look over your other comments Monday.
Great, thanks...
Concerning the MissingInputator, it uses GMM in the background, so missing values are computed as the expected values of the mixtures weighted by the probability of the specific record to be part of such mixture. In my workflow I was thinking to something like:
transform(MissingInputator, Xtrain_with_missing) --> Xtrain_full transform(MissingInputator, Xtest_with_missing) --> Xtest_full
While I guess you are thinking to:
fit(MissingImputator, Xtrain_with_missing) --> fitresults predict(MissingImputator,fitresults,Xtrain_with_missing) -> Xtrain_full predict(MissingImputator,fitresults,Xtest_with_missing) -> Xtest_full
Right ?
Well, almost. Rather:
fit(MissingImputator, Xtrain_with_missing) --> fitresults
transform(MissingImputator,fitresults,Xtrain_with_missing) -> Xtrain_full
transform(MissingImputator,fitresults,Xtest_with_missing) -> Xtest_full
I am not suggesting that MissingImputor
be implemented as a Supervised
model. The choices I am considering are:
plain Unsupervised
, if one separates training from transformation (as in the above workflow); then the signatures must look like MMI.fit(model, verbosity, Xtrain)
and MMI.transform(model, fitresult, Xnew)
.
Static
(a subtype of Unsupervised
), if there is no training of consequence. Then fit
is never implemented (it has a no-op fallback) and we have signature transform(model, fitresult, X...)
. I don't think this applies here.
Concerning the MissingInputator, I think I did manage to support both the strategies:
- with
transform(m::MissingInputator, X_with_missing)
I obtain aX_full
without learning from previous data- with
fit(m::MissingInputator, X_with_or_without_missing)
and thenpredict(m::MissingInputator, fitresults, X2_with_missing)
I obtain aX2_full
using the learning information from X
This is not compatible with the API. Your MMI.transform
method must always include a fitresult
in the second slot.
The workflow you are after, I'm guessing, is a "fit and transform in one hit" option, like sk-learn's fit_transform
method. We don't have that in the current API, although there's the possibility to at least offer this on the user side as a shortcut (https://github.com/alan-turing-institute/MLJ.jl/issues/546). Rather, if you want to fit and transform on the same data, then you just have to do it in steps. So, assuming you implement the imputation as a Unsupervised
model, the user workflow to fit and transform on the same data (ie, to reproduce your first workflow on the user side): is
mach = machine(MissingImputator(), Xtrain) |> fit!
Xtrain_full = transform(mach, Xtrain)
mach = machine(MissingImputator(), Xtest) |> fit!
Xtest_full = transform(mach, Xtest)
Whereas, a non-leakage workflow would replace the second block with
Xtest_full = transform(mach, Xtest)
(Or write the whole thing as
mach = machine(MissingImputator(), X)
fit!(mach, rows=train)
Xfull = transform(mach, X)
...
)
Does that address your question?
Yes, thanks I think I got it this time. I implemented MissingImputator
as a unsupervised model as you suggested:
fit(m::MissingImputator, verbosity, X) --> (fitResults, cache, report)
transform(m::MissingImputator, fitResults, X) --> table(X̂_full)
I test them with:
julia> import MLJBase
julia> const Mlj = MLJBase
MLJBase
julia> using BetaML
julia> X = [1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38; missing -2.3; 5.2 -2.4]
9×2 Matrix{Union{Missing, Float64}}:
1.0 10.5
1.5 missing
1.8 8.0
1.7 15.0
3.2 40.0
missing missing
3.3 38.0
missing -2.3
5.2 -2.4
julia> X = Mlj.table(X)
Tables.MatrixTable{Matrix{Union{Missing, Float64}}}: (x1 = Union{Missing, Float64}[1.0, 1.5, 1.8, 1.7, 3.2, missing, 3.3, missing, 5.2], x2 = Union{Missing, Float64}[10.5, missing, 8.0, 15.0, 40.0, missing, 38.0, -2.3, -2.4])
julia> model = MissingImputator()
MissingImputator(
K = 3,
p₀ = nothing,
mixtures = DiagonalGaussian{Float64}[DiagonalGaussian{Float64}(nothing, nothing), DiagonalGaussian{Float64}(nothing, nothing), DiagonalGaussian{Float64}(nothing, nothing)],
tol = 1.0000000000000004e-6,
minVariance = 0.05,
minCovariance = 0.0,
initStrategy = "kmeans",
rng = Random._GLOBAL_RNG()) @761
julia> modelMachine = Mlj.machine(model,X)
Machine{MissingImputator{DiagonalGaussian{Float64}},…} @335 trained 0 times; caches data
args:
1: Source @274 ⏎ `ScientificTypes.Table{AbstractVector{Union{Missing, ScientificTypes.Continuous}}}`
julia> (fitResults, cache, report) = Mlj.fit(model, 0, X)
((pₖ = [0.49997459736828426; 0.25001270131585945; 0.25001270131585634], mixtures = DiagonalGaussian{Float64}[DiagonalGaussian{Float64}([1.5, 11.166666666666666], [0.05, 0.05]), DiagonalGaussian{Float64}([3.2499999999999782, 39.0], [0.05, 0.05]), DiagonalGaussian{Float64}([5.199999999999994, -2.3499999999999996], [0.05, 0.05])]), nothing, ([2.886751345948105, 0.1814436846505966, 0.020160409405627352, 0.0022400454895143513], -275.7794472196741, 582.3200385220554, 579.5588944393483))
julia> XD = Mlj.transform(model,fitResults,X)
Tables.MatrixTable{Matrix{Float64}}: (x1 = [1.0, 1.5, 1.8, 1.7, 3.2, 2.8625692221714156, 3.3, 5.199999999999994, 5.2], x2 = [10.5, 11.166666666667362, 8.0, 15.0, 40.0, 14.746015173838764, 38.0, -2.3, -2.4])
julia> XDM = Mlj.matrix(XD)
9×2 Matrix{Float64}:
1.0 10.5
1.5 11.1667
1.8 8.0
1.7 15.0
3.2 40.0
2.86257 14.746
3.3 38.0
5.2 -2.3
5.2 -2.4
julia> # Use the previously learned structure to imput missings..
Xnew_withMissing = Mlj.table([1.5 missing; missing 38; missing -2.3; 5.1 -2.3])
Tables.MatrixTable{Matrix{Union{Missing, Float64}}}: (x1 = Union{Missing, Float64}[1.5, missing, missing, 5.1], x2 = Union{Missing, Float64}[missing, 38.0, -2.3, -2.3])
julia> XDNew = Mlj.transform(model,fitResults,Xnew_withMissing)
Tables.MatrixTable{Matrix{Float64}}: (x1 = [1.5, 3.2499999999999782, 5.199999999999994, 5.1], x2 = [11.166666666667362, 38.0, -2.3, -2.3])
julia> XDMNew = Mlj.matrix(XDNew)
4×2 Matrix{Float64}:
1.5 11.1667
3.25 38.0
5.2 -2.3
5.1 -2.3
Could you please check and validate the clusters MLJ interface and the Perceptron one ?
Thanks for the work on these interfaces. I've just had a look at the clustering ones, but not the perceptron.
The metadata looks good, except output_scitype
is meant to refer to the scitype
of the output of the transform
method, which here is
Table(Continuous)
. (You can specify a target_scitype
for an
unsupervised model, which falls back to Unknown
and is not really used anywhere that I know of. Still it wouldn't hurt
if you want to declare it to be AbstractVector{<:MultiClass}
, a
supertype of the scitype of the output of predict
.)
However, I notice some strange behaviour in the KMeans model noted below:
using Random
Random.seed!(123)
X, _ = MLJBase.make_regression(20, 8);
mach = MLJBase.machine(KMeans(), X) |> MLJBase.fit!;
# It appears we have three clusters:
julia> MLJBase.schema(MLJBase.transform(mach, X))
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1 │ Float64 │ Continuous │
│ x2 │ Float64 │ Continuous │
│ x3 │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 20
# Bit surprised that all training data is assigned a single class:
yhat = MLJBase.predict(mach, X)
20-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
# And I would say the following is incorrect, as the predicted `yhat` should have
# three levels in its pool, not just the ones that are manifest for
# the particular input `X` in definition of `yhat`:
julia> MLJBase.classes(yhat)
1-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
2
For comparison, here is the behaviour of the Clustering.jl model:
using MLJ
KMeans = @load KMeans pkg=Clustering
model = KMeans()
using Random
Random.seed!(123)
X, _ = make_regression(20, 8);
mach = machine(KMeans(), X) |> fit!;
schema(transform(mach, X))
# again, three clusters:
julia> schema(transform(mach, X))
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1 │ Float64 │ Continuous │
│ x2 │ Float64 │ Continuous │
│ x3 │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 20
yhat = predict(mach, X)
20-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
3
1
1
3
1
3
3
1
2
3
3
1
3
3
1
3
1
1
1
# All clusters are tracked, even if not explicitly manifest:
julia> ysmall = predict(mach, selectrows(X, 1:2))
2-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
3
julia> classes(ysmall)
3-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
1
2
3
Finally, can I suggest you overload MMI.fitted_params
to make them more user-friendly. So, the fallback is giving:
MLMBase.fitted_params(mach)
(fitresult = ([3, 3, 2, 3, 1, 1, 3, 1, 2, 2 … 2, 3, 3, 3, 3, 3, 2, 3, 1, 1], [-0.09844314369373444 -0.9399263119056788; -0.994364450422251 0.5485269272699657; 1.0271438614771895 0.7414183922931447], BetaML.Clustering.var"#39#41"()),)
Maybe something like
MMI.fitted_params(model::KMeans, fitresult) = (centers=fitesult[2], cluster_labels=CategoricalArrays.categorical(fitresults[1]))
The metadata looks good. However, I get an error when I try to use this as I expected it to work (cf the example referred to in docs, which is here)
using MLJBase
import BetaML
GMM = BetaML.Clustering.GMM
y, _ = make_regression(1000, 3, rng=123);
julia> schema(y)
┌─────────┬─────────┬────────────┐
│ _.names │ _.types │ _.scitypes │
├─────────┼─────────┼────────────┤
│ x1 │ Float64 │ Continuous │
│ x2 │ Float64 │ Continuous │
│ x3 │ Float64 │ Continuous │
└─────────┴─────────┴────────────┘
_.nrows = 1000
mach = machine(GMM(), nothing, y) |> fit!
# expecting a multivariate distrubution here:
julia> d = predict(mach, nothing)
ERROR: ArgumentError: Function `matrix` only supports AbstractMatrix or containers implementing the Tables interface.
Stacktrace:
[1] matrix(::MLJModelInterface.FullInterface, ::Val{:other}, ::Nothing; kw::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:32
[2] matrix(::MLJModelInterface.FullInterface, ::Val{:other}, ::Nothing) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:32
[3] matrix(::Nothing; kw::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:27
[4] matrix(::Nothing) at /Users/anthony/.julia/packages/MLJModelInterface/tegnW/src/data_utils.jl:27
[5] predict(::BetaML.Clustering.GMM{BetaML.Clustering.DiagonalGaussian{Float64}}, ::NamedTuple{(:pₖ, :mixtures),Tuple{Array{Float64,2},Array{BetaML.Clustering.DiagonalGaussian{Float64},1}}}, ::Nothing) at /Users/anthony/.julia/packages/BetaML/1uT9C/src/Clustering_MLJ.jl:149
[6] predict(::Machine{BetaML.Clustering.GMM{BetaML.Clustering.DiagonalGaussian{Float64}},true}, ::Nothing) at /Users/anthony/.julia/packages/MLJBase/pCCd7/src/operations.jl:83
[7] top-level scope at REPL[16]:1
Thanks, I'll have a look on this...
Metadata of KMedoids and KMeans
input_scitype = MMI.Table(MMI.Continuous), # scitype of the inputs
output_scitype = MMI.Table(MMI.Continuous), # scitype of the output of `transform`
target_scitype = AbstractArray{<:MMI.Multiclass}, # scitype of the output of `predict`
KMeans homogeneous predictions
initStrategy
to "random" (it was due to an unfortunate "grid" init strategy that for random variables and 3 classes it was selecting and accumulating points toward the central class)Missing factors in KMeans and KMedoids output
return CategoricalArray(assignedClasses,levels=1:nCl)
More user-friendly fitted parameters for KMEans/KMedoids
Error in predict(GMM)
It seems there could be a problem in the MLJ design/architecture (still a bit voodoo for me...)
This works:
using MLJBase, BetaML
y, _ = make_regression(1000, 3, rng=123);
ym = MLJBase.matrix(y)
model = GMM(rng=copy(BetaML.FIXEDRNG))
(fitResults, cache, report) = MLJBase.fit(model, 0, nothing, y)
yhat_prob = MLJBase.transform(model,fitResults,y) # ok
yhat_prob = MLJBase.predict(model, fitResults, y) # ok
However if we run instead:
modelMachine = MLJBase.machine(model, nothing, y)
mach = MLJBase.fit!(modelMachine)
yhat_prob = MLJBase.predict(mach, nothing)
we have the error that you reported.
This is due to the fact that in yhat_prob = MLJBase.predict(model, fitResults, y)
the third parameter, the "X" in MMI.predict(m::GMM, fitResults, X)
is a Tables.MatrixTable{Matrix{Float64}}
, while in MLJBase.predict(mach, nothing)
it is nothing
, and then of course MMI.matrix(X)
complains about it.
How can I construct the interface so that the "version" of predict
that uses the "machine" understand that the X is embedded in the machine ?
Thanks for these new changes. I'll look over these and Perceptron shortly.
It seems there could be a problem in the MLJ design/architecture (still a bit voodoo for me...)
Nah. I think you just need to delete this irrelevant line:
X
is alwaysnothing
and you make no use of it in the rest of the method.
Arghh... yes, sure... sorry...
EDIT
Actually, no I was using it 4 lines later, in the gmm(x;...)
function.
My original understood was that fit
, for a GMM model, was responsible to compute the mixtures. The posterior probabilities (probabilistic class assignments) were not even included in fitresults
.
predict
than would have used the information of the mixtures provided by fit
to compute the posterior probabilities (by running a single E-step of the EM algorithm) of X, either the "original" X or a new matrix. The idea was that this could have allowed to estimate the class probabilities for a new dataset, perhaps a small one, assuming the observations were identically distributed of the original X, i.e. without running the EM algorithm on this second set.
Any how, I have now changed it, fitresults
returned by fit
includes the class probabilities and predict
returns the class probabilities of the original X in the required format. I think this is what you originally meant....
No, what I had in mind was that predict
return a single multivariate mixture distribution, d
. Then, for example, one could call pdf(d, y)
on a single new observation y
(you probably want me to call this x
) to get a single probability. A technical issue I have overlooked is that y
is going to be the row of a table, but the mixture distributions in Distribution.jl have ordinary Euclidean (Vector{Float64}
) support.
In any case, I can see there is some tension here between two competing conceptualisations of GMM. We want either:
to view GMM as a "distribution fitter", in which case I maintain the integration I have suggested is the appropriate one (modulo technical difficulties we should sort out anyway); or
to view GMM as a "probabilistic" clustering algorithm, ie, a clusterer that predicts a categorical distribution instead of a single class. I see that this is how sk-learn conceptualises GMM, for example.
In either case, interface points can be introduced to get the "secondary" objects, but those interface points will be less discoverable to the user. So we should probably choose according to the greater use-case. On these grounds, my push for case 1 may have been mistaken, and I suspect you agree, yes?
If we revert to case 2, then GMM becomes Unsupervised
and predict(::GMM, fitresult, Xnew)
outputs a vector of UnivariateFinite
element type for new data. It would be good to overload fitted_params
to return a named tuple which includes the keys mixtures
and weights
(or whatever you want to call them), and perhaps even the Distribution.jl object I discuss above (with the Euclidean support).
Thanks for you patience. What are your thoughts?
Please also see this side issue: https://github.com/sylvaticus/BetaML.jl/issues/20
A quick reply: building the interface is a few lines of code. We can have two separate MLJ models that wrap BetaML.gmm for the two interpretations. We have to take care of the names however not to generate confusion. I'll have a look...
Yes, that's a very good idea, thanks. Maybe GMMFitter
and GMMClusterer
??
A technical issue I have overlooked is that y is going to be the row of a table, but the mixture distributions in Distribution.jl have ordinary Euclidean (Vector{Float64}) support.
In the "distribution" fitting case we could make the following pragmatic choice: the training "target" y
(in training data = (nothing, y)
) is be a matrix instead of a table. That is target_scitype(...) = AbstractMatrix{Continuous}
. Then we can use Distributions.jl to output an object d
with a consistent representation. That is, y
is matrix whose rows are observations, and the call pdf(d, selectrows(y, i))=pdf(d, y[i, :])
now makes sense.
Hello, I did implement GMMClusterer
, but I have troubles to see for GMMFitter
how to return in predict
the Distribution object, as the mixturer I use aren't Distributions.jl object, but rather are custom types defined in BetaML.
Does your object implement any of the Distributions.jl API, such as pdf
?
They implement only what is needed for the EM algo, that is the log pdf (lpdf
):
{Spherical|Diagonal|Full}Gaussian mixtures for gmm
/ predictMissing
are already provided. User defined mixtures can be used defining a struct as subtype of Mixture
and implementing for that mixture the following functions:
initMixtures!(mixtures, X; minVariance, minCovariance, initStrategy)
lpdf(m,x,mask)
(for the e-step)updateParameters!(mixtures, X, pₙₖ; minVariance, minCovariance)
(the m-step)What about converting this to a Distributions.MixtureModel
? You already have Distributions
as a dependency, right?
Does Distributions
not already provide the atomic distributions you need?
If I remember correctly the "problem" was mosty with missing data.
I'll check better tomorrow the mixture model of Distributions
.
Second thoughts, I think it's just fine, maybe even better, if you implement the Distributions.jl interface for your object (the complete component distributions + weights object). The minimum would be
Distribution.logpdf(::YourDist, single_observation) Distribution.pdf(::YourDist, single_observation)
Here single_observation
is a going to be a float vector or single row of a table, depending on your choice for target_scitype
, as discussed earlier.
I think also StatsBase.params(::YourDist)
is normally overloaded, for returning parameters.
And implementing Base.rand
would be ideal.
Sorry for the name confusion, I don't have an object for the whole mixture, only for their components.. what I call "mixtures" are actually mixture components.
I could easily implement pdf/logpdf(firesults,single_observation), as fitresults
includes both the categorical priors and the mixture components, and I could either compute the combined pdf again or run the em algorithm for a single run to retrieve the mixture pdf for single_observation
.
I would have problems instead to sample, as this would require to invert the CDF. While this is feasible with the current mixture components, it is not required by the EM algorithm, and so it would be an extra information that would need to be provided if one would like to implement other mixture components.
I think for now it is best to stay with the cluster interpretation, that is the one I originally wrote the algorithm for. I hence removed GMMFitter
from the list of interfaced models.
That's said, if you'd like to use the EM algorithm as a mixture distribution fitter, please open a pull request on BetaML...
I think for now it is best to stay with the cluster interpretation, that is the one I originally wrote the algorithm for. I hence removed GMMFitter from the list of interfaced models.
Okay, that's fine.
There is a subtle issue with the scitypes I have just discovered (which is detected by trying to bind the models to test data in a machine).
The input_scitype
for the GMMClusterer
(https://github.com/sylvaticus/BetaML.jl/blob/58696d0833462d9279727214b069c6429f2e6d9b/src/Clustering_MLJ.jl#L321)is incorrect. It should be the same as the MissingImputer
. So
input_scitype = MMI.Table(Union{MMI.Continuous,MMI.Missing})
which states each column can have a element scitype <:Union{Continuous,Missing}
. What is there currently says that each column is either of elscitype Continuous
or Missing
.
model = BetaML.Clustering.MissingImputator()
X = [1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38;
missing -2.3; 5.2 -2.4] |> MLJBase.table
model = BetaML.Clustering.GMMClusterer()
mach = machine(model, X) |> fit!
julia> transform(mach, X)
ERROR: X must me `nothing` in `transform(m::GMMClusterer,firResults,nothing)`. If you want the cluster predictions of new data using already learned structure use `predict(m::GMMClusterer,firResults,Xnew)`
It looks like this error is a left-over from MLJFitter
. I'm assuming now that GMMClusterer
only has a predict
(which works as I would expect), so either don't implement transform(model::GMMClusterer, ...)
at all, or implement a version with more appropriate error message, such as "GMMClusterer
does not implement transform
, only predict
. "
I've looked over the interfaces for the Percetron models and they look good, thanks.
You still have GMMFitter
in the name space. This will need to be removed before I can add your new models to the MLJ registry.
I corrected the scitype and removed the GMMFitter
.
Concerning the transform
function I can surely remove it, but my idea was that transform
was somehow different than predict
:
fit(GMMClusterer,X)
returns in fitresults
the probabilities _pn,k of each point n to belong to cluster k, as well the data of the mixtures.
Then transform(GMMClusterer,fitresults,nothing)
"transform" the same data my just returning the _pn,k, while predict(GMMCluster,fitresults,Xnew)
predict the clusters in the data of Xnew
using the information (mixtures) learned by X
, by applying a single E step of the EM algorithm with the mixtures from fitresults
as starting point.
Alternatively, I could forget tranform
, and just make a check in predict(GMMCluster,fitresults,Xnew)
: if XNew
is nothing
, I use the "old" _pn,k I have already learn, otherwise I compute the new ones for Xnew
.
In both cases I would like to avoid that, in order to find the cluster probabilities, the user has to pass two times the X data, first in fit
and then in predict
...
What do you prefer ?
Thanks for looking into this further. I'm afraid neither option looks great to me.
Then transform(GMMClusterer,fitresults,nothing) "transform" the same data my just returning the p_n,k, while predict(GMMCluster,fitresults,Xnew)
According to the API, the Xnew
in predict(model, fr, Xnew)
and the
Xnew
transform(model, fr, Xnew)
satisfy the same scitype
condition, as articulated by input_scitype
, and this is the same
scitype
that applies to the training data X
. Moreover, in every other
clustering model, this is the case.
Also, having a transform
method for one clustering model having
different behaviour from all the others would be a confusing
inconsistency, no?
I could forget transform, and just make a check in predict(GMMCluster,fitresults,Xnew): if XNew is nothing, I use the "old" p_n,k I have already learn, otherwise I compute the new ones for Xnew.
I'm not fond of this option either, for the same reasons.
I understand that you want to avoid the user having to re-enter
X
. In some sense this is already the case. Recall that the user is
not using the fit
and predict
methods that dispatch on models;
rather they interact via the machine interface. If you want to
predict
using the training data you can specify rows=...
instead of Xnew
in the machine predict
, as in this workflow:
using MLJ
model = (@load KMeans pkg=Clustering add=true)()
X, _ = make_moons();
mach = machine(model, X) |> fit!
predict(mach, rows=:)
An alternative solution (that saves one line of code) would be to
introduce to MLJ something like a fit_transform
or fit_predict
method, like you have in sk-learn. There is an open MLJ issue
here. Would
the existence of such a shortcut address your concerns?
ok, I removed transform(m::GMMClusterer,...)
and the _pn,k from fitresults (as these are no longer needed, as they are recomputed in the E-step of predict
).
Please feel free to close this issue now, thank you for your patience....
(yes, I still think that a fit_transform
looks more natural for clustering tasks, where the objective may be thought to understand the data rather than make predictions)
Okay. Let me know when you tag a release and I will register all the new models and close this when that's done.
tagged.. thank you again for your support... /Antonello
Will try to get to this sometime this week.
@sylvaticus Could you please confirm this is the full list of BetaML models you have intended to add to MLJ?
julia> models("BetaML")
11-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
(name = DecisionTreeClassifier, package_name = BetaML, ... )
(name = DecisionTreeRegressor, package_name = BetaML, ... )
(name = GMMClusterer, package_name = BetaML, ... )
(name = KMeans, package_name = BetaML, ... )
(name = KMedoids, package_name = BetaML, ... )
(name = KernelPerceptronClassifier, package_name = BetaML, ... )
(name = MissingImputator, package_name = BetaML, ... )
(name = PegasosClassifier, package_name = BetaML, ... )
(name = PerceptronClassifier, package_name = BetaML, ... )
(name = RandomForestClassifier, package_name = BetaML, ... )
(name = RandomForestRegressor, package_name = BetaML, ... )
Yes, I confirm. There is always the possibility to use gmm/em for fitting, but for now it should be fine.
I bit it is late now, but I have also some doubts on the name "MissingImputator". The job is exactly the same ( "matrix completion"), but with that name people may not realise it can be also used for collaborative filtering/recommendation tasks..
I bit it is late now, but I have also some doubts on the name "MissingImputator". The job is exactly the same ( "matrix completion"), but with that name people may not realise it can be also used for collaborative filtering/recommendation tasks.
Yes this is always tricky. Actually, having two models with different names that happen to do the same thing is not a bad resolution to this problem (so two structs but common methods). For now, I'm going to proceed with the update as is.
Hello, I have interfaced to MLJ the following BetaML models:
KMeans, KMedoids, GMM, MissingImputator
. The MLJ documentation is a bit sparse for unsupervised models, compared to supervised ones, so let me check I done it correctly.KMeans and KMedoids:
<: MMI.Unsupervised
fit
trains the model (compute the centroids/assign the elements) --> fitResultstransform
returns a (nObservation, nClasses) matrix of distances from class centrespredict
returns aCategoricalArray
of the predicted classes given new data (the centroids don't move further)GMM:
<: MMI.Unsupervised
(there is no distinction in unsupervised models between deterministic and probabilistic)fit
trains the model (estimate the mixtures/assign the probabilities) --> fitResultspredict
returns a vector of UniVariateFinite distributionstransform
is just an alias of predict (not sure what else should it do)MissingImputator
<: MMI.Static
transform
return a copy of the X with the missing value imputedPlease let me know if this set-up makes sense for MLJ. The full interface for the new models is here.