JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Register new interface package #825

Open willtebbutt opened 2 years ago

willtebbutt commented 2 years ago

An interface package between MLJ and the AbstractGPs ecosystem is in the process of being registered in the general registry, so I would like to also register the package with MLJ: https://github.com/willtebbutt/MLJAbstractGPsGlue.jl

The README of MLJModels.jl directed me to open an issue here. What steps will I need to take following the registration of MLJAbstractGPsGlue.jl in the general registry to complete registration with MLJ?

ablaom commented 2 years ago

@willtebbutt Thanks for taking the time to wrap your head around the MLJ API and to write this interface. I realize this is a bit of work. It will be great to finally have some GP models.

The main issue I see at present is that predict is not currently delivering the right kind of object, and no other method provided is doing so. The current output of predict is what predict_joint should return, namely a single joint distribution object representing p(y1, y2, ..., yn | x1, x2, ..., xn), yes? (Here n is the number of observations, i.e. rows, in the input Xnew.) Specifically, the (experimental) API says this should have type Distributions.Sampleable{<:Multivariate,V}, where scitype(V) = target_scitype(SomeModel); is this (approximately) true?

On the other hand, the predict method, which is compulsory, must return a vector of univariate distributions - one per input observation - rather than a single multivariate one. That is, we want [p(y | x1), p(y | x2), ..., p(y | xn)], which cannot be deduced from the previous distribution in general. This is what every other probabilistic model in MLJ currently outputs for predict.
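
To make the shape difference concrete, here's a minimal sketch of the two kinds of return value, using Distributions.jl directly (the numbers are purely illustrative):

using Distributions, LinearAlgebra

n = 3   # number of observations (rows) in Xnew

# what predict_joint returns: a single joint distribution over all n targets
joint = MvNormal(zeros(n), Matrix(1.0I, n, n))

# what predict must return: a length-n vector of univariate distributions
pointwise = [Normal(0.0, 1.0) for _ in 1:n]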

Some of the discussion regarding SossMLJ.jl, which is attempting something similar, may be useful. See, in particular, this comment.

cc @DilumAluthge

Other minor comments (quoted in the reply below):

willtebbutt commented 2 years ago

Thanks for the detailed feedback, @ablaom.

I've got a PR open here addressing your feedback: https://github.com/willtebbutt/MLJAbstractGPsGlue.jl/pull/2

> On the other hand, the predict method, which is compulsory, must return a vector of univariate distributions - one per input observation - rather than a single multivariate one. That is, we want [p(y | x1), p(y | x2), ..., p(y | xn)], which cannot be deduced from the previous distribution in general. This is what every other probabilistic model in MLJ currently outputs for predict.

Ah, I had misunderstood the conclusions here. Have now changed this.

> One needs to declare the input_scitype; currently your wrapper seems to limit X to being a matrix, but I don't see why you couldn't make this a table and materialise the matrix internally using MLJModelInterface.matrix(X), with keyword transpose=true if necessary. This is the typical case for MLJ models like this one. In that case the input_scitype would be STB.Table(STB.Continuous), where STB = ScientificTypesBase. You could allow both matrix and table, in which case use a union type. Happy to provide further help here if needed.

I see. I've changed the comment to reflect this (I think I was actually doing this anyway). I'm a little uncertain as to exactly what the input scitype should be now though -- I'm happy to accept anything that the MLJModelInterface.matrix function works with, but it doesn't necessarily have to be continuous, so I'd rather not impose that restriction. What would you suggest doing here?

> The package name at this line should be the core package MLJAbstractGPs, not the glue package. Ditto the package url, but the current load_path is correct. The package identifiers are really a way of pointing users to documentation, and a convenient handle for models when multiple packages provide the same model.

I think I've addressed this.

> Probably don't need MLJ in testing - MLJBase should suffice.

Excellent -- I've removed it.

> Overloading predict_mean is good if it improves performance, but it should have the appropriate interpretation, i.e. it should be the element-wise mean of each distribution returned by predict, and not the mean of predict_joint, which I think is different, no?

Ah, I think I'm actually doing the correct thing here, based on your description.
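
For reference, the interpretation being described amounts to the following identity at machine level (with mach a fitted machine wrapping the model and Xnew any compatible input; this is a property to test against, not package code):

using MLJBase, Statistics

@assert predict_mean(mach, Xnew) ≈ mean.(predict(mach, Xnew))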

> The fact that the hyper-parameters in your MLJ-facing model struct are immutable will mean limited buy-in for hyper-parameter optimization. For example, if you wanted to optimise model.initial_parameters.noise_variance.unconstrained_value, you couldn't do this directly, because resetting model.initial_parameters.noise_variance.unconstrained_value = 0.001 will throw an error. (Note the model structs are cloned before such mutations are attempted.) Perhaps important parameters for tuning could be exposed directly, which would be more convenient for users?

I think this is probably the most tricky comment to address. Do the hyperparameter optimisers in MLJ know how to work with a field which contains an unconstrained vector of real numbers? More generally, does MLJ provide the tools to test that I've satisfied the assumptions made by hyperparameter optimisation algorithms anywhere?

On a more general note: does MLJ provide functionality that model-implementers can hook into to determine whether they've correctly implemented the interface that MLJ requires? Perhaps I've just missed that kind of functionality? For example, we provide API tests here in AbstractGPs.jl, so that new implementers of AbstractGPs can check that they've satisfied the assumptions that the API requires.

willtebbutt commented 2 years ago

Another more general question: what's MLJ's policy regarding publishing updates to models? Is it easy for model-implementers to publish updates to their offering in MLJ?

I ask because we're anticipating needing to experiment quite a lot initially as we're not quite sure what options we want to serve up to users -- the current implementation is definitely just a first-pass.

ablaom commented 2 years ago

Thanks indeed for looking into this further.

> I'm a little uncertain as to exactly what the input scitype should be now though -- I'm happy to accept anything that the MLJModelInterface.matrix function works with, but it doesn't necessarily have to be continuous, so I'd rather not impose that restriction. What would you suggest doing here?

Well, the MLJ philosophy is different scitype => different model. So the ideal solution would be to create different model structs for each input_scitype one intends to support, with the user-selected GP being restricted appropriately. I don't know how doable that is for your "generic" API.

You could instead leave the type as Unknown, which has the consequence that certain type checks are dodged. But this will also make your models less discoverable (if we register them). Users match models to data by searching the meta-data associated with models in the registry. Your model will likely never appear in the results of such a search. At the other extreme, you could declare the input_scitype to be ScientificTypes.Table (any tabular data), but then users will expect the model to work with any table at all, for all hyper-parameters you provide. This might be tenable if you implement a clean! method returning suitably informative error messages.
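
For concreteness, the trait declaration under discussion looks something like this, where SomeGPModel is a hypothetical stand-in for the wrapper struct:

import MLJModelInterface as MMI
import ScientificTypesBase as STB

mutable struct SomeGPModel <: MMI.Probabilistic end

# restrict to continuous tabular input:
MMI.input_scitype(::Type{<:SomeGPModel}) = STB.Table(STB.Continuous)

# or dodge the type checks entirely, at the cost of discoverability:
# MMI.input_scitype(::Type{<:SomeGPModel}) = STB.Unknown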

> Do the hyperparameter optimisers in MLJ know how to work with a field which contains an unconstrained vector of real numbers?

Not sure what you mean here by "constrained". By length or type? The type of a parameter need not be known beforehand but it generally cannot change during optimization. A field could be a vector of variable length, but access would still need to be via names, not indices. Names could be hard-wired or specified by the user at construction, as for Stack. I've just added a section in the docs to explain in more detail what is currently required. For my part, I'm open to suggestions for extending access but am wary of extensions that will add new conceptual burden to the user.
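
For concreteness, nested hyper-parameters are addressed by name in a range declaration along these lines (SomeGPModel is a placeholder constructor, the field path is the one from the earlier comment, and the bounds are illustrative):

using MLJ   # `range` is provided by MLJBase

model = SomeGPModel()
r = range(model, :(initial_parameters.noise_variance.unconstrained_value);
          lower = 1e-4, upper = 1.0, scale = :log)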

> More generally, does MLJ provide the tools to test that I've satisfied the assumptions made by hyperparameter optimisation algorithms anywhere?

No, but I've made a PR to add requirements to the docs (see above). You can test optimization by mimicking one of the examples in the tuning docs.

> Ah, I think I'm actually doing the correct thing here, based on your description.

Mmm. I'm not so sure. The new predict is predicting the marginals of the joint (multivariate) distribution, yes? That is not what is required:

> That is, we want [p(y | x1), p(y | x2), ..., p(y | xn)], which cannot be deduced from the previous distribution in general.

Here n is the number of observations in the provided input Xnew. In particular, the new predict should have the following property, following from the above definition:

predict(mach, Xnew[1:1, :])[1] == predict(mach, Xnew)[1]        # n > 1

I realize that Bayesians are interested in the marginals of the joint, but these are already accessible from the joint distribution (the output of predict_joint). However, to enable an apples-with-apples comparison of your models with other MLJ models we need something different for predict: the distribution of the univariate target y given a single observation x, for each observation x in Xnew. I don't believe this is the same as marginalising the joint distribution on a single component x, because of correlations.

One way to think of predict is that you are computing the joint distribution but with n=1 (so not really multivariate anymore) for each individual observation x_i in Xnew, and then combining these distributions into a vector of length n. See also the SossMLJ.jl implementation.
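
A rough sketch of that construction, assuming the fitted posterior is an AbstractGPs posterior stored in the fitresult (SomeGPModel and the fitresult layout are illustrative, not the actual glue-package code):

import MLJModelInterface as MMI
using AbstractGPs, Statistics, Distributions

mutable struct SomeGPModel <: MMI.Probabilistic end

function MMI.predict(model::SomeGPModel, fitresult, Xnew)
    x = MMI.matrix(Xnew)
    return map(eachrow(x)) do xi
        fx = fitresult.posterior(RowVecs(reshape(collect(xi), 1, :)))  # a joint with n = 1
        Normal(only(mean(fx)), sqrt(only(var(fx))))                    # univariate p(y | xi)
    end
end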

If you are still not convinced, please carefully read this comment, which arose out of this kind of misunderstanding in the Soss integration.

> What's MLJ's policy regarding publishing updates to models? Is it easy for model-implementers to publish updates to their offering in MLJ?

MLJ does not lock the user into any particular version of your models. Your package is never a dependency of any MLJ package. Users control the version of a model used by choosing an appropriate version of the interface package in their combined environment. If you make breaking changes to a model, MLJ will not trigger a new breaking release. If you change model meta-data (the values of all those traits, such as input_scitype) then you should raise an issue at MLJModels to force an update of the registry metadata, but sadly there is no mechanism for keeping the registry version synchronised with the algorithm-providing package. (As far as I know, this has not yet caused issues.) If meta-data is likely to change in the near future, I recommend we hold off registering the interface package.

willtebbutt commented 2 years ago

> Well, the MLJ philosophy is different scitype => different model. So the ideal solution would be to create different model structs for each input_scitype one intends to support, with the user-selected GP being restricted appropriately. I don't know how doable that is for your "generic" API.

I see. I think it might be better for us to leave this as unknown and rely on making users aware of the MLJ interface via other channels (on JuliaGP git repos etc) -- presumably if the users know what they're looking for, they'll be able to find it? Restricting the input type of a model to a single scitype would really be overly restrictive for us -- at the very least, users need to be able to specify whichever kernel they fancy (of which there are an unbounded number), and arbitrary kernels can accept arbitrary data types, so I don't think restrictions are generally possible in our case.

> Not sure what you mean here by "constrained". By length or type? The type of a parameter need not be known beforehand but it generally cannot change during optimization. A field could be a vector of variable length, but access would still need to be via names, not indices. Names could be hard-wired or specified by the user at construction, as for Stack. I've just added a section in the docs to explain in more detail what is currently required. For my part, I'm open to suggestions for extending access but am wary of extensions that will add new conceptual burden to the user.

Sorry, I just mean whether the hyperparameter optimisers know how to handle things like v in this model:

mutable struct MyModel{Tv<:Vector{<:Real}}
    v::Tv
end

I've taken a look at the linked description and I'm still not quite sure! I guess the hyperparameters in our models are a little unusual, in that you're not usually interested in their values as a user; rather, you're just interested in the output of fit, because the hyperparameters are just used as initialisation for an optimisation procedure inside fit, which has some local optima.

> No, but I've made a PR to add requirements to the docs (see above). You can test optimization by mimicking one of the examples in the tuning docs.

I'll take a look at these and see if I can apply one!

> Mmm. I'm not so sure. The new predict is predicting the marginals of the joint (multivariate) distribution, yes?

No, the marginals of the joint are just the per-observation predictive distributions in this context -- I presume this is the difference between conditioning on new observations (seeing additional ys) and asking for predictions at multiple locations (additional xs). I've added tests to this effect here. The only difference (up to numerics) in this case is that asking for all of the marginals at once is more performant than asking for them one at a time.
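
Here's a self-contained version of that check, using plain AbstractGPs with illustrative data (independent of the glue package):

using AbstractGPs, Statistics

x = [0.1, 0.4, 0.9]
y = [1.0, 2.0, 1.5]
f = posterior(GP(SEKernel())(x, 0.1), y)    # posterior GP; observation noise 0.1

xnew = [0.2, 0.7]
m_joint  = marginals(f(xnew, 0.1))                          # marginals of the joint
m_single = [only(marginals(f([xi], 0.1))) for xi in xnew]   # one location at a time

@assert all(isapprox.(mean.(m_joint), mean.(m_single)))
@assert all(isapprox.(std.(m_joint), std.(m_single)))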

> MLJ does not lock the user into any particular version of your models. Your package is never a dependency of any MLJ package. Users control the version of a model used by choosing an appropriate version of the interface package in their combined environment. If you make breaking changes to a model, MLJ will not trigger a new breaking release. If you change model meta-data (the values of all those traits, such as input_scitype) then you should raise an issue at MLJModels to force an update of the registry metadata, but sadly there is no mechanism for keeping the registry version synchronised with the algorithm-providing package. (As far as I know, this has not yet caused issues.) If meta-data is likely to change in the near future, I recommend we hold off registering the interface package.

This sounds fine for JuliaGPs -- we'll just need to ensure that we get the metadata sorted prior to registering.

ablaom commented 2 years ago

> I think it might be better for us to leave this as unknown and rely on making users aware of the MLJ interface via other channels (on JuliaGP git repos etc)

Fair enough.

> mutable struct MyModel{Tv<:Vector{<:Real}} v::Tv end

No. Unless you implement a getproperty/setproperty! interface to access individual elements, you will only be able to mutate v as a single object. So, for example, you could specify a range for TunedModel that looks like this:

r = range(my_model, :v, values=[[1.0, 1.5, 1.6], [2.0, 1.7, 1.8]])

which would optimise over just two alternative sets of hyperparameters stored in v. So it's possible, but clunky.
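
For the record, the getproperty/setproperty! route mentioned above would look roughly like the following sketch, exposing the elements of v under the hypothetical names v1 and v2 so that each can be mutated, and hence tuned, by name:

mutable struct MyTunableModel
    v::Vector{Float64}
end

function Base.getproperty(m::MyTunableModel, name::Symbol)
    name === :v1 && return getfield(m, :v)[1]
    name === :v2 && return getfield(m, :v)[2]
    return getfield(m, name)
end

function Base.setproperty!(m::MyTunableModel, name::Symbol, x)
    if name === :v1
        getfield(m, :v)[1] = x
    elseif name === :v2
        getfield(m, :v)[2] = x
    else
        setfield!(m, name, x)
    end
    return x
end

Base.propertynames(::MyTunableModel) = (:v, :v1, :v2)

After that, a declaration like range(my_tunable_model, :v1, lower=0.0, upper=1.0) behaves as for an ordinary scalar field.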

> No, the marginals of the joint are just the per-observation predictive distributions in this context...

Looks like this is no longer an issue, thanks to your convincing test. Thanks for your patience on this one.

> we'll just need to ensure that we get the metadata sorted prior to registering.

Let me know if I can be of further help on this, and when you'd like me to go have another look at what's there.

By the way, a good integration test is something like

using MLJBase
model = MyModel()
X, y = make_regression()      # or make_blobs() in the case of classification
evaluate(model, X, y)

ablaom commented 2 years ago

@willtebbutt Not heard anything in a little while. Do you need anything further from my end?

willtebbutt commented 2 years ago

Sorry, no, nothing extra needed on your end. Work on this has just stalled at my end because I'm up to my eyes in PhD work. It's definitely still on JuliaGP's agenda though.