JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Recent proposal for design of package interface #16

Closed ablaom closed 5 years ago

ablaom commented 5 years ago

Please provide any new feedback on the proposed glue-code specification below. @fkiraly has posted some comments here. It would be helpful also to have reactions to the two bold items below.

I will probably move the “update” instructions for the fit2 method to model hyperparameters, leaving keyword arguments for package-specific features (there are not many use cases). The method will be simplified into an argument-mutating function that does not take the data as arguments. (If the data really needs to be revisited, a reference to it can be passed via the cache.) The document will explain the use cases for this better.

I will require all Model field types to be concrete.

Immutable models. To improve performance, @tlienart has recommended making models immutable. Mutable models are more convenient because they avoid the need to implement a copy function, and because you can make a function (e.g., a loss) a hyperparameter (you don't need to copy it). The first annoyance can mostly be dealt with by a macro. To deal with the second, you replace the function with a concrete "reference" type and use dispatch within fit to recover the function you actually want, or something along those lines. The catch is that you need to know ahead of time which functions you might want to support. For unity, we might want to prescribe this part of the abstraction ourselves (for common loss functions, optimisers, metrics, etc.), or borrow it from an existing library.
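A minimal sketch of the "reference type" trick just described; all names here (LossReference, SquaredLoss, lossfunction, MyRegressor, training_loss) are hypothetical and not part of any spec:

abstract type LossReference end
struct SquaredLoss <: LossReference end
struct AbsoluteLoss <: LossReference end

# map each reference type back to the function it stands for
lossfunction(::SquaredLoss) = (yhat, y) -> (yhat - y)^2
lossfunction(::AbsoluteLoss) = (yhat, y) -> abs(yhat - y)

struct MyRegressor{L<:LossReference}   # immutable; field types stay concrete
    lambda::Float64
    loss::L
end

# inside fit, the actual function is recovered by dispatch on the reference:
training_loss(model::MyRegressor, yhat, y) = sum(lossfunction(model.loss).(yhat, y))

training_loss(MyRegressor(0.1, SquaredLoss()), [1.0, 2.0], [1.5, 1.5])  # 0.5

The drawback noted above is visible here: only losses that have been given a reference type ahead of time can be used.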

When I wrote my Flux interface for Koala I found it very convenient to use a function as a hyperparameter to generate the desired architecture, essentially because a "model" in Flux is a function. (I suppose one could (should?) encode the architecture à la ONNX or similar.)

My vote is to keep models mutable, to make life more convenient for package-interface writers, and because I'm guessing the performance drawbacks are small. However, others may have a more informed opinion than I do. For what it is worth, ScikitLearn.jl has mutable models.

What do others think about making models immutable?

Defaults for hyperparameter ranges. Is there a desire for interfaces to prescribe a range (and scale type) for hyperparameters, in addition to default values? (To address one of @fkiraly's comments: default values and types of parameters are already exposed to MLJ through the package interface's model definition.)

fkiraly commented 5 years ago

What do others think about making models immutable?

I’d agree with keeping them mutable - I’m not Julia-literate enough to see all the consequences, but I have a feeling that the various proposed interfaces, including the hyperparameter interface, would not work well with immutable models, or might require brittle (and potentially obscure) workarounds such as referencing/macros.

Defaults for hyperparameter ranges. Is there a desire for interfaces to prescribe a range (and scale type) for hyperparameters, in addition to default values?

Personally, I don’t feel this needs to be in the MVP/core design if there is error capture and solid evaluation – something outside the “real range” would simply be caught as an error or show up as a badly performing setting.

Below I’m copying part of one of my e-mails from an earlier discussion with Diego, which contains a more involved (but potentially fiddly) proposal. As said, I would not prioritize it; it's just for consideration.

From the discussion on hyperparameter abstraction: having thought about the parameter set interface, I’ve understood that the distinction between grids, ranges, and single parameters is, on the mathematical side, largely artificial, since all three are sets – a range may be infinite, a grid is a finite (discrete) set, and a single parameter is a single-element set.

Julia’s dispatch offers a very interesting solution to this:

We could have a single dispatch (or flag) structure of parameter-set structs, where the ordering is single-element-set < discrete-set < set.

There is further the special case of Cartesian products, and of single-parameter (not single-element) ranges, which can be any of these. Cartesian products can be taken over multiple parameter-set structs, and they end up in the highest class among the sets they are taken over.

So what I called “ParamSet”, “ParamRange”, and “ParamGrid” are actually instances of the same type structure; let’s call its top element “ParamSet”. I’m not sure, though, how much of this to encode in the type dispatch order and how much through flags and nesting.

But in this view, “makeGrid” would be easy: it takes a set and makes it a discrete set, which can then be dispatched upon by “getNumOfPts” and “getKthGridPt”. As signatures go:

makeGrid : ParamSet -> discreteParamSet
getNumOfPts : discreteParamSet -> integer
getKthGridPt : discreteParamSet x integer -> singleParamSet
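A rough Julia rendering of this proposal, just to make the signatures concrete (all type and function names here are hypothetical):

abstract type ParamSet end                       # "set": possibly infinite
abstract type DiscreteParamSet <: ParamSet end   # finitely many points

struct SingleParamSet{T} <: DiscreteParamSet     # exactly one point
    value::T
end

struct ParamRange{T<:Real} <: ParamSet           # a (possibly infinite) interval
    lower::T
    upper::T
end

struct ParamGrid{T} <: DiscreteParamSet          # an explicit finite collection
    points::Vector{T}
end

# makeGrid : ParamSet -> discreteParamSet
makegrid(r::ParamRange; n=10) = ParamGrid(collect(range(r.lower, stop=r.upper, length=n)))

# getNumOfPts : discreteParamSet -> integer
numofpts(g::ParamGrid) = length(g.points)
numofpts(::SingleParamSet) = 1

# getKthGridPt : discreteParamSet x integer -> singleParamSet
kthgridpt(g::ParamGrid, k::Integer) = SingleParamSet(g.points[k])

g = makegrid(ParamRange(0.0, 1.0), n=5)
kthgridpt(g, 3)   # SingleParamSet(0.5)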

Kindly let me know if you have any questions – does this make sense?

ablaom commented 5 years ago

I have just realised that it may be important to be able to make copies of models (containers for hyperparameters). If I want a composite model to automatically suppress unnecessary retraining of a component model, then the question of whether or not a component needs retraining depends on which hyperparameters of the composite have changed (i.e., which sub-hyperparameters have changed). However, to compare the current value of the model with the previous one, I need to keep a copy of the previous one (i.e., not just hold a reference whose target will have changed!). The alternative is not to automate the suppression of retraining, but to allow "freezing" and "unfreezing" of component models via additional hyperparameters of the composite model. One might view this as a dangerous proposition, for one would then need to know more than just the hyperparameters to know what training has actually been carried out; one would need to know the complete sequence of hyperparameter settings.

fkiraly commented 5 years ago

I'm not sure why you think mutating hyper-parameters cannot be avoided. Can you explain the issue precisely? Perhaps I don't understand it fully.

In general, the "stylized" model interface design I often have in mind will have hyper-parameters which do not change when fitting (e.g., regularization constant in ridge regression), and model parameters which do (e.g., coefficients of the linear functional). Conversely, the user can manually set hyper-parameters from the outside, but not model parameters. The distinction is not mathematically justified, but purely an interface convention, as above.

In the simplest instance of "learning networks", namely, grid-tuned hyper-parameters, the separation can still be maintained as follows:

This requires encapsulating the two kinds of parameters in the first-order operation, though - which is why an explicit hyper-parameter interface would be nicer than a non-explicit usage convention, in my opinion.

tlienart commented 5 years ago

FWIW, this is what I mentioned to you by PM, @ablaom: I agree with what I believe @fkiraly is suggesting. One way of doing this might be to have mutable models in which some of the attributes are immutable and represent, e.g., regularization. Here's what I had written for a draft generalised linear regression:

mutable struct GeneralizedLinearRegression <: RegressionModel
    loss::Loss                                 # an immutable type, not a function
    penalty::Penalty                           # likewise an immutable type
    fit_intercept::Bool
    n_features::Union{Void, Int}               # (Void is Nothing in Julia >= 0.7)
    intercept::Union{Void, Real}               # filled in by fitting
    coefs::Union{Void, AbstractVector{Real}}   # filled in by fitting
end

The particulars are not very important, but what is maybe of interest is that loss and penalty are immutable types, not functions; this is (I believe) more "Julia"-like, on top of providing the encapsulation that I believe @fkiraly is talking about above.

Of course, there's always the ambiguity that you could, in theory, change the loss function after such a model has been fitted. If that's important, you could use Refs (maybe uglier) to make this harder:

struct GeneralizedLinearRegression <: RegressionModel
    loss::Loss
    penalty::Penalty
    fit_intercept::Ref{Bool}
    n_features::Ref{Int}
    intercept::Ref{<:Real}
    coefs::Ref{<:Real}
end

(as an aside, for those who may not be familiar with Julia:)

struct Bar                        # immutable wrapper ...
    val::Ref{<:Real}              # ... around a mutable reference
end
Bar(v::T) where T<:Real = Bar(Ref(v))
fit!(b::Bar) = (b.val[] = 0.5)    # mutates the referenced value, not the struct
b = Bar(1.0)
fit!(b)   # now b.val[] == 0.5

fkiraly commented 5 years ago

@tlienart, yes, I think we mean the same thing, but just to make clear what I meant: in the example above, there would be two cases:

(a) The linear regression model is not tuned. In this case loss, penalty, fit_intercept, and n_features cannot be changed by the model; they can be set by the user at initialization of the model ("hyper-parameters"). On the other hand, intercept and coefs are set by the model's "fit" method, given data.

(b) The linear regression model is tuned (e.g., by grid search) within a learning network. If all parameters are fully tuned, all six parameters are model parameters and cannot be set by the user at initialization. Instead, they are determined through fitting. The model has no hyper-parameters (or only ones specifying how the tuning is done).

For the distinction to be possible, I think there needs to be an abstraction which tells you which fields of the struct are of which kind. As your code stands, the tuning method, any workflow abstraction, or the user would not know how "coefs" differs from "penalty" or "fit_intercept". But, in my opinion, these are clearly different kinds of parameters (as outlined above).
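One way such an abstraction could look, using the GeneralizedLinearRegression draft above (a sketch only; the trait functions hyperparameters and learnedparameters are hypothetical names):

# declare which fields the user may set at construction time ...
hyperparameters(::Type{GeneralizedLinearRegression}) =
    (:loss, :penalty, :fit_intercept, :n_features)

# ... and treat every remaining field as a learned (fitted) parameter
learnedparameters(T::Type) =
    Tuple(f for f in fieldnames(T) if !(f in hyperparameters(T)))

learnedparameters(GeneralizedLinearRegression)   # (:intercept, :coefs)

Tuning methods and workflow abstractions could then query these traits instead of relying on a naming convention.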

tlienart commented 5 years ago

(Edited quite a bit; the Ref approach is unnecessary.) In fact, something that may combine what I understand to be @fkiraly's idea with what I'm suggesting above is the following:

mutable struct LearnedParameter{T}   # mutable wrapper for anything that fitting updates
    p::T
end

struct Model                         # the model itself stays immutable
    x::Int                                               # a fixed (hyper)parameter
    c::LearnedParameter{Vector{T}} where T <: Real       # learned coefficients
end

m = Model(1, LearnedParameter(randn(5)))
fit!(m::Model, v::Vector{<:Real}) = (m.c.p = v; m)       # update only the learned part
fit!(m, [1.0, 2.0, 3.0])

ablaom commented 5 years ago

There seems to be quite a bit of confusion about my question "Should models be mutable?" for I don't think we all mean the same thing by "model". My apologies for any part of the confusion.

According to my definitions, a model is a container for hyperparameters only (things like regularisation). Taking a purely practical point of view, a hyperparameter is something I can pass to an external package's fit method (e.g., regularisation); parameters learned by the package algorithm (e.g., coefficients in a linear model or weights in a neural network) are not hyperparameters. They will form part of what has been called the fit-result, on which we dispatch our MLJ predict method; but the details of what is inside a fit-result are generally not exposed to MLJ. (As an aside, if one really does want access to the internals, the way to do this is to return the desired information (e.g., coefficients) in the report dictionary returned by the package interface's fit method, as I explain in the spec.)
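Schematically, the separation being described might look as follows. This is an illustrative sketch only, not the spec referred to above; RidgeRegressor and the exact signatures are made up:

using LinearAlgebra

mutable struct RidgeRegressor        # the *model*: a hyperparameter container
    lambda::Float64
end

# fit returns the learned parameters (the "fit-result"), a cache, and a report
function fit(model::RidgeRegressor, X::Matrix, y::Vector)
    coefs = (X'X + model.lambda*I) \ (X'y)   # learned parameters
    report = Dict(:coefs => coefs)           # internals exposed on request
    return coefs, nothing, report            # fit-result, cache, report
end

# predict dispatches on the model and the fit-result; learned parameters
# are never stored in the model itself
predict(model::RidgeRegressor, fitresult, Xnew) = Xnew * fitresult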

As @fkiraly points out, in a tuning (meta)model that we construct within MLJ, the hyperparameters of a model now take on the role of learned parameters.

With this clarification: should models be mutable? And should I be allowed to have functions (e.g., loss) as hyperparameters (i.e., fields of a model)?

It is my feeling that we should make it very easy for someone to write an MLJ interface for their package. Ideally, they shouldn't need to understand a bunch of conventions about how to represent things or be familiar with this or that library. So I'm inclined to say that they can make any object they like a hyperparameter, provided they are able to implement a copy function and an == function for the model type. I think we need this; see here. That said, I can't now see why we shouldn't make models immutable, except for the extra hassle of implementing tuning. But I admit I am still a bit nervous about doing so.

tlienart commented 5 years ago

I think that, to clear up the confusion, it would be good to have a bare-bones API (either mutable or immutable), plus an answer to: where are the learned parameters stored, if not in the model itself?

In the DecisionTree example it is indeed just hyperparameters; however, this may not be the case for, say, a generalised linear regression model, where you would want one container covering different regressions (e.g., Ridge, Lasso, ...) rather than one container per specific regression. In that case you may have a mix of hyperparameters and parameters that actually define what the model is.

ablaom commented 5 years ago

I would call these hyperparameters also. I guess your point is that some of the fields of Model might not be changed in tuning, only at some higher level, like benchmarking/model selection or whatever?

ablaom commented 5 years ago

@fkiraly If I want to check if a Model instance has really changed, it needs to be immutable in Julia, unless I want to overload the default == method:

mutable struct Foo
    x
end

f = Foo(3)
g = Foo(4)
g.x = 3
f == g # false

And even:

Foo(3) == Foo(3) # false

While,

struct Bar
    x
end

f = Bar(3)
g = Bar(3)
f == g # true
f === g # true

(The last result shows that f and g are indistinguishable, in the sense that no program could tell them apart.)
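For completeness, the "overload ==" workaround alluded to above would look something like this (a field-wise comparison; a macro could generate it automatically):

import Base: ==
==(a::Foo, b::Foo) = all(getfield(a, f) == getfield(b, f) for f in fieldnames(Foo))

Foo(3) == Foo(3)   # now true, even though Foo is mutable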

fkiraly commented 5 years ago

@ablaom @tlienart I think we have two different designs here for the modelling strategy and the fitted model, so we need to think carefully - a decision will have to be made in favour of (at most) one of them.

The two designs are:

(a) The modelling strategy is a struct which contains both parameters and hyperparameters. Fitting mutates the struct, but only the parameters ("LearnedParameters" in @tlienart's design), which jointly encode the fitted model.

(b) The modelling strategy is a struct which contains only hyperparameters. Fitting produces a fitted model, distinct from the model struct.

I can see pros and cons of either: (a) makes it easier to keep the fitted model in one place together with the modelling-strategy specification it came from; (b) makes it easier to fit the same modelling strategy to different data without over-copying the specification.

fkiraly commented 5 years ago

On a tangential note: suppose we were to do something similar to Keras, where in fitting the component models can update each other sequentially and multiple times, e.g., with backprop as a meta-algorithm applied to interconnected GLMs. Which of the two designs would work better? We may want to rapidly update the fits in a specific sequence, which also needs access to the specs.

ablaom commented 5 years ago

My experience in Julia is that it is a bad idea to inseparably fuse together data structures that have different functionality - in this case the model strategy (hyperparameters) and the learned parameters. Indeed, in my first attempt at Koala I did exactly this and lived to regret it.

In my vision these two are separate at the ground (i.e., package interface) level but come together at an intermediate level of abstraction in a "trainable model", which combines the model (a hyperparameter container) with the training data, the fit-result (learned parameters), and a cache.

This is pretty much the specification of Learner in MLR3, incidentally, without the cache.

When you call fit! on a trainable model, you call it on the rows of data you want to train on, and the lower-level fit/update methods create or update the fit-result (and cache).
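A rough sketch of such a "trainable model" (field names hypothetical; the fit being called is assumed to be a package-level method returning fit-result, cache, and report, as in the schematic further up):

mutable struct TrainableModel{M}
    model::M        # hyperparameters only
    X               # training data: features
    y               # training data: target
    fitresult       # learned parameters, created/updated by fit!
    cache           # whatever the package wants kept between calls
    report          # extras exposed by the package interface
    TrainableModel(model::M, X, y) where M = new{M}(model, X, y)
end

# train on a subset of rows; the lower-level fit creates/updates the fit-result
function fit!(tmodel::TrainableModel, rows)
    tmodel.fitresult, tmodel.cache, tmodel.report =
        fit(tmodel.model, tmodel.X[rows, :], tmodel.y[rows])
    return tmodel
end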

fkiraly commented 5 years ago

Fair enough - but just to reiterate the point: is it correct that, in consequence, you reject both @tlienart's design and your own earlier one, from Koala?

I have no strong opinion as long as it is done consistently, but, as said, it is one of those early design decisions which one may, in the worst case, come to regret, so it is worth thinking about carefully...

tlienart commented 5 years ago

I may be a bit dense, but I don't see the problems/difficulties; having (abstract) code that exposes the issues would be a great help in pinning down (my) ideas.

Originally, my thinking was that one model = one function that acts as a proxy for some function you care about but don't have access to ("nature"). That justified (IMO) having characterising elements (immutable) and values for the parameters. Then you can just apply that object, as you would a function, to predict.

struct Model1
  hparam::HyperParameter{...} # e.g. what loss, what penalty + their parameters, how / when to prune, tree depth, ....
  param::LearnedParameter{...} # e.g. regression coefficients, tree splits
end
(::Model1)(X) = .... # effectively "compute the function on X" or "predict"
fit!(m::Model1, X, y) = ... # update m.param via refs.

Anyway, that much I imagine is clear to you. In terms of hyperparameter tuning, there's no real problem: for each hyperparameter setting to check (e.g., from a grid search), one such model is created, fitted, and kept if necessary. Composing such structures into meta-models is also easy, AFAICT. It also seems advantageous that this whole structure is pretty simple compared to effectively doubling up all structures (e.g., for a regression there would be one container for the hyperparameters etc. and one container for the regression coefficients (?)). I also imagine that this could hurt you in the backprop/graphical-model setting you're talking about, Franz, but again, maybe I just don't clearly see the proposal that @ablaom is suggesting.

I'm also not sure I see point (b) of your earlier summary, @fkiraly: if you have multiple dispatch, with fit! methods that go from very broad ("fallback" methods, e.g., L-BFGS) to very specific (e.g., "analytic" for ridge), then applying the same strategy effectively just means calling fit! with the same parameters and updating the learned parameters. There's also something to be said for having a "starting-point model" that gets iteratively fitted as data comes along; the learned parameters then effectively act as the "cache". But again, maybe I'm too fixed on a simple understanding of the problems; I think seeing some elementary code for the issues @ablaom discusses above would make it clearer why "my" idea might in fact not make sense or just not do the job.

Also, like @fkiraly, I'm not hell-bent on this and am perfectly happy to go with one idea and stick with it; to be fair, I just don't really understand the problems, so it's more a question of trying to understand them.

fkiraly commented 5 years ago

Maybe it would be helpful to understand what went wrong for you, @ablaom, with design (a)?

Re. "I don't see (b)" for @tlienart : the situation is in which you want to fit the model with the same hyperparameter settings to many different datasets or data views. In this situation, you have a single model specification container, but many fitted model containers.

fkiraly commented 5 years ago

Btw, there is another argument I can see in favour of (b): it follows the design principle of separating "instruction" from "result", if you equate the two with the model strategy specification and the fitted model. Of course, it makes sense to have the "result" point to the "instruction".

And here's one in favour of (a): in (b), how would you easily update a fitted model with regard to hyper-parameters without losing the reference to the instructions it arose from?

Another one in favour of (a): it follows more closely the "learning machine" idea, where the fitted model is interpreted as a "model of the world" which the AI (the model/class etc.) holds. Note that this interpretation contradicts the design which sees the fitted model as a "result" - instead it sees it as a "state".

fkiraly commented 5 years ago

So ... is it true that advanced operations such as update, inference, adaptation, or composition happen more naturally in design (a)? I feel @ablaom is the most likely to disagree, so I would be keen to hear counterarguments.

ablaom commented 5 years ago

The one model - one function paradigm is not ideal. A transformer (e.g., an NN encoder-decoder) has two methods: transform and inverse_transform. Also, we might want to consider classifiers as having a predict method and a predict_proba method. And a resampling method might predict a mean performance, or a standard error, and so forth. That is, multiple methods may need to dispatch on the same fitted model. This complicates learning network design, because you may have different nodes implementing different methods on the same fit-result (e.g., transforming the target for training a model and inverse-transforming the predictions of that model).
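To make the point concrete, a toy example (names illustrative, not from any spec) of several operations dispatching on one fit-result:

using Statistics

struct UnivariateStandardizer end    # the model: no hyperparameters needed here

# a single fit-result ...
fit(::UnivariateStandardizer, x::AbstractVector) = (mu = mean(x), sigma = std(x))

# ... on which more than one operation dispatches:
transform(::UnivariateStandardizer, fitresult, x) = (x .- fitresult.mu) ./ fitresult.sigma
inverse_transform(::UnivariateStandardizer, fitresult, z) = z .* fitresult.sigma .+ fitresult.mu

fr = fit(UnivariateStandardizer(), [1.0, 2.0, 3.0])
transform(UnivariateStandardizer(), fr, [1.0, 2.0, 3.0])   # [-1.0, 0.0, 1.0]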

As to whether we should adopt design (a) or design (b) at the level of glue code, practical concerns make (b) the clear choice: we can always combine parameters and hyperparameters at a higher level of abstraction (which is what I intend, and sketch above), but we can never unfuse them if they are already joined in the bottom-level abstraction.

@fkiraly At present I would allow the user to mutate the hyperparameter part of a combined "machine" and then call fit! (without specifying hyperparameters) to update the fit-result (learned parameters). However, this means that before the fit! call the hyperparameters and learned parameters are not yet in sync. Although there is potential for trouble, this is very convenient. What do you think?

fkiraly commented 5 years ago

@ablaom , I think there is an important difference between "one model - one function" and "one model - one container".

The first ("1 model - 1 function") is arguably silly and if I understand it no one in this thread would want it. We may need to dispatch on fit, predict, trafo, backtrafo, predict_proba, what_are_my_hyperparameters, predict_supercalifragilistic and similar, but at least on fit/predict, which is 2>1.

The distinction between (a) and (b) is where it stands on the "one model - one container" issue, I think both designs agree with "one model - many interface methods" as is in my opinion reasonable.

Also, the case for (b) is not so clear-cut - even if you have a single container, you can always write multiple accessor methods, say instructions_of(container) and fitted_model_in(container).

I still don't have a strong preference, just playing the devil's advocate and pointing out what I see as a gap in reasoning.

fkiraly commented 5 years ago

More precisely, I don't think the situation is that clear: whenever you have a collection of (dispatch-on or OOP) objects and class methods, there is always a decision on a spectrum between tacking things together and taking them apart. Axes along which to consider this are user-friendliness, clarity of code, and the semantic/operational sensibility of tacking the things together or not.

For model instructions and fitted models this is not so clear to me: the first can live almost entirely without data, and the second may be invoked multiple times for a given instruction set ("hyper-parameters"). Python/sklearn and R-base handle the issue differently and I can see the merits of both.

ablaom commented 5 years ago

The design has solidified considerably since this discussion and I am closing the issue.