JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Integrating online and active learning models #60

Open KnutJaegersberg opened 5 years ago

KnutJaegersberg commented 5 years ago

Integrating OnlineStats (its online learning algorithms) and giving it an easy-to-use hyperparameter-tuning context would make Julia even more useful for quick ML on genuinely big data.

ablaom commented 5 years ago

Sorry, but is this a comment or feature request?

fkiraly commented 5 years ago

I believe it's both?

Generally, on-line learning is quite a relevant and important area. For a package to support on-line learning properly, it needs to support: (i) sequential data streams (where the data may not be i.i.d.), and (ii) an on-line update, i.e., updating the model when new data comes in.

Parallelization and distributed computation are separate features that are nice on their own, but quite synergistic with on-line learning.

As far as I can see, OnlineStats supports (i) sequential data streams and (ii) updating through its fit method, as well as some simple parallelism through its interface design, which is very nice.
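For reference, the basic OnlineStats usage pattern looks roughly like this (a sketch assuming the current OnlineStats.jl API):

using OnlineStats

o = Mean()                    # an online statistic with O(1) state
fit!(o, [1.0, 2.0, 3.0])      # update with a batch from the stream
fit!(o, 4.0)                  # ... or with a single new observation
value(o)                      # current estimate: 2.5

# simple parallelism: fit independent chunks, then merge the statistics
o2 = fit!(Mean(), [5.0, 6.0])
merge!(o, o2)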

I see two main blockers for interfacing:

  1. there's no explicit hyper-parameter interface

  2. mlj has no explicit design for the on-line task, which is more complicated than the simple supervised task.

Point 1 is straightforward to solve, though obviously it's work (and maybe best done by the onlinestats folks?).

Regarding point 2, this is more subtle: for interface hygiene, I don't like the design decision of onlinestats that fitting is always updating. I'd rather separate "fit" and "update", clearly distinguishing "first-time fitting" and "updating". This would, i.m.o., also make a lot of sense with Bayesian models, for the Bayesian update - Bayesian models are often automatically on-line (but not necessarily on sequential data streams as the stylized ML on-line setting).
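Schematically, the separation I have in mind looks like this (names are illustrative only, not an existing API):

# first-time fitting consumes a batch and produces fitted state:
state = fit(model, Xbatch)

# updating consumes only the previous state plus the new data:
state = update(model, state, Xnew)

# Bayesian analogy: fit computes a posterior from prior and data;
# update treats the old posterior as the new prior.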

Any thoughts?

Though generally, I wouldn't see supporting the on-line modelling task a priority above "getting mlj core working", obviously.

jpsamaroo commented 4 years ago

As mentioned in #71, I'm interested in adding support for active learning to MLJ. My use case is training models on real-time data, like microphone or camera input (and outputting the model's reaction to actuators/devices in real time).

@fkiraly can you give a concrete example of how you would split OnlineStats' fit method into two components? I'm not clear on how or why that's beneficial from your comment alone, since OnlineStats "models" usually do very little during their fit call.

jpsamaroo commented 4 years ago

Bump. Can someone provide me an example of what they'd like the online learning API to look like, so that I can build out the needed code/interfaces to support this feature?

ablaom commented 4 years ago

Thanks @jpsamaroo for re-pinging this discussion and for the offer to help.

For clarity, here's my understanding of basic online learning: A supervised or unsupervised machine learning algorithm that has already been trained on some data X is supplied with new data Xnew and is retrained:

(i) as if the training data were X and Xnew combined, but without the algorithm needing access to the previous training data X; and

(ii) in a time approximating the time required to train on Xnew alone.

In some cases the learned state based on "train with X and update with Xnew" is not exactly the same as the state based on "train with X and Xnew together", but it is a useful approximation.

Not all machine learning algorithms directly support online learning.

Basic work-flow

Here's how I see the basic work-flow for training and updating an MLJ learner. For concreteness, I will suppose the learner is unsupervised, in this case a PCA model for dimension reduction.

using MLJ

X = MLJ.table(rand(1000, 17))

# initialize and train on first batch:
model = @load PCA
mach = machine(model, X)
fit!(mach)

# fit on second batch of data:
Xnew = MLJ.table(rand(10, 17))
inject!(mach, Xnew)
fit!(mach)

When new data is injected into a machine, the machine updates an internal count of the number of injections. When this count is one or more, the next call to fit! calls update_data(model, ...) instead of fit(model, ...) or update(model, ...) (the latter being reserved for updates triggered by hyperparameter changes, such as increasing an iteration count).
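Schematically, the machine-level dispatch might look like this (a sketch only: inject!, update_data, and the injection counter are proposed, not existing, API):

function fit!(mach::Machine)
    if mach.n_injections > 0
        # new data was injected since the last fit: incremental update
        mach.fitresult, mach.cache, mach.report =
            update_data(mach.model, mach.fitresult, mach.injected_data...)
        mach.n_injections = 0
    elseif isdefined(mach, :fitresult)
        # hyperparameters changed (e.g., iteration count): warm restart
        mach.fitresult, mach.cache, mach.report =
            update(mach.model, 1, mach.fitresult, mach.cache, mach.args...)
    else
        # first call: ordinary training
        mach.fitresult, mach.cache, mach.report =
            fit(mach.model, 1, mach.args...)
    end
    return mach
end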

Composing online learners

If a learner does not support online learning, then I suggest the effect of the update be "leave machine unchanged" and "issue warning, if this is the first update". In that way, if a learning network contains both online and non-online models, then the overall "online" learning network continues to have utility, and can be exported (blueprinted) to generate a new online model type.
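For example, the fallback could be as simple as this sketch (update_data being the proposed method above):

# fallback for models with no online support: ignore the new data,
# keep the existing fitresult, and warn only on the first update
function update_data(model::Model, fitresult, new_data...)
    @warn "$(typeof(model)) does not support online updates; new data ignored." maxlog=1
    return fitresult, nothing, nothing
end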

An alternative is that updating a non-online learner with new data, and then fitting, actually retrains the learner from scratch on just the new data. This is more complicated to deal with because, in the common use case (train on the first batch of data and then leave alone), we would need extra interface points for freezing non-online components once trained. The advantage would be that we could, in principle, also unfreeze these components to "re-calibrate" the non-online elements. Is there a substantial use case for this?

We will need syntax for the learning networks. It would look like this:

Xs = source(X)

mach = machine(model, Xs)
Xout = transform(mach, Xs)

# fit on first batch of data:
fit!(Xout) 

# add data and update:
inject!(Xs, Xnew)
fit!(Xout)

Implementation

In brief, to implement the above requires: (i) the new inject! method for machines (and their sources); and (ii) a new data-update method in the model API, with a signature along the lines of

MLJModelInterface.online_update(model::Model, fitresult, verbosity, new_data...) -> (fitresult, state, report)

to be implemented by those models supporting online updates.

The more difficult design decisions revolve around deployment, tuning and control. Unlike the control of, say, a neural network ("train until the error stops decreasing", or whatever), control of an online learner in deployment is driven by events outside of MLJ. What's the best way to handle this in Julia?

That said, the framework should be similar to that suggested in "Model wrapper for controlling iterative models", or a single wrapper could be used for both, as @fkiraly has suggested.

The pragmatic way forward, which I would advocate given current resources, would be to implement the basics outlined above and test them on some examples, fleshing out the other design issues later.

Thoughts anyone?

In terms of implementing the basics, I expect it is best that I take this up. However, help with implementing online/iterative method control would be greatly appreciated. In addition to the design outlined in the issue, I have more detailed sketches for the iterative control wrapper that I can share.

Oblynx commented 4 years ago

I'm developing an online unsupervised learning model for time series, which can do prediction/anomaly detection when coupled with a supervised model. As I'm looking for a standardized interface, I'm thinking of experimenting with MLJ. This could be a use case coupling this issue with #303 and #51. I mention it just as food for thought at the moment.

ablaom commented 4 years ago

Thanks for that. It might be a challenge to introduce time series and online learning to MLJ simultaneously, but all help and input is welcome.

On the time series front, see also #303 (continuing time-series related discussion there) and https://github.com/alan-turing-institute/ScientificTypes.jl/issues/14 .

cscherrer commented 4 years ago

To generalize this a bit from a discussion with @ablaom on Slack, it seems like there are at least four different cases to consider:

  1. Change the model itself, for example warm restart after changing a hyperparameter
  2. Update model fit, with no change to the data
  3. Update model fit based on a change to the observations
  4. Update model fit based on a change to the features

For (4), lots of statistical models can be fit in terms of sufficient statistics. If we add or remove features, there are often ways to efficiently update those sufficient statistics without starting from scratch.

For example, say we have a linear model with squared loss (and maybe some arbitrary regularization). This can be fit using a Cholesky decomposition of X' * X. If we add a feature, we may have some way to update the Cholesky factor, rather than recomputing the decomposition.

In addition, in this situation we'd want to be able to use a previous model fit as a starting point, maybe just starting the weight for the new feature at zero.
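Concretely, here is a sketch of that kind of update (illustrative only; it assumes we have the lower-triangular Cholesky factor L of X'X):

using LinearAlgebra

# Append one feature column x to X and extend the factor, avoiding a
# full refactorization. Writing the bordered matrix as
# [X'X  X'x; x'X  x'x] = [L 0; l21' λ] * [L 0; l21' λ]',
# l21 solves L*l21 = X'x and λ = sqrt(x'x - l21'l21).
function grow_cholesky(L::LowerTriangular, X::AbstractMatrix, x::AbstractVector)
    a = X' * x                    # cross-products with the new feature
    l21 = L \ a                   # forward substitution, O(p^2)
    λ2 = dot(x, x) - dot(l21, l21)
    λ2 > 0 || error("new feature is numerically collinear with the others")
    p = size(L, 1)
    Lnew = zeros(eltype(L), p + 1, p + 1)
    Lnew[1:p, 1:p] .= L
    Lnew[p+1, 1:p] .= l21
    Lnew[p+1, p+1] = sqrt(λ2)
    return LowerTriangular(Lnew)
end

# usage: re-solve for the coefficients with the extended factor,
# warm-starting the new feature's weight at zero as suggested above
X = randn(100, 5); x = randn(100)
L = cholesky(Symmetric(X' * X)).L
Lnew = grow_cholesky(L, X, x)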

ExpandingMan commented 1 year ago

I've recently come up with a workaround for this feature in which I update an xgboost model by defining

MLJBase.fit!(m::Machine, X, y)

and I've spent a bit of time considering whether this can be generalized.

For the cases that @cscherrer laid out above, I think 1, 2, and 3 should be relatively easy (for models where they are possible at all), while 4 is likely to be very hard.

I'll summarize some of the thoughts I've had about a fit!(m, X, y) pattern:

Something like this seems like it would be easier than @ablaom's inject! above, since we wouldn't have to worry about what the machine does with the injected data (i.e. it would have to store it between calls to inject! and fit!).

Thoughts?

ablaom commented 1 year ago

The syntax fit!(mach, X, y) sounds like a good suggestion - we probably don't need to separately attach new data to the machine and then train. However, I can't see how it is possible to implement incremental learning purely at the machine level. Don't we need a method in the model API that tells us how to add data (without discarding learned parameters)? After all, not all models can do this. (Perhaps there is some confusion about MLJModelInterface.update. This is not a method for adding data, only for responding to changes in hyper-parameters (e.g., an iteration parameter) that needn't trigger a cold restart.)

ExpandingMan commented 1 year ago

Don't we need a method in the model API that tells us how to add data (without discarding learned parameters)?

That's why I think the Machine interface makes this a lot more complicated than it is for most of the models themselves. Most models already implement something like fit!(model, X, y)... it seems a pretty safe bet that in the vast majority of cases you will just have something like

fit!(mach::Machine, X, y) = fit!(mach.fitresult, X, y)

I'm not entirely sure what you mean, but I think your concern is that the existing definition of Machine is basically model plus data. Adding the ability to do fit!(mach, X, y) means the machine is just a wrapper of the model, not necessarily the data. Of course, models would have to define some kind of fit!(model, X, y) method for this to work; I was not implying that it would not be a new method.

I don't really see any way around this: it's not realistic to always require that all the data be kept. If you have an entire network, you can have fit!(mach, X, y) recursively call the same thing on all the nodes, with the ones that don't implement it defaulting to a no-op (though I haven't fully thought this through; it might be dangerous if some models should update but don't).
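As a rough sketch of that recursion (everything named here is hypothetical; submachines and supports_online_updates are not existing MLJ API):

# walk the learning network, updating online-capable machines in place
function online_fit!(mach::Machine, X, y)
    for sub in submachines(mach)             # hypothetical accessor
        online_fit!(sub, X, y)
    end
    if supports_online_updates(mach.model)   # hypothetical trait
        fit!(mach.fitresult, X, y)           # model-level online update
    end                                      # otherwise: no-op
    return mach
end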

So, TL;DR: my suggestion was that models would be required to implement something like fit!(model, X, y) to get online updates, and that this is the method that would update model parameters without completely resetting them. This would have the virtue of being very easy to implement for most models that can support it.

ablaom commented 1 year ago

So TL;DR my suggestion was that models would be required to implement something like fit!(model, X, y)

Yeah, we already have the stub (see above comment):

MLJModelInterface.online_update(model::Model, fitresult, verbosity, new_data...) -> (fitresult, state, report)

We just don't have any models that implement it. (And I don't like the name anymore - I'm using ingest! in a planned revamp of the interface, and allowing it to optionally mutate fitresult.)
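To make the contract concrete, here is a toy implementation of that stub (OnlineMean is invented for illustration; its fitresult is the pair (observation count, running mean)):

import MLJModelInterface

mutable struct OnlineMean <: MLJModelInterface.Unsupervised end

function MLJModelInterface.online_update(::OnlineMean, fitresult, verbosity, Xnew)
    n, μ = fitresult
    for x in Xnew
        n += 1
        μ += (x - μ) / n       # incremental (Welford-style) mean update
    end
    return (n, μ), nothing, nothing    # (fitresult, state, report)
end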

We could additionally:

How's that sound?

One question is whether this could play nicely with model composition. That might be quite tricky, and I will have to think about it some more.