JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Brainstorming: API for meta-data/side-information on the features #480

Open nignatiadis opened 4 years ago

nignatiadis commented 4 years ago

This follows a discussion on Slack and @ablaom's suggestion to open an issue to brainstorm ideas.

The problem: Consider a machine learning model with p-dimensional features X. Now assume that for each feature j the analyst has access to external information Z(j), and furthermore that this information can potentially be used by machine learning models to improve predictive performance. A concrete example of such external information would be a categorical Z(j), which induces a partition of the features into groups of related features.

How could the MLJ API account for that?

Example use cases for grouping structure

azev77 commented 4 years ago

@nignatiadis this sounds intriguing. Can you help me understand w/ a more precise example?

I study finance a little. Suppose each observation (row) corresponds to a mortgage borrower. X= FICO, Income, House Price, Loan Size, Monthly Payment (all at origination) Y= 1 (if borrower defaulted in two years), =0 (else)

From previous experience (& other studies) I know that:

  1. Lower FICO score borrowers are considered risky, borrowers w/ FICO <620 are particularly risky.
  2. Borrowers w/ DTI:= (Payment)/(Monthly Income) >36% are particularly risky.
  3. Borrowers w/ LTV:=(Loan)/(House Price) >80% are particularly risky.

Q: are these thresholds for FICO, DTI, LTV examples of "co-data" that I can use to help predict Y?

Naturally, I would use raw X to create indicators 1{FICO<620}, ratios (DTI, LTV), and interactions. How is this different from a structured approach to feature engineering?
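A minimal sketch of the hand-crafted features described above, in plain Julia (the variable names and numeric values are illustrative, not from any real dataset):

```julia
# Illustrative values for a single borrower (all hypothetical).
fico, income, price, loan, payment = 600.0, 5_000.0, 300_000.0, 250_000.0, 2_000.0

dti = payment / income        # debt-to-income ratio
ltv = loan / price            # loan-to-value ratio
x_lowfico = fico < 620        # indicator 1{FICO < 620}
risky = x_lowfico | (dti > 0.36) | (ltv > 0.80)   # any threshold exceeded
```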

ablaom commented 4 years ago

I guess a key question from the API point-of-view is what the interface point for the metadata will be.

@nignatiadis Perhaps this metadata could simply be passed as a model hyperparameter? This would mean the existing API already supports such models. One thing that bothers me somewhat is having hyperparameters that make explicit reference to characteristics of the data (eg, feature names). This is not disallowed, but I think it makes model composition less flexible, for example.

Another alternative is to pass the metadata along with the input and target (and weights), as in machine(model, X, feature_metadata, y) or machine(model, X, feature_metadata, y, w). In principle, a supervised model can have an arbitrary number of training arguments. In practice there may be complications with type checks (which assume only two arguments, X, y, or three, X, y, w). And in learning networks we would have to add a new kind of source node: we have :input, :target, and :weights, and would now add :feature_metadata or whatever.

So I guess this could be done.

ablaom commented 4 years ago

I propose that, to justify the necessary changes (for either alternative), we need someone willing to contribute an implementation of MLJ's model interface for some "feature metadata" model. I'm willing to provide guidance to any such person.

nignatiadis commented 4 years ago

@ablaom that makes a lot of sense. I may contribute a method that uses grouping metadata with an MLJ interface, but am not sure what the timeline will be (still figuring out the math stat theory for it). Here are some thoughts in the meantime:

Of course, grouping is very simple side-information, so your first alternative may be good enough (and is what I had been thinking to implement as a first pass). A more elaborate interface would become useful once the side-information gets more complicated. But perhaps even with grouping it could be useful to have a more elaborate interface: say, if you plan to derive the groups by discretizing continuous feature metadata.

There are also some types of "metadata" that can be computed from the data itself: if you run ridge regression, typically a first step is to use a Standardizer. But maybe the variance of each feature is actually informative of its signal. Van de Wiel et al. propose to group features by discretizing on the variance and then running ridge regression with a different regularization penalty for each group.
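A sketch of that idea in plain Julia, assuming the "co-data" is simply a variance-based two-way split of the features (the data here is synthetic and the median cutoff is an arbitrary choice for illustration):

```julia
using Statistics

# Synthetic data: the last three features have much larger variance.
X = randn(100, 6) .* [1.0 1.0 1.0 5.0 5.0 5.0]

vars = vec(var(X; dims = 1))                   # per-feature sample variance
cutoff = quantile(vars, 0.5)                   # split at the median variance
groups = [v <= cutoff ? 1 : 2 for v in vars]   # low- vs high-variance group
```

Each group would then receive its own ridge penalty in a subsequent fit.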

I like your second proposal! Two more observations: in a sense MLJ already supports feature metadata through scitypes (but not "instantiated" metadata). Furthermore, weights could be thought of as subject-level metadata.

@azev77: I think for the type of domain knowledge you are describing, your approach is the right way to go, rather than feature metadata. However, what we are discussing here could enter your model as follows: you could designate your hand-crafted features as "important" and the rest as "unknown". Then you could fit a penalized regression model that does not penalize the "important" features but only the "unknown" ones (with a data-driven choice of regularization parameter). At a high level, you can think of feature metadata as data that corresponds to each column: say, a p*q matrix, where p is the number of features.
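The partial-penalization idea can be sketched in a few lines of plain Julia (this is not an MLJ API; the closed-form ridge solve with a per-feature penalty vector is just the simplest way to illustrate it):

```julia
using LinearAlgebra

# Ridge with a per-feature penalty: beta = (X'X + Diagonal(penalties)) \ X'y.
ridge(X, y, penalties) = (X'X + Diagonal(penalties)) \ (X'y)

X, y = randn(200, 5), randn(200)
important = [true, true, false, false, false]        # hand-crafted features first
penalties = [imp ? 0.0 : 10.0 for imp in important]  # penalize only "unknown" ones
coefs = ridge(X, y, penalties)
```

In practice the penalty for the "unknown" block would be chosen by cross-validation rather than fixed at 10.0.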

nignatiadis commented 3 years ago

Here is an example package with a MLJ model that partitions features into groups: SigmaRidgeRegression.jl. I will try to add a native Julia group lasso implementation at some point as well.

Following @ablaom's suggestion, the interface point for describing the groups is through a model hyperparameter called groups. This suffices for my use case (though indeed this would be ugly for model composition, but I have not had this need so far).

ablaom commented 3 years ago

@nignatiadis Congratulations on the new package 🎉 This looks like a substantial piece of coding. Very excited for MLJ to add these models! Appreciate you making the effort to wrap your head around our API!

Since my earlier post there have been a number of changes in a direction that makes my "alternative 2" straightforward. So, if you think it makes more sense, you could drop the groups hyperparameter and instead extend your fit signature to fit(model, verbosity, X, y, g). The user would then construct their machine as machine(model, X, y, g).

It will be up to you to document what form g should take. There is, however, one caveat: it cannot be an AbstractVector or an AbstractMatrix, and it cannot be a table (satisfy Tables.istable(g) == true). Otherwise, MLJ will try to do row-selection on g in resampling, which you don't want. So you could require g to be a Tuple, for example.
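Putting the pieces together, a hypothetical sketch of "alternative 2" might look like this. `GroupRidgeRegressor` and its `lambdas` field are illustrative names, and the closed-form per-group ridge stands in for whatever the real model does; the point is only the extra `g::Tuple` training argument:

```julia
import MLJModelInterface as MMI
using LinearAlgebra

# Hypothetical model: one ridge penalty per feature group.
mutable struct GroupRidgeRegressor <: MMI.Deterministic
    lambdas::Vector{Float64}
end

# The group labels `g` arrive as an extra training argument, wrapped in a
# Tuple so that MLJ does not attempt row-resampling on them.
function MMI.fit(model::GroupRidgeRegressor, verbosity, X, y, g::Tuple)
    Xm = X isa AbstractMatrix ? X : MMI.matrix(X)
    penalties = [model.lambdas[gj] for gj in g]     # per-feature penalty
    coefs = (Xm'Xm + Diagonal(penalties)) \ (Xm'y)
    return coefs, nothing, nothing                  # fitresult, cache, report
end

MMI.predict(::GroupRidgeRegressor, coefs, Xnew) =
    (Xnew isa AbstractMatrix ? Xnew : MMI.matrix(Xnew)) * coefs
```

On the user side this would correspond to something like machine(GroupRidgeRegressor([1.0, 10.0]), X, y, (1, 1, 2, 2)).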

The form of X and y (on the user/machine side) will still be constrained by the input_scitype and target_scitype you declare. It sounds like a common use-case includes very large numbers of features. I suggest that you permit the user to provide either a table or a matrix. You do this with something like

```julia
MMI.input_scitype(::Type{<:LooSigmaRidgeRegressor}) =
    Union{AbstractMatrix{MMI.Continuous},Table(MMI.Continuous)}
```

(Your target scitype, for regression, is AbstractVector{Continuous}.) It will be up to your fit method to detect which it is receiving, table or matrix, unless you implement a "data front end", in which case your reformat method takes care of dispatch.
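One minimal way to do that detection inside fit is a small helper along these lines (a sketch, assuming the model works on matrices internally):

```julia
import MLJModelInterface as MMI

# Pass matrices through unchanged; convert any Tables.jl-compatible table.
to_matrix(X) = X isa AbstractMatrix ? X : MMI.matrix(X)
```

For example, `to_matrix((x1 = [1.0, 2.0], x2 = [3.0, 4.0]))` yields a 2×2 `Matrix{Float64}`, while a matrix input is returned as-is with no copy.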

The new optional "data front-end" prevents a lot of unnecessary conversions to tables and back again. This could be bells and whistles you add later, however. In any case, happy to provide guidance. See the docs and the sample implementation.
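A sketch of what such a front-end could look like for a hypothetical model type (`MyRidgeRegressor` is an illustrative name): `reformat` converts user-supplied data to the internal representation once, and `selectrows` resamples that representation directly, so cross-validation never round-trips through tables.

```julia
import MLJModelInterface as MMI

struct MyRidgeRegressor <: MMI.Deterministic end   # hypothetical model

# Convert once, up front: tables become matrices; matrices pass through.
MMI.reformat(::MyRidgeRegressor, X, y) =
    (X isa AbstractMatrix ? X : MMI.matrix(X), y)

# Resample the already-converted representation (a view avoids copying).
MMI.selectrows(::MyRidgeRegressor, I, Xm, y) = (view(Xm, I, :), y[I])
```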

fkiraly commented 3 years ago

Random comment, since I'm also interested in the answer in general: in terms of a conceptual model, it might make sense to distinguish side-information that is intrinsic to a particular dataset from side-information that is part of the model specification.

Using domain-driven design and the common conventions, it seems more natural to attach the first to the data container (explicitly as part of it, or implicitly as additional information passed along whenever the data is passed, e.g., in fit), while it seems more natural to pass the second as part of the model specification, e.g., in the constructor or as a hyperparameter.

Would be interested to hear your opinion, @ablaom, on whether there is such a reasonable distinction, and if yes how to deal with it.

ablaom commented 3 years ago

@fkiraly I agree we should make this distinction. In the @nignatiadis use case, the groups are intrinsic to the data, in the sense that it is nonsensical to define them before seeing the data. So it makes sense to move them from a hyperparameter to the "data container", which here just means the last argument to fit (splatted), as I am now suggesting.

nignatiadis commented 3 years ago

@ablaom that sounds great, I like the new interface point for the groups a lot and will update my package accordingly (I am already using a specialized type to represent the groups that satisfies the properties you mentioned). It is hard to keep up with the stream of amazing (and non-breaking) updates to MLJ! Thank you for all the deep thought and work you put into MLJ!

Also I think it is great that the conversion to table is no longer needed and one can use the matrix directly. Right now my package is doing type piracy to avoid slowdown when the number of features is large (https://github.com/alan-turing-institute/MLJBase.jl/issues/428#issuecomment-708141459). If I understand correctly, I can remove the pirated code now, as long as I make sure to pass matrices instead of tables.

ablaom commented 3 years ago

@nignatiadis Thanks for the positive feedback. Very much appreciated.

> If I understand correctly, I can remove the pirated code now, as long as I make sure to pass matrices instead of tables.

My preference is that you allow for both tables and matrices: the first because that is what users have come to expect as allowed input; the second as a (long-term) "workaround" for shortcomings in the Tables API. So the user can use either, but uses very wide tables at her own risk.

Is that possible?

nignatiadis commented 3 years ago

Yes that is precisely what I meant to do! Keep allowing tables and deal with the case of a very large number of features through the matrix interface point (instead of committing type piracy).

nignatiadis commented 3 years ago

Motivated by @fkiraly's remark, I am also wondering how one would deal with the following case (this is not something I need right now, though): suppose the features can be partitioned into K groups. One could then run ridge regression (or the lasso) with a different penalty level λ for the features in each group. In this case, the number of hyperparameters depends on the number of groups. If one then wants to use TunedModel, say, to tune these K λs, it seems more natural to have the (number of) groups as part of the model (but maybe not?).