JuliaAI / MLJBase.jl

Core functionality for the MLJ machine learning framework

Provide a way to get test set predictions from `evaluate` #837

ericphanson opened 1 year ago

ericphanson commented 1 year ago

Something like scikit-learn's `cross_val_predict`: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html (pointed out by @josephsdavid!)

Currently, I am doing it manually, which works fine:

```julia
using MLJ, MLJBase, DataFrames, StableRNGs

X = DataFrame(df.features)
y = df.label

stratified_cv = StratifiedCV(; nfolds=6,
                             shuffle=true,
                             rng=StableRNG(123))

tt_pairs = MLJBase.train_test_pairs(stratified_cv, 1:nrow(X), y)

cv = []
predictions = DataFrame()
for (train_indices, test_indices) in tt_pairs
    model = ...  # (model construction elided)
    mach = machine(model, X[train_indices, :], y[train_indices])
    MLJ.fit!(mach)

    push!(cv, (; machine=mach, train_indices, test_indices))

    # out-of-fold predictions for this fold's test rows
    ŷ = MLJ.predict(mach, X[test_indices, :])

    append!(predictions, hcat(df[test_indices, :], DataFrame(:prediction => ŷ)))
end
```

It would be nice if `evaluate` could give the predictions as well, since it needs to generate them anyway.

ablaom commented 1 year ago

Thanks @ericphanson for flagging this. There was a request for this a while ago by @CameronBieganek, but I can't find it just now.

This might introduce scaling issues for large datasets, in particular ones with multiple targets (think of time series, for example), and it gets worse if we are doing nested resampling, as when evaluating a `TunedModel`. So including predictions in the output of `evaluate` should probably be opt-in. Or, like scikit-learn, we could have a separate function?

Another, minor, issue is which "prediction" to return, or whether to return more than one kind. For a probabilistic predictor, some metrics require `predict_mode` (or `predict_mean`/`predict_median`) and some just `predict`. Exposing the output of `predict` makes the most sense, but I think it's possible for the user to restrict the operations to, say, just `predict_mode`, so that `predict` is never actually called. Probably the simplest design is to force the `predict` call anyway (if the return-predictions option is on) and always return that?
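In case it helps, here is a minimal illustration of that relationship (not code from this issue, and it assumes MLJ and MLJDecisionTreeInterface are installed): for a probabilistic classifier, `predict_mode` is just the mode of each distribution returned by `predict`.

```julia
# Illustration only: for a probabilistic model, `predict` returns
# distributions and `predict_mode` is post-processing of those.
using MLJ  # assumes MLJ and MLJDecisionTreeInterface are in the environment

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
mach = machine(Tree(), X, y)
fit!(mach, verbosity=0)

ŷ = predict(mach, X)            # vector of UnivariateFinite distributions
ŷ_mode = predict_mode(mach, X)  # point predictions (class labels)

ŷ_mode == mode.(ŷ)              # true: the mode of each predicted distribution
```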

The function where all this happens, which would need to add the desired predictions to its return value, is here.

ericphanson commented 1 year ago

I am not very familiar with the `predict_*` functions; are they ever more than just post-processing of `predict`? Anyway, I do see that `operations` is passed into `evaluate!`, so maybe that can determine what kind of predictions you get back?

It sounds like the most straightforward approach is to add a `return_predictions` keyword argument which, if `true`, adds an extra table (with something like row index and prediction) to the output object.
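Purely as an illustration of the shape being proposed (neither the `return_predictions` keyword nor a `predictions` field exists; both are hypothetical here):

```julia
# Hypothetical sketch only: `return_predictions` and `e.predictions`
# are not part of the current MLJ API.
e = evaluate(model, X, y;
             resampling=StratifiedCV(nfolds=6, shuffle=true, rng=StableRNG(123)),
             measure=log_loss,
             return_predictions=true)

e.predictions  # e.g. a table with columns like :row, :fold, :prediction
```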

However, that kind of design always feels like perhaps we aren't "inverting control to the caller", and that a more compositional flow might be better overall. E.g. I could imagine `evaluate` being implemented as the simple composition of training over folds, predicting over folds, and evaluating those predictions with metrics, and exposing each layer with an API function.

ablaom commented 1 year ago

> However, that kind of design always feels like perhaps we aren't "inverting control to the caller", and that a more compositional flow might be better overall. E.g. I could imagine `evaluate` being implemented as the simple composition of training over folds, predicting over folds, and evaluating those predictions with metrics, and exposing each layer with an API function.

Yes, a compositional approach sounds better. I probably don't have the bandwidth for that kind of a refactor but if someone else was interested...
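To make the compositional idea a bit more concrete, here is a rough user-level sketch built only from existing MLJBase pieces; the name `cross_val_predict` and its signature are made up for illustration, not an existing API.

```julia
using MLJBase  # sketch only; `cross_val_predict` is not an MLJ/MLJBase function

function cross_val_predict(model, X, y, resampling; operation=predict)
    pairs = MLJBase.train_test_pairs(resampling, 1:length(y), y)
    mach = machine(model, X, y)
    ŷ = Vector{Any}(undef, length(y))
    for (train, test) in pairs
        fit!(mach, rows=train, verbosity=0)   # retrain on this fold's train rows
        ŷ[test] = operation(mach, rows=test)  # collect out-of-fold predictions
    end
    return ŷ
end

# e.g. cross_val_predict(model, X, y, StratifiedCV(nfolds=6); operation=predict_mode)
```

Something like `evaluate!` could then, in principle, be layered on top of a predicting-over-folds step like this.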

ablaom commented 1 year ago

I'm curious, what is your use case for collecting the out-of-sample predictions? Are you doing some kind of model stacking, perhaps? We do have `Stack` for that.

ericphanson commented 1 year ago

No, I just want to do my own evaluation on the predictions. In this case, I have multichannel data, and my model is trained to work on each channel independently. But in addition to evaluating on that task, I also want to combine predictions over channels and then evaluate the aggregated results. I could probably do this by formulating a new composite model (I think?), but if I could just get the predictions directly, I could do whatever evaluation I want.

I have also come across this need at other times, e.g. when I want to plot prediction vs. label for my whole dataset (which can be important if you don't have a lot of data). CV lets you get useful predictions for all data points, even if there are really n_folds different models supplying them.

Another case is when you want to evaluate on different stratifications of the data. E.g. what if I wanted to know how my performance varies by channel (on models trained on all channels; I don't want to move one channel entirely into the test set, for example)? If I have all the predictions, it's easy to do any kind of evaluation needed.
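For instance, given a `predictions` table like the one assembled in the first comment, a per-channel breakdown is just a group-by. This is a sketch only: it assumes `df` has a `:channel` column (not shown above) and that `:prediction` holds point predictions (e.g. from `predict_mode`).

```julia
# Sketch only: assumes `predictions` from the earlier loop, a :channel
# column in df, and point predictions in :prediction.
using DataFrames, MLJ

per_channel = combine(groupby(predictions, :channel)) do g
    (; accuracy = accuracy(g.prediction, g.label))
end
```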

BenjaminDoran commented 1 year ago

Just wanted to add that I would also find it very helpful to be able to access the out-of-fold predictions from `evaluate`, for the same reasons listed by Eric.

ablaom commented 1 month ago

Just a note that this is more doable now that we have separate `PerformanceEvaluation` and `CompactPerformanceEvaluation` types. Target predictions could be recorded in the first case but dropped in the second. A kwarg `compact` controls which is returned by `evaluate!`.
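For reference, this is roughly how the `compact` switch mentioned above looks from the caller's side; any field actually carrying the out-of-sample predictions would still be new (hypothetical) functionality.

```julia
# `compact=false` requests the richer PerformanceEvaluation object;
# a field holding out-of-sample target predictions does not exist yet.
# model, X, y as in the earlier examples.
e = evaluate(model, X, y;
             resampling=StratifiedCV(nfolds=6, rng=123),
             measure=log_loss,
             compact=false)

typeof(e)  # PerformanceEvaluation (compact=true gives CompactPerformanceEvaluation)
```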