Hey @bfolie, I can get to this later today after I finish some slides. But I admittedly am not up to speed on this code base, so it may take me a while to leave any helpful remarks.
> Unfortunately, the existing paradigm around multitask learning assumes that the learner outputs one model for each label. There are several places where this is enforced, either explicitly or subtly (and an exception will occur if this requirement is not met).
>
> Is making a parallel series of objects for multi-task combined learning the right thing to do, or is there a way to reorganize the code to remove the "one model per label" assumption?
At a high level, removing the "one model per label" assumption seems like the natural approach to me. This could coincide with an abstraction of MultiTaskLearner to be the parent trait, with MultiTaskIndependentLearner and MultiTaskCorrelatedLearner subtypes for the two cases in question. And as per my comments below, it seems like many of the other types (MultiModel, etc.) could also follow this hierarchy.

I am unsure how feasible this is, and do recognize it would be a large refactor. But from a logical perspective, this seems like the right choice to me.
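For concreteness, a minimal sketch of the hierarchy being suggested here (the subtype names are the ones proposed above; no signatures are implied, and none of this is existing code):

```scala
// Sketch of the suggested hierarchy only; signatures intentionally omitted.
trait MultiTaskLearner                                      // common parent trait
trait MultiTaskIndependentLearner extends MultiTaskLearner  // trains one model per label
trait MultiTaskCorrelatedLearner extends MultiTaskLearner   // trains a single combined model
```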
The first thing I tried was to make a single MultiTaskLearner trait and then sub-type it into a "makes N models" trait and a "makes one model" trait. But in that case, the abstraction doesn't get you anything. The only method these learners share is train, but what's the signature? The parameters are clear:

def train(inputs: Seq[Vector[Any]], labels: Seq[Seq[Any]], weights: Option[Seq[Double]] = None)

But what's the return type? Is it Seq[TrainingResult] or TrainingResult? If it's the former, then the "makes one model" trait ends up with implicit logic of the form "we assume that this method returns a sequence of length 1". And there's no shared code to make it worthwhile. But the latter is incompatible with the "makes N models" trait. I suppose in principle one could make no assumptions about which training results correspond to which labels, and include machinery as part of the trait that does the mapping, but that would be needlessly complex. We have no need of a situation in which model 1 predicts the first 3 labels, model 2 predicts the fourth label, and model 3 predicts the fifth and sixth labels. So this is a bad idea.
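To make the dilemma concrete, here is a sketch of the two candidate signatures (the Variant suffixes exist only to keep the sketch compilable, and TrainingResult is a stand-in for the existing lolo type; neither variant is real code):

```scala
// Stand-in for the existing lolo type.
trait TrainingResult

// Option 1: train returns one result per label. The "makes one model" subtype
// then carries the unstated convention "this sequence always has length 1".
trait MultiTaskLearnerSeqVariant {
  def train(
      inputs: Seq[Vector[Any]],
      labels: Seq[Seq[Any]],
      weights: Option[Seq[Double]] = None
  ): Seq[TrainingResult]
}

// Option 2: train returns a single result. This is incompatible with the
// "makes N models" subtype, which needs to hand back one result per label.
trait MultiTaskLearnerSingleVariant {
  def train(
      inputs: Seq[Vector[Any]],
      labels: Seq[Seq[Any]],
      weights: Option[Seq[Double]] = None
  ): TrainingResult
}
```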
There's a lot of value in enforcing the interface "train outputs a single TrainingResult", at least for correlated multitask learners. It allows for the correlation method to be exposed, and it's just more specific than a sequence, so there are fewer assumptions in the code.

Another thing that's valuable, and that gets lost by having some multitask models output Seq[TrainingResult] and some output TrainingResult, is composability. The multitask version of the bagger, feature rotator, and feature standardizer learners all take some other multitask learner as a parameter. If we mix and match training behaviors, then that falls apart. Hence the need for a parallel set of classes.
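As a sketch of why composability depends on a consistent return type (the wrapper below is illustrative, not the real MultiTaskBagger):

```scala
// Illustrative stand-ins for the existing lolo types.
trait TrainingResult
trait MultiTaskLearner {
  def train(
      inputs: Seq[Vector[Any]],
      labels: Seq[Seq[Any]],
      weights: Option[Seq[Double]] = None
  ): TrainingResult
}

// A wrapper like the multitask bagger takes another multitask learner as a
// parameter and delegates to it; that delegation only typechecks because every
// MultiTaskLearner promises the same return type for train.
class SketchMultiTaskBagger(method: MultiTaskLearner) extends MultiTaskLearner {
  def train(
      inputs: Seq[Vector[Any]],
      labels: Seq[Seq[Any]],
      weights: Option[Seq[Double]] = None
  ): TrainingResult = {
    // A real bagger would resample the training data, train on each bag, and
    // aggregate the resulting models; here we just delegate once.
    method.train(inputs, labels, weights)
  }
}
```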
I'm thinking that the best course of action is to have all multitask learners output a single training result. The interface doesn't need to care about the details of how those models are trained or how they make predictions; it just needs to enforce how the information is presented. It's general, because a set of N independent models can easily be wrapped into a single model. And it's a stricter return type than a Seq is. If there's need for N independent models, then we could expose a method, asIndependentModels, that spits out N models for N labels.
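A sketch of the model-side half of that proposal (asIndependentModels is the method name floated above; Model is a placeholder stand-in, and nothing here is current code):

```scala
// Placeholder stand-ins; not current lolo code.
trait Model

// If train always returns a single combined result, the model it yields can
// still be unpacked on demand into one model per label.
trait MultiTaskModel extends Model {
  def numLabels: Int
  def asIndependentModels(): Seq[Model] // N models for N labels
}
```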
The downside is that this is a breaking change to the MultiTaskLearner interface. But it wouldn't necessarily be a big logical change. I'll give it a try.
I refactored the existing MultiTaskLearner to return a single MultiTaskTrainingResult after training, instead of a sequence, and I think the result is much better. There's very little duplicated logic between the multitask and single-task learners, and the multitask versions of the FeatureRotator and Standardizer now play nicely with the ParallelModels class, which again leads to only a small amount of code duplication.
My biggest concern with this approach is that it threatens to layer an increasingly complicated contraption atop a foundation not designed to bear its weight: for example, what is there to reuse when it comes time to build non-Gaussian uncertainty distributions? ... To me, this PR is a convincing case for restructuring the code on the basis of first-class support for abstract probability distributions as both inputs and outputs: rather than acting on and returning Seq[Any], let models act on and return Distribution objects that may include Gaussian and UncorrelatedGaussian as special cases.
I disagree on several fronts.
For me, putting this together did highlight the issues in Lolo around the use of Any. We lean on it in many places to allow for Double/Char/Boolean, which makes it difficult to build against the interfaces. But that's not a new issue, and if anything I think that this PR is a slight improvement in a few areas, because it is sometimes able to be more explicit about where Doubles are expected and where Sequences are expected.
In my opinion, much of the complication in the first draft was related to the fact that multitask learners were returning sequences of training results, which makes them different from single-task learners, and also obscures an assumption (that one training result corresponds to one label). By refactoring MultiTaskLearner to return a single MultiTaskTrainingResult, I think it slots in fairly well along with the single-task learners.
Regarding non-Gaussian distributions, I don't see those as being a concern. The theory that we have (Wager et al.) is focused on estimating the standard error of ensemble predictions, and that's what we report as the uncertainty. That theory can be extended to compute a correlation coefficient between two uncertainties, which we will expose using this new interface. Drawing an error bar or slapping a (multivariate) Gaussian on the prediction is certainly an approximation, but:
a. it's not an approximation that Lolo is making (software that consumes Lolo may make that approximation), and
b. we have no good reason, at the moment, to make any other approximation.
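As a toy illustration of the "extended to a correlation coefficient" point (this is just a plain sample correlation over ensemble members at one prediction point, not the bias-corrected Wager et al. estimator, and not code from this PR):

```scala
// Toy sketch: the per-tree predictions that give a per-label standard error
// also give a cross-label covariance, and hence a correlation coefficient.
def naiveUncertaintyCorrelation(predsI: Seq[Double], predsJ: Seq[Double]): Double = {
  require(predsI.length == predsJ.length && predsI.length > 1)
  val n = predsI.length
  val meanI = predsI.sum / n
  val meanJ = predsJ.sum / n
  val cov = predsI.zip(predsJ).map { case (a, b) => (a - meanI) * (b - meanJ) }.sum / n
  val varI = predsI.map(a => (a - meanI) * (a - meanI)).sum / n
  val varJ = predsJ.map(b => (b - meanJ) * (b - meanJ)).sum / n
  cov / math.sqrt(varI * varJ) // undefined if either label's ensemble variance is zero
}
```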
If we knew that this Gaussian approximation was a significant issue and we had some compelling alternative theory to generate confidence intervals, then we could plan around a more generic Distribution type. But neither of those things are true, so it would be foolish to prematurely try and build for it. At the moment, getUncertainty always returns a Double for real-valued predictions. So I think it's safe and meaningful to make getUncertaintyCorrelation also return a Double. It hardens the contract around current behavior. Switching to distributions would be a complete re-creation. We wouldn't be digging out of the current hole; we'd be teleporting to the top and digging a new hole. So might as well dig our current hole deeper and spruce it up a little.
Let me get the tl;dr out of the way first: I agree re-creation is a separate scope.
> I disagree on several fronts.
>
> 1. With the most recent changes, I don't think that this PR adds an unwieldy layer
I don't mean to say that this PR is unwieldy. On the contrary, I think it does a great job of meeting the codebase where it is, with a new need. Sorry if that came across otherwise. Rather, I am concerned about the long-term effect of building multivariate features on top of machinery that was engineered only to serve univariate cases. As you highlighted, there was code smell associated with introducing parallel interfaces that shared much of the same code; the same is true for the parallel maintenance of learners and multi-task learners. But I admit my opinion about the topic in front of us, which proved pessimistic, was formed before giving a good look at your latest round of changes. Updating the interface with MultiTaskTrainingResult was a big improvement.
> 2. I don't think that this work makes non-Gaussian distributions more urgent
Agreed.
> 3. I don't think that non-Gaussian distributions need to be supported
I don't agree. I certainly don't think they need to be supported by this PR, but I do consider them a likely target for future work, so I want that to be kept in mind when we update these areas of the code.
> If we knew that this Gaussian approximation was a significant issue and we had some compelling alternative theory to generate confidence intervals, then we could plan around a more generic Distribution type. But neither of those things are true, so it would be foolish to prematurely try and build for it. At the moment, getUncertainty always returns a Double for real-valued predictions.
There is literature that motivates the use of non-Gaussian multivariate predictive distributions, with particular application to small data like we face; one particular model type we may want to consider comes with uncertainty quantification developed from first principles. I'll talk to you about that separately.
A high level comment: I find some of the naming conventions that were already present quite hard to infer. For instance, BaggedMultiResult is a PredictionResult[Double] that does not implement your new MultiTaskModelPredictionResult, whereas MultiTaskBaggedResult does.
I agree. When I first started looking into the existing code, it took me some time to realize that BaggedMultiResult referred to multiple predictions, not multiple labels. I suppose we can rename it now, since we're already making breaking changes.

Any suggestions? BaggedMultiPredictionResult could work, but that could also be interpreted as multiple labels (a single "multi-prediction"). What about SinglePredictionBaggedResult and MultiPredictionBaggedResult?
> What about SinglePredictionBaggedResult and MultiPredictionBaggedResult?
I like this option. It makes it clear that BaggedSingleResult and BaggedMultiResult are their own concept, compared to MultiTaskBaggedResult, which is a separate case.
PLA-8517
The current approach to multitask learning is to choose the decision tree splits once by taking all labels into account, but return a sequence of models, one for each label. Because these models are independent objects, we lose the ability to potentially compute uncertainty correlation information. The goal of this PR is to expose a getUncertaintyCorrelation method in a type-sensible way.

Unfortunately, the existing paradigm around multitask learning assumes that the learner outputs one model for each label. There are several places where this is enforced, either explicitly or subtly (and an exception will occur if this requirement is not met).
The approach I've tried here is to create a parallel series of objects, starting with MultiTaskCombinedLearner, which is a spiritual peer to MultiTaskLearner, but instead of training outputting a Seq[TrainingResult], it outputs a single MultiModelTrainingResult. This yields a MultiModel, which makes predictions that are MultiModelPredictionResult. The key type enforcement is that MultiModelPredictionResult predicts Seq[Any]. This allows us to include some information and methods in MultiModel that reflect the fact that predictions correspond to multiple labels.

With this type structure, the getUncertaintyCorrelation method is sensible for MultiModelPredictionResult, and any training that starts with a MultiTaskCombinedLearner is guaranteed to result in predictions for which the getUncertaintyCorrelation method applies (although it might return None). The method takes two arguments, indices i and j, and returns an optional sequence of correlation coefficients for each prediction.
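A rough sketch of that structure (the type names are the ones introduced in this PR, but the parent traits and method signatures shown here are approximations, not the actual code):

```scala
// Approximate sketch; TrainingResult, Model, and PredictionResult stand in for
// the existing lolo types, and the signatures are illustrative.
trait TrainingResult
trait PredictionResult[+T] { def getExpected(): Seq[T] }
trait Model[+T <: PredictionResult[Any]] { def transform(inputs: Seq[Vector[Any]]): T }

// Training with a MultiTaskCombinedLearner yields a single training result...
trait MultiTaskCombinedLearner {
  def train(
      inputs: Seq[Vector[Any]],
      labels: Seq[Seq[Any]],
      weights: Option[Seq[Double]] = None
  ): MultiModelTrainingResult
}

// ...whose model predicts all of the labels at once...
trait MultiModelTrainingResult extends TrainingResult { def getModel(): MultiModel }
trait MultiModel extends Model[MultiModelPredictionResult]

// ...so the prediction result predicts Seq[Any] and can expose cross-label
// information, such as the uncertainty correlation between labels i and j
// (one coefficient per prediction, or None if unavailable).
trait MultiModelPredictionResult extends PredictionResult[Seq[Any]] {
  def getUncertaintyCorrelation(i: Int, j: Int): Option[Seq[Double]]
}
```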
A major drawback of this approach is that it creates parallel sets of objects that share much of their logic with existing objects, even if they're not connected. Most of the shared logic is abstracted into mix-in traits, specifically AbstractMultiTaskBagger and MultiTaskTree, so there's little repetition of logic. But it creates boilerplate and doesn't scale easily. If we want the Standardizer or FeatureRotator learners to be applicable to MultiTaskCombinedLearner, then we'll have to create new versions of the learner/training result/model/prediction result objects. There will be very little logic in these objects, but it's clunky.

Unfortunately, since the "multitask learners produce one model per label" assumption is pervasive, I'm finding it difficult to come up with a structure that doesn't result in so much boilerplate. For example, MultiTaskStandardizer makes this assumption when it zips the results of baseLearner.train together with the outputs.

Outstanding question: getUncertaintyCorrelation()?

Outstanding issues for this or follow-up PRs:
- getUncertaintyCorrelation
- Standardizer and FeatureRotator work with multi-task combined models (Standardizer is especially important because we want the inputs to be on the same scale when the splits are being calculated).