Hey @bfolie, I can get to this later today after I finish some slides. But I admittedly am not up to speed on this code base, so it may take me a while to leave any helpful remarks.
> Unfortunately, the existing paradigm around multitask learning assumes that the learner outputs one model for each label. There are several places where this is enforced, either explicitly or subtly (and an exception will occur if this requirement is not met).
>
> Is making a parallel series of objects for multi-task combined learning the right thing to do, or is there a way to reorganize the code to remove the "one model per label" assumption?
At a high level, removing the "one model per label" assumption seems like the natural approach to me. This could coincide with an abstraction of MultiTaskLearner to be the parent trait, with MultiTaskIndependentLearner and MultiTaskCorrelatedLearner subtypes for the two cases in question. And as per my comments below, it seems like many of the other types (MultiModel, etc.) could also follow this hierarchy.

I am unsure how feasible this is, and do recognize it would be a large refactor. But from a logical perspective, this seems like the right choice to me.
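For concreteness, a minimal sketch of the hierarchy being suggested here (the subtype names are the ones proposed above; no signatures are implied, and none of this is existing code):

```scala
// Sketch of the suggested hierarchy only; signatures intentionally omitted.
trait MultiTaskLearner                                      // common parent trait
trait MultiTaskIndependentLearner extends MultiTaskLearner  // trains one model per label
trait MultiTaskCorrelatedLearner extends MultiTaskLearner   // trains a single combined model
```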
The first thing I tried was to make a single MultiTaskLearner trait and then sub-type it into a "makes N models" trait and a "makes one model" trait. But in that case, the abstraction doesn't get you anything. The only method these learners share is train, but what's the signature? The parameters are clear:

def train(inputs: Seq[Vector[Any]], labels: Seq[Seq[Any]], weights: Option[Seq[Double]] = None)

But what's the return type? Is it Seq[TrainingResult] or TrainingResult? If it's the former, then the "makes one model" trait ends up with implicit logic of the form "we assume that this method returns a sequence of length 1". And there's no shared code to make it worthwhile. But the latter is incompatible with the "makes N models" trait. I suppose in principle one could make no assumptions about which training results correspond to which labels, and include machinery as part of the trait that does the mapping, but that would be needlessly complex. We have no need of a situation in which model 1 predicts the first 3 labels, model 2 predicts the fourth label, and model 3 predicts the fifth and sixth labels. So this is a bad idea.
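To make the dilemma concrete, here is a sketch of the two candidate signatures (the Variant suffixes exist only to keep the sketch compilable, and TrainingResult is a stand-in for the existing lolo type; neither variant is real code):

```scala
// Stand-in for the existing lolo type.
trait TrainingResult

// Option 1: train returns one result per label. The "makes one model" subtype
// then carries the unstated convention "this sequence always has length 1".
trait MultiTaskLearnerSeqVariant {
  def train(
      inputs: Seq[Vector[Any]],
      labels: Seq[Seq[Any]],
      weights: Option[Seq[Double]] = None
  ): Seq[TrainingResult]
}

// Option 2: train returns a single result. This is incompatible with the
// "makes N models" subtype, which needs to hand back one result per label.
trait MultiTaskLearnerSingleVariant {
  def train(
      inputs: Seq[Vector[Any]],
      labels: Seq[Seq[Any]],
      weights: Option[Seq[Double]] = None
  ): TrainingResult
}
```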
There's a lot of value in enforcing the interface "train outputs a single TrainingResult", at least for correlated multitask learners. It allows for the correlation method to be exposed, and it's just more specific than a sequence, so there are fewer assumptions in the code.

Another thing that's valuable, and that gets lost by having some multitask models output Seq[TrainingResult] and some output TrainingResult, is composability. The multitask version of the bagger, feature rotator, and feature standardizer learners all take some other multitask learner as a parameter. If we mix and match training behaviors, then that falls apart. Hence the need for a parallel set of classes.
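As a sketch of why composability depends on a consistent return type (the wrapper below is illustrative, not the real MultiTaskBagger):

```scala
// Illustrative stand-ins for the existing lolo types.
trait TrainingResult
trait MultiTaskLearner {
  def train(
      inputs: Seq[Vector[Any]],
      labels: Seq[Seq[Any]],
      weights: Option[Seq[Double]] = None
  ): TrainingResult
}

// A wrapper like the multitask bagger takes another multitask learner as a
// parameter and delegates to it; that delegation only typechecks because every
// MultiTaskLearner promises the same return type for train.
class SketchMultiTaskBagger(method: MultiTaskLearner) extends MultiTaskLearner {
  def train(
      inputs: Seq[Vector[Any]],
      labels: Seq[Seq[Any]],
      weights: Option[Seq[Double]] = None
  ): TrainingResult = {
    // A real bagger would resample the training data, train on each bag, and
    // aggregate the resulting models; here we just delegate once.
    method.train(inputs, labels, weights)
  }
}
```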
I'm thinking that the best course of action is to have all multitask learners output a single training result. The interface doesn't need to care about the details of how those models are trained or how they make predictions; it just needs to enforce how the information is presented. It's general, because a set of N independent models can easily be wrapped into a single model. And it's a stricter return type than a Seq is. If there's need for N independent models, then we could expose a method, asIndependentModels, that spits out N models for N labels.
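A sketch of the model-side half of that proposal (asIndependentModels is the method name floated above; Model is a placeholder stand-in, and nothing here is current code):

```scala
// Placeholder stand-ins; not current lolo code.
trait Model

// If train always returns a single combined result, the model it yields can
// still be unpacked on demand into one model per label.
trait MultiTaskModel extends Model {
  def numLabels: Int
  def asIndependentModels(): Seq[Model] // N models for N labels
}
```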
The downside is that this is a breaking change to the MultiTaskLearner interface. But it wouldn't necessarily be a big logical change. I'll give it a try.
I refactored the existing MultiTaskLearner to return a single MultiTaskTrainingResult after training, instead of a sequence, and I think the result is much better. There's very little duplicated logic between the multitask and single-task learners, and the multitask versions of the FeatureRotator and Standardizer now play nicely with the ParallelModels class, which again leads to only a small amount of code duplication.
My biggest concern with this approach is that it threatens to layer an increasingly complicated contraption atop a foundation not designed to bear its weight: for example, what is there to reuse when it comes time to build non-Gaussian uncertainty distributions? ... To me, this PR is a convincing case for restructuring the code on the basis of first-class support for abstract probability distributions as both inputs and outputs: rather than acting on and returning Seq[Any], let models act on and return Distribution objects that may include Gaussian and UncorrelatedGaussian as special cases.
I disagree on several fronts.
For me, putting this together did highlight the issues in Lolo around the use of Any. We lean on it in many places to allow for Double/Char/Boolean, which makes it difficult to build against the interfaces. But that's not a new issue, and if anything I think that this PR is a slight improvement in a few areas, because it is sometimes able to be more explicit about where Doubles are expected and where Sequences are expected.
In my opinion, much of the complication in the first draft was related to the fact that multitask learners were returning sequences of training results, which makes them different from single-task learners, and also obscures an assumption (that one training result corresponds to one label). By refactoring MultiTaskLearner to return a single MultiTaskTrainingResult, I think it slots in fairly well along with the single-task learners.
Regarding non-Gaussian distributions, I don't see those as being a concern. The theory that we have (Wager et al.) is focused on estimating the standard error of ensemble predictions, and that's what we report as the uncertainty. That theory can be extended to compute a correlation coefficient between two uncertainties, which we will expose using this new interface. Drawing an error bar or slapping a (multivariate) Gaussian on the prediction is certainly an approximation, but:
a. it's not an approximation that Lolo is making (software that consumes Lolo may make that approximation), and
b. we have no good reason, at the moment, to make any other approximation.
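As a toy illustration of the "extended to a correlation coefficient" point (this is just a plain sample correlation over ensemble members at one prediction point, not the bias-corrected Wager et al. estimator, and not code from this PR):

```scala
// Toy sketch: the per-tree predictions that give a per-label standard error
// also give a cross-label covariance, and hence a correlation coefficient.
def naiveUncertaintyCorrelation(predsI: Seq[Double], predsJ: Seq[Double]): Double = {
  require(predsI.length == predsJ.length && predsI.length > 1)
  val n = predsI.length
  val meanI = predsI.sum / n
  val meanJ = predsJ.sum / n
  val cov = predsI.zip(predsJ).map { case (a, b) => (a - meanI) * (b - meanJ) }.sum / n
  val varI = predsI.map(a => (a - meanI) * (a - meanI)).sum / n
  val varJ = predsJ.map(b => (b - meanJ) * (b - meanJ)).sum / n
  cov / math.sqrt(varI * varJ) // undefined if either label's ensemble variance is zero
}
```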
If we knew that this Gaussian approximation was a significant issue and we had some compelling alternative theory to generate confidence intervals, then we could plan around a more generic Distribution type. But neither of those things are true, so it would be foolish to prematurely try and build for it. At the moment, getUncertainty always returns a Double for real-valued predictions. So I think it's safe and meaningful to make getUncertaintyCorrelation also return a Double. It hardens the contract around current behavior. Switching to distributions would be a complete re-creation. We wouldn't be digging out of the current hole; we'd be teleporting to the top and digging a new hole. So might as well dig our current hole deeper and spruce it up a little.
Let me get the tl;dr out of the way first: I agree re-creation is a separate scope.
> I disagree on several fronts.
>
> 1. With the most recent changes, I don't think that this PR adds an unwieldy layer
I don't mean to say that this PR is unwieldy. On the contrary, I think it does a great job of meeting the codebase where it is, with a new need. Sorry if that came across otherwise. Rather, I am concerned about the long-term effect of building multivariate features on top of machinery that was engineered only to serve univariate cases. As you highlighted, there was code smell associated with introducing parallel interfaces that shared much of the same code; the same is true for the parallel maintenance of learners and multi-task learners. But I admit my opinion about the topic in front of us, which proved pessimistic, was formed before giving a good look at your latest round of changes. Updating the interface with MultiTaskTrainingResult was a big improvement.
> 2. I don't think that this work makes non-Gaussian distributions more urgent
Agreed.
> 3. I don't think that non-Gaussian distributions need to be supported
I don't agree. I certainly don't think they need to be supported by this PR, but I do consider them a likely target for future work, so I want that to be kept in mind when we update these areas of the code.
> If we knew that this Gaussian approximation was a significant issue and we had some compelling alternative theory to generate confidence intervals, then we could plan around a more generic Distribution type. But neither of those things are true, so it would be foolish to prematurely try and build for it. At the moment, getUncertainty always returns a Double for real-valued predictions.
There is literature that motivates the use of non-Gaussian multivariate predictive distributions, with particular application to small data like we face; one particular model type we may want to consider comes with uncertainty quantification developed from first principles. I'll talk to you about that separately.
A high level comment: I find some of the naming conventions that were already present quite hard to infer. For instance, BaggedMultiResult is a PredictionResult[Double] that does not implement your new MultiTaskModelPredictionResult, whereas MultiTaskBaggedResult does.
I agree. When I first started looking into the existing code, it took me some time to realize that BaggedMultiResult referred to multiple predictions, not multiple labels. I suppose we can rename it now, since we're already making breaking changes.

Any suggestions? BaggedMultiPredictionResult could work, but that could also be interpreted as multiple labels (a single "multi-prediction"). What about SinglePredictionBaggedResult and MultiPredictionBaggedResult?
> What about SinglePredictionBaggedResult and MultiPredictionBaggedResult?
I like this option. It makes it clear that BaggedSingleResult and BaggedMultiResult are their own concept, compared to MultiTaskBaggedResult, which is a separate case.
PLA-8517
The current approach to multitask learning is to choose the decision tree splits once by taking all labels into account, but return a sequence of models, one for each label. Because these models are independent objects, we lose the ability to potentially compute uncertainty correlation information. The goal of this PR is to expose a getUncertaintyCorrelation method in a type-sensible way.

Unfortunately, the existing paradigm around multitask learning assumes that the learner outputs one model for each label. There are several places where this is enforced, either explicitly or subtly (and an exception will occur if this requirement is not met).
The approach I've tried here is to create a parallel series of objects, starting with MultiTaskCombinedLearner, which is a spiritual peer to MultiTaskLearner, but instead of training outputting a Seq[TrainingResult], it outputs a single MultiModelTrainingResult. This yields a MultiModel, which makes predictions that are MultiModelPredictionResult. The key type enforcement is that MultiModelPredictionResult predicts Seq[Any]. This allows us to include some information and methods in MultiModel that reflect the fact that predictions correspond to multiple labels.

With this type structure, the getUncertaintyCorrelation method is sensible for MultiModelPredictionResult, and any training that starts with a MultiTaskCombinedLearner is guaranteed to result in predictions for which the getUncertaintyCorrelation method applies (although it might return None). The method takes two arguments, indices i and j, and returns an optional sequence of correlation coefficients for each prediction.
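A rough sketch of that structure (the type names are the ones introduced in this PR, but the parent traits and method signatures shown here are approximations, not the actual code):

```scala
// Approximate sketch; TrainingResult, Model, and PredictionResult stand in for
// the existing lolo types, and the signatures are illustrative.
trait TrainingResult
trait PredictionResult[+T] { def getExpected(): Seq[T] }
trait Model[+T <: PredictionResult[Any]] { def transform(inputs: Seq[Vector[Any]]): T }

// Training with a MultiTaskCombinedLearner yields a single training result...
trait MultiTaskCombinedLearner {
  def train(
      inputs: Seq[Vector[Any]],
      labels: Seq[Seq[Any]],
      weights: Option[Seq[Double]] = None
  ): MultiModelTrainingResult
}

// ...whose model predicts all of the labels at once...
trait MultiModelTrainingResult extends TrainingResult { def getModel(): MultiModel }
trait MultiModel extends Model[MultiModelPredictionResult]

// ...so the prediction result predicts Seq[Any] and can expose cross-label
// information, such as the uncertainty correlation between labels i and j
// (one coefficient per prediction, or None if unavailable).
trait MultiModelPredictionResult extends PredictionResult[Seq[Any]] {
  def getUncertaintyCorrelation(i: Int, j: Int): Option[Seq[Double]]
}
```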
A major drawback of this approach is that it creates parallel sets of objects that share much of their logic with existing objects, even if they're not connected. Most of the shared logic is abstracted into mix-in traits, specifically AbstractMultiTaskBagger and MultiTaskTree, so there's little repetition of logic. But it creates boilerplate and doesn't scale easily. If we want the Standardizer or FeatureRotator learners to be applicable to MultiTaskCombinedLearner, then we'll have to create new versions of the learner/training result/model/prediction result objects. There will be very little logic in these objects, but it's clunky.

Unfortunately, since the "multitask learners produce one model per label" assumption is pervasive, I'm finding it difficult to come up with a structure that doesn't result in so much boilerplate. For example, MultiTaskStandardizer makes this assumption when it zips the results of baseLearner.train together with the outputs.

Outstanding question: getUncertaintyCorrelation()?

Outstanding issues for this or follow-up PRs:
- getUncertaintyCorrelation
- Standardizer and FeatureRotator work with multi-task combined models (Standardizer is especially important because we want the inputs to be on the same scale when the splits are being calculated).