Direct API discussion: ITrainer proposed changes

There are two changes proposed here for ITrainer. Due to the nature of the changes, assuming we agree they are good ideas, it really makes sense that they ought to happen in one go, since they involve changes to the same core functionality.

I am quite certain the first thing is a good idea, but the second thing I am less certain about. (Of course my confidence could be misplaced. :smile:) People that occur to me as potentially being good contributors to the discussion would be @eerhardt , @zeahmed , @shauheen , @ericstj , @glebuk . (To be clear, this is not exclusionary. Anyone can and should feel free to comment. I just want these people to get flagged is all. :smile: )

`ITrainer<TData, TPred>.Train` ought to return the predictor

Currently in order to train a predictor, there is a two step process. You call ITrainer.Train then call ITrainer.GetPredictor. As near as I can tell this arrangement was meant as some scheme to support online training, but of course that vision never came to pass, and if we were to actually support online training I have to imagine it would be through some separate interface anyway.

This arrangement seems unambiguously bad. It necessitates making ITrainer objects stateful for no particularly good reason. This complicates both the implementation and usage of the class, since (1) the caller can't do things like call Train multiple times even though a developer, seeing this interface, might reasonably suppose such a thing were possible and (2) the author of the ITrainer component has to protect against that misuse.

Get rid of `IValidatingTrainer`, `IIncrementalTrainer`, `IValidatingIncrementalTrainer`

The problem

First let's talk about the problem...

Most (all?) trainers implement ITrainer<RoleMappedData>, based on this interface here.

https://github.com/dotnet/machinelearning/blob/828dc227f4d7346e11094479c7a2e443addc8102/src/Microsoft.ML.Core/Prediction/ITrainer.cs#L105

We have historically followed adding more inputs to the training process by declaring specialized interfaces that represent the Cartesian product of all possible permutations of inputs, as we see:

There was a discussion about adding a validation set. So we doubled the number of interfaces to two, ITrainer, and IValidatingTrainer.
Later on there was a discussion about adding continued training based on an initialization. So we again doubled the number of interfaces, introducing IIncrementalTrainer and IValidatingIncrementalTrainer.
There have been discussions of adding a test set to allow computing metrics as training progresses. Following the same strategy we would of course again double the number of interfaces (the full set being represented, perhaps, by IValidatingIncrementalTestingTrainer), for a total of eight.
If hypothetically we were to somehow allow for one more input beyond that, we'd have a total of sixteen interfaces.

Etc. etc. That there is this exponential cost makes clear something is misdesigned. This has cost not only here and in the implementations of ITrainer, but in the usage as well. Here we see a method that explores the cartesian product of possible interfaces so it can call the right one. It seems to me something is wrong here when just calling "train" requires a fairly non-obvious utility method to make sure we call the "right" train.

This issue incidentally is the primary reason why we haven't done anything like add support for test set metrics during training (despite the many requests). That is, it is not any technical difficulty with the idea itself, it's just that writing such a thing would make the code unmaintainable.

The possible solution(s)

So: instead we might just have one interface, with one required input (the training dataset), and all these other things are optional.

There are two obvious ways I could imagine doing this, first explicitly as part of the method signature on ITrainer<...>:

public interface ITrainer<TDataset, TPredictor> {
    TPredictor Train(TDataset train, TDataset validation = null, TDataset testSet = null, IPredictor initPredictor = null); }

Or else have some sort of context object. (I'm not married to any of these names, to be clear. :smile: )

public sealed class TrainContext {
    public RoleMappedData Train { get; }
    public RoleMappedData Validation { get; }
    public RoleMappedData Test { get; }
    public IPredictor InitPredictor { get; }
}

and all trainers implement ITrainer<TrainContext> instead of ITrainer<RoleMappedData>.

The latter is perhaps a bit more awkward since it involves the addition of a new abstraction (the hypothetical TrainContext), but it is more flexible in a forward-looking sense, since if we add more "stuff" to how we initialize trainers, we won't break all existing ITrainer implementations. (My expectation is that trainers that can't support something would simply ignore.)

In current design, I have an easy way to detect, does this algorithm support Validation or not, by just looking on class interfaces. Should we preserve it and add public bool SupportValidationSet { get; }

Hi @Ivanidzo4ka , yes, I'd think such a mechanism would be necessary. Maybe something akin to ITrainerEx. Or even part of ITrainerEx itself, where the default implementation in TrainerBase would be "no."

I'd be in favor of the TrainContext approach. It allows us to add new options in the future without adding more interfaces, or breaking existing interfaces (like the single ITrainer.Train(TDataset train, TDataset validation = null, TDataset testSet = null, IPredictor initPredictor = null)).

Just throwing another possibility out for discussion: We could have an abstract Trainer class instead of an interface. And as we add new "capabilities" (like validation or incremental) we add a boolean property SupportsCapability to the class. By default the property would be false, so existing trainers that don't support the capability would tell consumers that the capability isn't supported, and thus going to be ignored.

A separate design question on this hierarchy: Why do we have the need for interface ITrainer<in TDataSet> if TDataSet is always RoleMappedData? Can't we remove the generic type here for simplicity? (The same question holds after "and all trainers implement ITrainer<TrainContext> instead of ITrainer<RoleMappedData>." Why do we need ITrainer<TrainContext> if all trainers use the same type?)

Thanks @TomFinley for the proposal. These are the reasonable changes to make API more convenient for end-users.

I am fine with both options. However, the following option seems more convenient as one knows in a glance what this method expects and what is expected from this method. This won't be clear in other option unless one inspects TrainContext. It semantics also resembles other ML API training methods.

public interface ITrainer<TDataset, TPredictor> {
    TPredictor Train(TDataset train, TDataset validation = null, TDataset testSet = null, IPredictor initPredictor = null); }

I suggest that we can follow ITrainer<TrainContext> approach as it is scalable and have convenience methods around this for end-users similar to option 1 (if possible).

One thing I see in both options is that some parameters won't make any sense for some trainers. E.g. not all trainers support validation during training. Even user pass the validation argument, there will be no use of this parameter. So, we might need to document these exceptions clearly. Similar is the case with initPredictor.

Keep the virtual contract as simple as possible, in that sense the context solution is better. Extension methods on the interface could add convenience without changing the virtual contract.

eg:

static class TrainerExtensions
{
    public static TPredictor Train(this ITrainer<TDataSet,TPredictor> trainer, TDataset train, ...)
    {
        return trainer.Train(new TrainContext()
        {
           Train = train,
           ...
        };
    }
}

Alternatively, as @eerhardt mentioned, an abstract base class can have a single virtual with concrete implementations of the convenience methods. That's how we would typically do this in the framework (assuming "Trainer" was the primary behavior of the object and you didn't need the multiple-inheritance characteristics you'd get from an interface).

I suggest that we can follow ITrainer<TrainContext> approach as it is scalable and have convenience methods around this for end-users similar to option 1 (if possible).

Hmmm OK. An extension method, perhaps? (Ah, that was funny, as I finished writing this sentence, I see @ericstj just replied, with the idea of using extension methods. :smile: )

One thing I see in both options is that some parameters won't make any sense for some trainers.

Indeed. I think though that @Ivanidzo4ka 's suggestion of ITrainerEx properties (or something like that) helps resolve that, and ultimately more gracefully than the current code.

@ericstj yes something like that. In this case however TDataset would not be generic but specifically be a RoleMappedData. And we would probably not want to make the TrainContext mutable.

Oh I somehow missed that @eerhardt had replied before @zeahmed hi Eric. 👋

The use of an actual abstract base class may be possible. We of course were using multiple inheritance in this very thing (since an IValidatingIncrementalTrainer would descend from both IValidatingTrainer and IIncrementalTrainer), but part of this proposal is that I'm getting rid of that mechanism anyway so that's a moot point. There is also this strange thing IHasLabelGains : ITrainer but that thing seems useless anyway. (Maybe there was some prior mechanism that needed this somehow.)

Edit: That would also resolve the strange thing with ITrainerEx.

A separate design question on this hierarchy: Why do we have the need for interface ITrainer<in TDataSet> if TDataSet is always RoleMappedData?

I considered proposing this as well. It may be a good idea... the only time that generic interface has been useful is when we've changed it, but when we've changed it, we've changed it everywhere... e.g., ITrainer implementations were initially (as far as I know) ITrainer<Instances> then they changed to ITrainer<IDataView> then they were ITrainer<RoleMappedData> and now we propose making them ITrainer<TrainContext> (or some other class name than TrainContext, please feel free to suggest another name.) But otherwise that's just a needless complication, and it's weird to say, "hey we have this generic so that when we inevitably break everything the transition is a bit easier. :smile:"

Just to summarize, this is what I get from @eerhardt, @ericstj, @zeahmed, and @Ivanidzo4ka into a more or less basic POR.

No disagreement on first point. Train should return the predictor. Trainers can therefore migrate to being simpler, stateless objects.
TrainContext-based train is preferred solution on second point for various reasons summarized above. This will be a class with three properties, one required (training set) the other two optional (validation set and initial predictor) and possibly null.
ITrainer nuked in favor of Trainer abstract base class. That will contain the functionality of both ITrainer as well as ITrainerEx. Most trainers descend from TrainerBase anyway, so this should be an "easy" transition. :smile: (Edit: harder, I just realized you can't have covariance on classes, which is annoying.)
What had been TDataset generic parameter on trainers no longer generic, but is specifically over an TrainContext. I do not see anywhere that any ITrainer implementation exists that is not also an ITrainer<RoleMappedData>.

TIL that you can't have covariant type parameters on classes, only on interfaces and delegates. 😦

Hey @eerhardt, I guess I never actually tried before otherwise I might have known, but covariant type parameters can't be put on classes... you can do interface ITrainer<out TPredictor> { } but can't do class Trainer<out TPredictor> { }.

So while a ITrainer<LinearRegressionPredictor> is also an ITrainer<LinearPredictor> (yes, this is useful), a Trainer<LinearRegressionPredictor> is not a Trainer<LinearPredictor>.

That's really unfortunate. I was rather looking forward to nuking ITrainerEx. 😝

Edit: Maybe, we can have a covariant interface implemented by Trainer<TPred>, called ITrainerThingie<out TPred> { Trainer<TPred> Trainer { get; } }, and the only thing you can do with ITrainerThingie by itself is just get the predictor out of it, and otherwise it returns an (internal) wrapping of Trainer? Because my expectation is that most things don't need covariance, but the few things that need it would still have a path to do so. Something about that solution makes my skin crawl though. 😦

Edit edit: No, that's a silly idea. There should be exactly one abstraction. If adopting the second abstraction is necessary to overcome the limitations of the first, then we should not have adopted the first abstraction.

I agree this is unfortunate.

Maybe we could re-evaluate whether we need the out TPredictor generic type at all, the only thing I can image will be used for is the return type from the newly proposed Train method above.

Instead, we could have a design that went something like this (obviously I'm skipping a bunch of details, but just giving the overall approach idea):

public abstract class Trainer
{
    public Predictor Train(TrainContext context)
    {
        return this.TrainPredictor(context);
    }

    protected abstract Predictor TrainPredictor(TrainContext context);
}

public class LinearTrainer : Trainer
{
    public new LinearPredictor Train(TrainContext context)
    {
        // real logic happens
        return new LinearPredictor();
    }

    protected override Predictor TrainPredictor(TrainContext context)
    {
        return this.Train(context);
    }
}

public class TrainContext {}
public class Predictor {}
public class LinearPredictor : Predictor {}

The idea being - if I have a Trainer, I can call Train on it, and I get back the base Predictor object. But if I know the specific kind of Trainer/Predictor pair I'm working with (in this case the LinearTrainer and LinearPredictor), when I call myLinearTrainer.Train(), I get back an LinearPredictor because the Train method is new on the concrete class level.

The advantage of this approach is that consumers don't need to deal with a generic type at all. If they want to write general logic that can take any Trainer/Predictor pair, they code at the base class level. If they know the specific type they are working with - they can cast down to that concrete type and get the extra methods/properties/etc.

The way we have it today, with ITrainer<out TPredictor>, consumers always need to fill out the generic TPredictor type. If they know the concrete type, they can fill it in. But if they want to write general logic that can take any Trainer/Predictor pair, then they themselves need to take the generic type, and the onus is on their callers to fill in the generic type. Or the caller knows they can fill the generic type with ITrainer<IPredictor>, which isn't obvious IMO.

Yes I see. Allow me to share some feedback on that proposal, based on how such trainers are actually used. (Just a warning: I am going to disagree with you, but I hope it is evident that I respect your position, I consider your positions very well considered and thought out, and my disagreement was not easy. I really did genuinely want there to be a solution based on an abstract base class, and it is only after much thought that I have to consider that arrangement impossible and undesirable.)

You have posited the existence of a hierarchy of trainers... LinearTrainer descending from Trainer and so forth. Consider, say, the SDCA binary classification trainer, which would surely descend from this hypothetical LinearTrainer class. But is the fact that it's a linear classifier important to the caller, or the fact that the predictor can also in some sense have global feature importances, per instance feature importances, can be parameter mixed, is a binary classifier, etc. etc.? That of course depends on what is calling it... global feature importance is important to feature selection transform, per instance feature importances the so-called WTF scorer, parameter mixed important to ensembling, binary classifier to OVA. Or even some other criteria we haven't even considered yet. So there is a multiple inheritance problem, and each of these has considerations utterly orthogonal to the others.

Happily, that multiple inheritance problem is already covered on the IPredictor side. (This frankly is where it belongs, since most subsequent processing based on training a machine learning algorithm will obviously be over the product of the data analysis, a predictor, vs. the analysis procedure itself, a trainer.) This brings up a second point: while multiple inheritance is possible on this abstract class Trainer side thorugh marker interfaces, I wouldn't want to duplicate what was already done on the predictor side. Especially since the only thing that people ever really do to trainers is (1) instantiate them and (2) get predictors out of them, I'd really prefer to keep that complexity at the level of the predictors since again they'd be doomed to be the actual objects people wind up caring about anyway. (See prior point in sentence in this paragraph starting, "This frankly.")

This in turn leads me to, perhaps, not fully agree with the point about a user somehow not being "helped" by generics (your point in the paragraph beginning "The advantage of this approach is that consumers don't need to deal with a generic type at all"), since the alternative that is envisioned is that there would be an inheritance hierarchy on trainers that mirrors, without being in any way syntactically bound by, the same inheritance structure found on predictors. I can't say I like that solution very much, not only because of its intrinsic complexity but also because the problem of semantically-parallel-but-syntactical-unrelated inheritance structures it invites is precisely the principal problem generics were invented to solve. In the .NET framework we don't have ListOfInts we have List<int>, and that's good, not bad. I have about as much enthusiasm in seeing LinearTrainer come to pass as I have seeing ListOfInts becoming available in System.Collections.

This is probably enough. However, there is also dependency injection to consider, which touches upon some further out plans that I have. (As you might be able to tell from my issues and PRs, this issue by itself is just one of a longer term plan in the service of #371 .) Being able to say "tell me all trainers that produce this type of predictor" is something used a lot. We actually currently do this via some scheme called "signatures," e.g., here, but this is IMO a poor solution. I have a secondary plan to replace this idiom, as part of a broader plan to do away with ML.NET's custom-coded dependency injection framework. But I'm going to have to replace it with something, and unfortunately all possibly candidate somethings require establishing an is-a relationship on trainer implementing classes, which obviously requires covariance.

Architectural considerations addressed, let's move on to more casual user-facing considerations:

The way we have it today, with ITrainer<out TPredictor>, consumers always need to fill out the generic TPredictor type. If they know the concrete type, they can fill it in. But if they want to write general logic that can take any Trainer/Predictor pair, then they themselves need to take the generic type, and the onus is on their callers to fill in the generic type. Or the caller knows they can fill the generic type with ITrainer<IPredictor>, which isn't obvious IMO.

I don't object per se, but for these people, happily, given that we are now not going to make the dataset type generic and instead just say ITrainer of any stripe always takes a TrainContext, ITrainer can and will have a method with signature IPredictor Train(TrainContext). So your requirement will certainly be there.

The reason I don't object however is purely on practical grounds since it is a moot point. The hypothetical population of people you're talking about that would care about this issue are those sophisticated enough to realize that there is such a thing as a top level ITrainer (most people I expect will just instantiate the specific sealed class implementor of trainer they want, call train, and be done with it), while simultaneously unsophisticated enough to not realize that (1) there was a generic on the way to that ITrainer and (2) that generic was instantiated over a specific implementation of IPredictor. I'm certain the first class of person exists, I am less convinced but still willing to entertain the possibity of the second person, but I don't consider the intersection of those two sets being the non-empty set plausible.

dotnet / machinelearning