dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

Direct API: Scenarios to light up for V1 #584

Closed: TomFinley closed this issue 6 years ago

TomFinley commented 6 years ago

The following is a preliminary list of required scenarios for the direct access API that we will use to focus the work. The goal is that the experience for these should be good and unproblematic. Strictly speaking, everything here is possible to do right now using the components as they stand implemented today. However, I would say that it isn't necessarily a joy to do them, and there are lots of potential "booby traps" lurking in the code unless you do everything exactly correctly (e.g., #580).

Companion piece for #583.

/cc @Zruty0 , @eerhardt , @ericstj , @zeahmed , @CESARDELATORRE .

eerhardt commented 6 years ago

Overall, I like the list, thanks for writing this up, @TomFinley.

One place where I think it could be a little more extensive is the "inference/prediction" scenarios. The only real place I see them mentioned is in the 2nd bullet:

Serve the scenario where training and prediction happen in different processes (or even different machines).

It would be great if we could drill into the purely "inference/prediction" scenarios a bit more. For example, how to do inference in a multi-threaded application (say an ASP.NET WebAPI Controller). We should have some story here. Today, in our GitHub issue labeler service, we are deserializing the model on every request because PredictionModel isn't a thread-safe object.
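A minimal sketch of the "load once, predict many times" pattern, assuming the estimator/transformer API that eventually shipped rather than the legacy PredictionModel; the SentimentInput/SentimentOutput types and the ScoringService name are purely illustrative. The lock is the bluntest way to cope with the engine not being thread-safe:

```csharp
using Microsoft.ML;

// Hypothetical input/output types for illustration.
public class SentimentInput
{
    public string Text { get; set; }
}

public class SentimentOutput
{
    public bool PredictedLabel { get; set; }
}

// Load and deserialize the model once at startup, not on every request.
public class ScoringService
{
    private readonly PredictionEngine<SentimentInput, SentimentOutput> _engine;
    private readonly object _sync = new object();

    public ScoringService(string modelPath)
    {
        var mlContext = new MLContext();
        ITransformer model = mlContext.Model.Load(modelPath, out _);
        _engine = mlContext.Model.CreatePredictionEngine<SentimentInput, SentimentOutput>(model);
    }

    public SentimentOutput Predict(SentimentInput input)
    {
        // The engine reuses its internal buffers between calls, so access must be serialized.
        lock (_sync)
        {
            return _engine.Predict(input);
        }
    }
}
```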

Zruty0 commented 6 years ago

Yes, our PredictionEngine is not thread-safe.

We had a doc somewhere on three possible ways to make thread-safe predictions using PredictionEngine, each with different repercussions, but I'm struggling to find it now...
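One of those possible ways might look like the sketch below: give each thread its own engine, so no synchronization is needed at all. This again assumes the MLContext/PredictionEngine surface that later shipped and reuses the illustrative SentimentInput/SentimentOutput types from the sketch above; a pooled variant of the same idea eventually shipped as PredictionEnginePool in the Microsoft.Extensions.ML package.

```csharp
using System.Threading;
using Microsoft.ML;

public class PerThreadScorer
{
    private readonly ThreadLocal<PredictionEngine<SentimentInput, SentimentOutput>> _engines;

    public PerThreadScorer(string modelPath)
    {
        var mlContext = new MLContext();
        ITransformer model = mlContext.Model.Load(modelPath, out _);

        // One engine per thread: each keeps its own reusable buffers, so concurrent
        // requests never touch shared mutable state and no locking is needed.
        _engines = new ThreadLocal<PredictionEngine<SentimentInput, SentimentOutput>>(
            () => mlContext.Model.CreatePredictionEngine<SentimentInput, SentimentOutput>(model));
    }

    public SentimentOutput Predict(SentimentInput input) => _engines.Value.Predict(input);
}
```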

ArieJones commented 6 years ago

+1 for File-based saving of data. Right now I am trying to find a way to get the "featurized" data out of the pipeline so that I can feed it into a different system, such as R, to consume and do some analysis with.
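A sketch of what that could look like with the estimator/transformer API that later shipped: fit a featurization-only pipeline, transform the data, and write the result out with the text saver so another tool such as R can pick it up. File paths and column names below are illustrative.

```csharp
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// Load raw data (column names and file paths are illustrative).
IDataView data = mlContext.Data.LoadFromTextFile<InputRow>("input.csv", hasHeader: true, separatorChar: ',');

// A featurization-only pipeline: transforms, but no trainer at the end.
var pipeline = mlContext.Transforms.Text.FeaturizeText("TextFeatures", nameof(InputRow.Description))
    .Append(mlContext.Transforms.Concatenate("Features", "TextFeatures", nameof(InputRow.Price)));

IDataView featurized = pipeline.Fit(data).Transform(data);

// Write the featurized data out as delimited text that R (or any other tool) can read.
using (var stream = File.Create("featurized.tsv"))
    mlContext.Data.SaveAsText(featurized, stream, separatorChar: '\t', headerRow: true, schema: false);

public class InputRow
{
    [LoadColumn(0)] public string Description { get; set; }
    [LoadColumn(1)] public float Price { get; set; }
}
```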

ericstj commented 6 years ago

In addition to export of models, import of some models should work as well. It's also interesting to look at what capabilities are available after import (or even after loading a saved model). At a minimum we'd need to be able to predict. It would also be valuable to retrain and evaluate. I can imagine that for some imported models it may not be possible to modify the model in a meaningful way; if that's the case, we should be clear about that.
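For the load path, here is a sketch of the minimum bar described above, assuming the model-loading surface that eventually shipped and a binary-classification model with illustrative input columns: both prediction (transforming new data) and evaluation should work on a freshly loaded model.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// Load a previously saved model; the returned schema describes the input it expects.
ITransformer model = mlContext.Model.Load("model.zip", out DataViewSchema inputSchema);

// Predict after load: run new data through the loaded transformer chain.
IDataView newData = mlContext.Data.LoadFromTextFile<ModelInput>("new-data.tsv", hasHeader: true);
IDataView scored = model.Transform(newData);

// Evaluate after load: compute metrics against the labels in the new data.
var metrics = mlContext.BinaryClassification.Evaluate(scored, labelColumnName: "Label");
Console.WriteLine($"AUC after reload: {metrics.AreaUnderRocCurve}");

public class ModelInput
{
    [LoadColumn(0)] public bool Label { get; set; }
    [LoadColumn(1)] public string Text { get; set; }
}
```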

zeahmed commented 6 years ago

Building on what @ericstj said: since we are moving towards the estimators/transformers design, it would be a good idea to be able to reload the estimator pipeline (for retraining) from the saved model, or to have a way to save the estimator pipeline and load it back. This would make sharing of pipelines a bit easier, though it can currently be done with the C# code itself.
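To make the gap concrete, here is a sketch (column names and trainer choice are illustrative) of how things stand when only the fitted transformer chain is persisted: the estimator pipeline itself exists only as C# code, so retraining elsewhere means re-running that code.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();
IDataView trainingData = mlContext.Data.LoadFromTextFile<HouseData>("houses.csv", hasHeader: true, separatorChar: ',');

// Today the estimator pipeline exists only as C# code...
var pipeline = mlContext.Transforms.Concatenate("Features", nameof(HouseData.Size))
    .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: nameof(HouseData.Price)));

// ...and only the fitted transformer chain is persisted to disk.
ITransformer model = pipeline.Fit(trainingData);
mlContext.Model.Save(model, trainingData.Schema, "model.zip");

// Retraining in another process therefore means re-running this same code on new data;
// a persistable estimator pipeline would remove the need to share the source.

public class HouseData
{
    [LoadColumn(0)] public float Size { get; set; }
    [LoadColumn(1)] public float Price { get; set; }
}
```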

TomFinley commented 6 years ago

It would be great if we could drill into the purely "inference/prediction" scenarios a bit more. For example, how to do inference in a multi-threaded application (say an ASP.NET WebAPI Controller)

That is a good point @eerhardt , I should mention multi-threaded prediction. (I have edited with an addendum "Multi-threaded prediction".) I am somewhat curious about why this solution of deserializing on every call -- assuredly people that internally use this library to serve web requests don't do that, so I wonder why it was necessary in that case. (Something peculiar to the abstractions adopted by the entry-point-based API, I'm guessing?) Anyway, yes, I should hope we can do better than this.

Is this an adequate description of the thing I should add? If you approve (or have suggestions and clarifications) I'll edit the issue and append it.

  • Multi-threaded prediction. A twist on "Simple train and predict", where we account for the fact that multiple threads may want predictions at the same time. Because we deliberately do not reallocate internal memory buffers on every single prediction, the PredictionEngine (or its estimator/transformer based successor) is, like most stateful .NET objects, fundamentally not thread safe. This is deliberate and as designed. However, some mechanism to enable multi-threaded scenarios (e.g., a web server servicing requests) should be possible and performant in the new API.

@ArieJones yes. We have some existing code for this in "savers" (e.g., the text saver, binary saver, etc.), but it is currently buried in the Runtime namespaces, which we're thinking of exposing (at least, much of it).

Next point:

In addition to export of models, import of some models should work as well.

@ericstj ah. So for example, not only export to ONNX/ONNX-ML, but also the ability to somehow incorporate ONNX/ONNX-ML pipelines into our pipelines. I agree that this would be useful. I'm unsure that I want to put it that directly. You'll note that I've tended to describe scenarios to elucidate what might be troublesome architectural issues with the abstractions in the API, and less to advocate for this or that specific component. (If I had been doing the latter exercise, I certainly would have mentioned incorporating a PyTorch or TensorFlow trainer and predictor. 😄) But perhaps I can put it like this:

  • Expressive transformation: The estimator and transformer architecture should be robust and expressive enough to cover scenarios of wrapping external components, e.g., a transformer capable of pumping a dataset through an ONNX-ML model, or one wrapping TensorFlow.
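As a rough illustration of what such a wrapping component could look like from the caller's side, here is a sketch using the ONNX scoring transform that later shipped in the Microsoft.ML.OnnxTransformer package; the model file, column names, and the requirement that "input" match the ONNX graph's input node are all illustrative assumptions.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();
IDataView data = mlContext.Data.LoadFromTextFile<FeatureRow>("features.tsv", hasHeader: true);

// The externally produced ONNX model is just another transform in the chain,
// composable with native ML.NET estimators before and after it. The column name
// "input" is assumed to match the input node of the ONNX graph.
var pipeline = mlContext.Transforms.Concatenate("input", nameof(FeatureRow.Feature1), nameof(FeatureRow.Feature2))
    .Append(mlContext.Transforms.ApplyOnnxModel(modelFile: "external-model.onnx"));

IDataView scored = pipeline.Fit(data).Transform(data);

public class FeatureRow
{
    [LoadColumn(0)] public float Feature1 { get; set; }
    [LoadColumn(1)] public float Feature2 { get; set; }
}
```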

What do you think @ericstj? Something about my description seems weak, since "sufficiently expressive" is such a vague requirement that I'm not sure it will be useful for @Zruty0 's task of writing samples to validate the API (whether estimator/transformer based or not).

@zeahmed thanks for your comment! Could you clarify this ask a bit more? While persistable transforms are essential to the framework being useful at all, the scenario for persistable estimators is less clear to me. Retaining information on how a model was trained is certainly important, but I consider this best served by the ideas behind so-called "model management," for which merely saving an estimator/trainer by itself would not necessarily be terribly helpful compared to, say, having the code that did the training actually checked in. But this may simply be my own lack of imagination.

zeahmed commented 6 years ago

Thanks @TomFinley. Yes, debuggability of the model is one of the scenarios behind saving the estimator pipeline.

However, I also see its use in the online training case, where it is essential to know which algorithm/estimator was used to generate the model. This is relevant to the following scenario:

  • Train with initial predictor: Similar to the simple train scenario, but the trainer is also given a previously trained predictor and continues training from it. The scenario might be one of the online linear learners that can take advantage of this, e.g., averaged perceptron.

I feel that having the code is not sufficient here, because during online training all the pipeline components will be used as-is except for the learner, which will be trained further. Maybe you already have a solution for this in mind, but I just wanted to point out this scenario in case.

Zruty0 commented 6 years ago

I think 'train with initial predictor' could be implemented simply as a version of the trainer's constructor that takes a 'linear predictor transformer' as one of the arguments and trains on top of it.

In this particular case I don't see a need to memorize the pipeline anywhere.
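A sketch of that idea, assuming the warm-start Fit overload that the online linear trainers (such as averaged perceptron) eventually exposed rather than a constructor argument; data files and column layout are illustrative:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();
IDataView firstBatch = mlContext.Data.LoadFromTextFile<Row>("batch1.tsv", hasHeader: true);
IDataView secondBatch = mlContext.Data.LoadFromTextFile<Row>("batch2.tsv", hasHeader: true);

var trainer = mlContext.BinaryClassification.Trainers.AveragedPerceptron(
    labelColumnName: nameof(Row.Label), featureColumnName: nameof(Row.Features));

// Initial training pass.
var firstModel = trainer.Fit(firstBatch);

// Continue training from the previously learned weights (a warm start); nothing
// about the rest of the pipeline needs to be memorized anywhere.
var updatedModel = trainer.Fit(secondBatch, firstModel.Model);

public class Row
{
    [LoadColumn(0)] public bool Label { get; set; }
    [LoadColumn(1, 10), VectorType(10)] public float[] Features { get; set; }
}
```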

ericstj commented 6 years ago

@TomFinley I'm looking for less of a promise of what will work and more of a promise of a description of what will work. I would imagine that some file formats behave like a save file, where you can resume doing anything you were able to do before saving, whereas some exports/imports are lossy: you cannot do as much with them once imported. I'm not sure I see the value (or even the testability) of an export without import.

Zruty0 commented 6 years ago

It is, in my view, totally fine to have one-way exports. Like 'export predictor as human-readable text'.

ericstj commented 6 years ago

@Zruty0 that's fair, though the end goal there is different. I would imagine we'd want a strong reason for doing one-way exports if the end is some functionality also supported by ML.NET; otherwise you create a channel for moving people away from ML.NET instead of toward it.

Zruty0 commented 6 years ago

@TomFinley While I was implementing TrainWithValidationSet and TrainWithInitialPredictor, I discovered that the current low-level API has certain peculiarities about them:

With the estimators-based API it would be possible to disallow all the above three cases at compile time, if we have a specialized Train method in them (which is what we plan on having).

eerhardt commented 6 years ago

@TomFinley - Your multi-threaded prediction write-up looks good to me. I've added it to the top post in this issue.

I am somewhat curious about why this solution of deserializing on every call -- assuredly people that internally use this library to serve web requests don't do that, so I wonder why it was necessary in that case. (Something peculiar to the abstractions adopted by the entry-point-based API, I'm guessing?) Anyway, yes, I should hope we can do better than this.

The reason is because that solution is using the PredictionModel<,> class, which wraps a BatchPredictionEngine<,> object, which isn't thread safe.

Zruty0 commented 6 years ago

Another batch of findings from the implementation work:

Zruty0 commented 6 years ago

I think that this issue can safely be closed now.