dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

Direct API: Scenarios to light up for V1 #584

Closed: TomFinley closed this issue 6 years ago

TomFinley commented 6 years ago

The following is a preliminary list of required scenarios for the direct access API that we will use to focus the work. The goal is that the experience for these should be good and unproblematic. Strictly speaking, everything here is possible to do right now using the components as they stand implemented today. However, I would say that it isn't necessarily a joy to do them, and there are lots of potential "booby traps" lurking in the code unless you do everything exactly correctly (e.g., #580).

Companion piece for #583.

/cc @Zruty0 , @eerhardt , @ericstj , @zeahmed , @CESARDELATORRE .

eerhardt commented 6 years ago

Overall, I like the list, thanks for writing this up, @TomFinley.

One place where I think it could be a little more extensive is the "inference/prediction" scenarios. The only real place I see them mentioned is in the 2nd bullet:

Serve the scenario where training and prediction happen in different processes (or even different machines).

It would be great if we could drill into the purely "inference/prediction" scenarios a bit more. For example, how to do inference in a multi-threaded application (say an ASP.NET WebAPI Controller). We should have some story here. Today, in our GitHub issue labeler service, we are deserializing the model on every request because PredictionModel isn't a thread-safe object.
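A minimal sketch of the "load once, predict many times" pattern, assuming the estimator/transformer API that eventually shipped rather than the legacy PredictionModel; the SentimentInput/SentimentOutput types and the ScoringService name are purely illustrative. The lock is the bluntest way to cope with the engine not being thread-safe:

```csharp
using Microsoft.ML;

// Hypothetical input/output types for illustration.
public class SentimentInput
{
    public string Text { get; set; }
}

public class SentimentOutput
{
    public bool PredictedLabel { get; set; }
}

// Load and deserialize the model once at startup, not on every request.
public class ScoringService
{
    private readonly PredictionEngine<SentimentInput, SentimentOutput> _engine;
    private readonly object _sync = new object();

    public ScoringService(string modelPath)
    {
        var mlContext = new MLContext();
        ITransformer model = mlContext.Model.Load(modelPath, out _);
        _engine = mlContext.Model.CreatePredictionEngine<SentimentInput, SentimentOutput>(model);
    }

    public SentimentOutput Predict(SentimentInput input)
    {
        // The engine reuses its internal buffers between calls, so access must be serialized.
        lock (_sync)
        {
            return _engine.Predict(input);
        }
    }
}
```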

Zruty0 commented 6 years ago

Yes, our PredictionEngine is not thread-safe.

We had a doc somewhere on three possible ways to make thread-safe predictions using PredictionEngine, each with different repercussions, but I'm struggling to find it now...
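One of those possible ways might look like the sketch below: give each thread its own engine, so no synchronization is needed at all. This again assumes the MLContext/PredictionEngine surface that later shipped and reuses the illustrative SentimentInput/SentimentOutput types from the sketch above; a pooled variant of the same idea eventually shipped as PredictionEnginePool in the Microsoft.Extensions.ML package.

```csharp
using System.Threading;
using Microsoft.ML;

public class PerThreadScorer
{
    private readonly ThreadLocal<PredictionEngine<SentimentInput, SentimentOutput>> _engines;

    public PerThreadScorer(string modelPath)
    {
        var mlContext = new MLContext();
        ITransformer model = mlContext.Model.Load(modelPath, out _);

        // One engine per thread: each keeps its own reusable buffers, so concurrent
        // requests never touch shared mutable state and no locking is needed.
        _engines = new ThreadLocal<PredictionEngine<SentimentInput, SentimentOutput>>(
            () => mlContext.Model.CreatePredictionEngine<SentimentInput, SentimentOutput>(model));
    }

    public SentimentOutput Predict(SentimentInput input) => _engines.Value.Predict(input);
}
```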

ArieJones commented 6 years ago

+1 for File-based saving of data. Right now I am trying to find a way to get the "featurized" data out of the pipeline so that I can feed it into a different system, such as R, to consume and do some analysis with.
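A sketch of what that could look like with the estimator/transformer API that later shipped: fit a featurization-only pipeline, transform the data, and write the result out with the text saver so another tool such as R can pick it up. File paths and column names below are illustrative.

```csharp
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// Load raw data (column names and file paths are illustrative).
IDataView data = mlContext.Data.LoadFromTextFile<InputRow>("input.csv", hasHeader: true, separatorChar: ',');

// A featurization-only pipeline: transforms, but no trainer at the end.
var pipeline = mlContext.Transforms.Text.FeaturizeText("TextFeatures", nameof(InputRow.Description))
    .Append(mlContext.Transforms.Concatenate("Features", "TextFeatures", nameof(InputRow.Price)));

IDataView featurized = pipeline.Fit(data).Transform(data);

// Write the featurized data out as delimited text that R (or any other tool) can read.
using (var stream = File.Create("featurized.tsv"))
    mlContext.Data.SaveAsText(featurized, stream, separatorChar: '\t', headerRow: true, schema: false);

public class InputRow
{
    [LoadColumn(0)] public string Description { get; set; }
    [LoadColumn(1)] public float Price { get; set; }
}
```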

ericstj commented 6 years ago

In addition to export of models, import of some models should work as well. It's also interesting to look at what capabilities are available after import (or even after loading a saved model). At a minimum we'd need to be able to predict. It would also be valuable to retrain and evaluate. I can imagine that for some imported models it may not be possible to modify the model in a meaningful way; if that's the case, we should be clear about that.
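For the load path, here is a sketch of the minimum bar described above, assuming the model-loading surface that eventually shipped and a binary-classification model with illustrative input columns: both prediction (transforming new data) and evaluation should work on a freshly loaded model.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// Load a previously saved model; the returned schema describes the input it expects.
ITransformer model = mlContext.Model.Load("model.zip", out DataViewSchema inputSchema);

// Predict after load: run new data through the loaded transformer chain.
IDataView newData = mlContext.Data.LoadFromTextFile<ModelInput>("new-data.tsv", hasHeader: true);
IDataView scored = model.Transform(newData);

// Evaluate after load: compute metrics against the labels in the new data.
var metrics = mlContext.BinaryClassification.Evaluate(scored, labelColumnName: "Label");
Console.WriteLine($"AUC after reload: {metrics.AreaUnderRocCurve}");

public class ModelInput
{
    [LoadColumn(0)] public bool Label { get; set; }
    [LoadColumn(1)] public string Text { get; set; }
}
```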

zeahmed commented 6 years ago

Building on what @ericstj said: since we are moving towards the estimators/transformers design, it would be a good idea to be able to reload the estimator pipeline (for retraining) from the saved model, or to have a way to save the estimator pipeline and load it back. This would make sharing of pipelines a bit easier, though it can currently be done with the C# code itself.
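To make the gap concrete, here is a sketch (column names and trainer choice are illustrative) of how things stand when only the fitted transformer chain is persisted: the estimator pipeline itself exists only as C# code, so retraining elsewhere means re-running that code.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();
IDataView trainingData = mlContext.Data.LoadFromTextFile<HouseData>("houses.csv", hasHeader: true, separatorChar: ',');

// Today the estimator pipeline exists only as C# code...
var pipeline = mlContext.Transforms.Concatenate("Features", nameof(HouseData.Size))
    .Append(mlContext.Regression.Trainers.Sdca(labelColumnName: nameof(HouseData.Price)));

// ...and only the fitted transformer chain is persisted to disk.
ITransformer model = pipeline.Fit(trainingData);
mlContext.Model.Save(model, trainingData.Schema, "model.zip");

// Retraining in another process therefore means re-running this same code on new data;
// a persistable estimator pipeline would remove the need to share the source.

public class HouseData
{
    [LoadColumn(0)] public float Size { get; set; }
    [LoadColumn(1)] public float Price { get; set; }
}
```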

TomFinley commented 6 years ago

It would be great if we could drill into the purely "inference/prediction" scenarios a bit more. For example, how to do inference in a multi-threaded application (say an ASP.NET WebAPI Controller)

That is a good point @eerhardt , I should mention multi-threaded prediction. (I have edited with an addendum "Multi-threaded prediction".) I am somewhat curious about why this solution of deserializing on every call -- assuredly people that internally use this library to serve web requests don't do that, so I wonder why it was necessary in that case. (Something peculiar to the abstractions adopted by the entry-point-based API, I'm guessing?) Anyway, yes, I should hope we can do better than this.

Is this an adequate description of the thing I should add? If you approve (or have suggestions and clarifications) I'll edit the issue and append it.

  • Multi-threaded prediction. A twist on "Simple train and predict", where we account for the fact that multiple threads may want predictions at the same time. Because we deliberately do not reallocate internal memory buffers on every single prediction, the PredictionEngine (or its estimator/transformer based successor) is, like most stateful .NET objects, fundamentally not thread safe. This is deliberate and as designed. However, some mechanism to enable multi-threaded scenarios (e.g., a web server servicing requests) should be possible and performant in the new API.

@ArieJones yes. We have some existing code for this in "savers" (e.g., the text saver, binary saver, etc.), but it is currently buried in the Runtime namespaces, which we're thinking of exposing (at least, much of it).

Next point:

In addition to export of models, import of some models should work as well.

@ericstj ah. So for example, not only export to ONNX/ONNX-ML, but also the ability to somehow incorporate ONNX/ONNX-ML pipelines into our pipelines. I agree that this would be useful. I'm unsure that I want to put it that directly. You'll note that I've tended to describe scenarios to elucidate what might be troublesome architectural issues with the abstractions in the API, and less to advocate for this or that specific component. (If I had been doing the latter exercise, I certainly would have mentioned incorporating a PyTorch or TensorFlow trainer and predictor. 😄) But perhaps I can put it like this:

  • Expressive transformation: The estimator and transformer architecture should be robust and expressive enough to cover scenarios of wrapping external components, e.g., a transformer capable of pumping a dataset through an ONNX-ML model, or one wrapping TensorFlow.
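As a rough illustration of what such a wrapping component could look like from the caller's side, here is a sketch using the ONNX scoring transform that later shipped in the Microsoft.ML.OnnxTransformer package; the model file, column names, and the requirement that "input" match the ONNX graph's input node are all illustrative assumptions.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();
IDataView data = mlContext.Data.LoadFromTextFile<FeatureRow>("features.tsv", hasHeader: true);

// The externally produced ONNX model is just another transform in the chain,
// composable with native ML.NET estimators before and after it. The column name
// "input" is assumed to match the input node of the ONNX graph.
var pipeline = mlContext.Transforms.Concatenate("input", nameof(FeatureRow.Feature1), nameof(FeatureRow.Feature2))
    .Append(mlContext.Transforms.ApplyOnnxModel(modelFile: "external-model.onnx"));

IDataView scored = pipeline.Fit(data).Transform(data);

public class FeatureRow
{
    [LoadColumn(0)] public float Feature1 { get; set; }
    [LoadColumn(1)] public float Feature2 { get; set; }
}
```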

What do you think @ericstj? Something about my description seems weak, since "sufficiently expressive" is such a vague requirement that I'm not sure it will be useful for @Zruty0 's task of writing samples to validate the API (whether estimator/transformer based or not).

@zeahmed thanks for your comment! Could you clarify this ask a bit more? While persistable transforms are essential to the framework being useful at all, the scenario for persistable estimators is less clear to me. Retaining information on how a model was trained is certainly important, but I consider this best served by the ideas behind so-called "model management," for which merely saving an estimator/trainer by itself would not necessarily be terribly helpful compared to, say, having the code that did the training actually checked in. But this may simply be my own lack of imagination.

zeahmed commented 6 years ago

Thanks @TomFinley. Yes, debuggability of the model is one of the scenarios behind saving the estimator pipeline.

However, I also see its use in the online training case, where it is essential to know which algorithm/estimator was used to generate the model. This is relevant to the following scenario:

  • Train with initial predictor: Similar to the simple train scenario, but the trainer is also given a previously trained predictor and continues training from it. The scenario might be one of the online linear learners that can take advantage of this, e.g., averaged perceptron.

I feel that having the code is not sufficient here, because during online training all the pipeline components will be used as-is except for the learner, which will be trained further. Maybe you already have a solution for this in mind, but I just wanted to point out this scenario in case.

Zruty0 commented 6 years ago

I think 'train with initial predictor' could be implemented simply as a version of the trainer's constructor that takes a 'linear predictor transformer' as one of the arguments and trains on top of it.

In this particular case I don't see a need to memorize the pipeline anywhere.
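A sketch of that idea, assuming the warm-start Fit overload that the online linear trainers (such as averaged perceptron) eventually exposed rather than a constructor argument; data files and column layout are illustrative:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();
IDataView firstBatch = mlContext.Data.LoadFromTextFile<Row>("batch1.tsv", hasHeader: true);
IDataView secondBatch = mlContext.Data.LoadFromTextFile<Row>("batch2.tsv", hasHeader: true);

var trainer = mlContext.BinaryClassification.Trainers.AveragedPerceptron(
    labelColumnName: nameof(Row.Label), featureColumnName: nameof(Row.Features));

// Initial training pass.
var firstModel = trainer.Fit(firstBatch);

// Continue training from the previously learned weights (a warm start); nothing
// about the rest of the pipeline needs to be memorized anywhere.
var updatedModel = trainer.Fit(secondBatch, firstModel.Model);

public class Row
{
    [LoadColumn(0)] public bool Label { get; set; }
    [LoadColumn(1, 10), VectorType(10)] public float[] Features { get; set; }
}
```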

ericstj commented 6 years ago

@TomFinley I'm looking for less of a promise of what will work and more of a promise of a description of what will work. I would imagine that some file formats behave like a save file, where you can resume doing anything you were able to do before saving, whereas some exports/imports are lossy: you cannot do as much with them once imported. I'm not sure I see the value (or even the testability) of an export without import.

Zruty0 commented 6 years ago

It is, in my view, totally fine to have one-way exports. Like 'export predictor as human-readable text'.

ericstj commented 6 years ago

@Zruty0 that's fair, though the end goal there is different. I would imagine we'd want a strong reason for doing one-way exports if the end is some functionality also supported by ML.NET; otherwise you create a channel for moving people away from ML.NET instead of toward it.

Zruty0 commented 6 years ago

@TomFinley While I was implementing TrainWithValidationSet and TrainWithInitialPredictor, I discovered that the current low-level API has certain peculiarities about them:

With the estimators-based API it would be possible to disallow all the above three cases at compile time, if we have a specialized Train method in them (which is what we plan on having).

eerhardt commented 6 years ago

@TomFinley - Your multi-threaded prediction write-up looks good to me. I've added it to the top post in this issue.

I am somewhat curious about why this solution of deserializing on every call -- assuredly people that internally use this library to serve web requests don't do that, so I wonder why it was necessary in that case. (Something peculiar to the abstractions adopted by the entry-point-based API, I'm guessing?) Anyway, yes, I should hope we can do better than this.

The reason is because that solution is using the PredictionModel<,> class, which wraps a BatchPredictionEngine<,> object, which isn't thread safe.

Zruty0 commented 6 years ago

Another batch of findings from the implementation work:

Zruty0 commented 6 years ago

I think that this issue can safely be closed now.