TomFinley commented 6 years ago

In this issue we describe a proposal to change the API. The core of the proposal is, instead of working via the entry-point runtime abstraction lying on top of the implementing code, we encourage people to use the implementing code directly.

Current State

Within ML.NET, for a component to be exposed in the "public" API, a component author follows the following steps (from an extremely high level):

The author writes a component, implementing some sort of central interface. Often this is something like IDataLoader, IDataTransform, ITrainer, or some other such type of object.
An "entry-point" wrapping object is created for that component. This is a purely functional view of components as having inputs (as fields in some sort of input class) and outputs (as fields in some sort of output class). This is decorated with attributes, to allow the dependency injection framework to do its work.
A JSON "manifest" describing all such components is created, through some process involving a scan of all .dlls and the aforementioned attributes.
Some other code reads this JSON "manifest" and out of it generates a number of C# classes. (This process being the code in CSharpApiGenerator.cs, the artifact of which is described in CSharpApi.cs.)

A user then works with this component in the following fashion.

The user constructs a LearningPipeline object.
They adds implementations of ILearningPipelineItem, which are sort of configuration objects. (These are some of the objects that were code generated.)
Through some process that is probably too complex to describe here, these ILearningPipelineItem are transmuted into a sort of abstract "graph" structure comprised of inputs and outputs. (This is an "entry-point" experiment graph.)
This graph structure is then serialized to JSON, de-serialized back out of JSON, then the actual underlying code that implements the operations is loaded using dependency injection.
Once loaded, the associated "settings" objects (which are actual types explicitly written in ML.NET) have their fields populated from values in this JSON.
There is some higher level runtime coordinating this process of graph nodes (the entry-point graph runner). This is a sort of runtime for the nodes, and handles job scheduling, variable setting, and whatnot.

The way this process works is via something called entry-points. Entry-points were conceived as a mechanism to enable a "regular" way to invoke ML.NET components from native code, that was more expressive and powerful than the command line. Essentially: they are a command-line on steroids, that instead of inventing a new DSL utilizes JSON. This is effective at alleviating the burden of writing "bridges" from R and Python into ML.NET. It also has advantages in situations where you need to send a sequence of commands "over the wire" in some complex fashion. While a few types would need to be handled (e.g., standard numeric types, IDataView, IFileHandle, and some others), so long as the entry-points used only those supported types, composing an experiment in those non-.NET environments would be possible.

Possible Alternate State

Instead of working indirectly with ML.NET components through the entry-point abstraction, you could just instantiate and use the existing classes directly. That is, the aforementioned IDataLoader, IDataTransform, ITrainer, and so forth would be instantiated and operated on directly.

While entry-points would still be necessary for any components we wished to expose through R or Python, we would constrain our usage to those applications where the added level of abstraction served some purpose.

This alternate pattern of usage is already well tested, as it actually reflects how ML.NET itself is written.

Changes for ML.NET

In order to move towards this state, a few high level adjustments will be necessary.

Low level API is based direct instantiations of IDataViews/ITrainer and other fundamental types and utilities already used within ML.NET code.
We will work to actively identify and improve that low level API from the point of view of usage. See the sequel for more in depth discussion of this point.
Writing higher level abstractions to make things easier should be encouraged, however always with the aim of making them non-opaque. That is, in edge cases when the abstraction fails, integrating what can be done with the abstraction with the lower level explicit API should be possible. Generally: Easy things should be easy and hard things should be possible.
To clarify: We are not getting rid of entry-points, because it remains the mechanism by which interop from non-.NET programming environments into TLC will continue to happen, and is therefore important. The shift is: the lower level C# API will not use entry-points. For the purpose of servicing GUI/Python/non-.NET bindings, we will continue in our own code to provide entry points, while allowing user code to work by implementing the core interfaces directly.

Examples of Potential Improvements in "Direct Access" API

We give the following concrete examples of areas that probably need improvement. The examples are meant to be illustrative only. That is: the list is not exhaustive, nor are specific "solutions" to problems meant to convey that something must be done in a particular way.

Instantiation of late binding components was previously always done via dependency injection. Therefore, all components have constructors or static create methods that have had identical signatures (e.g., for transforms, IHostEnvironment env, Arguments args, IDataView input). Direct instantiation by the user could use that, but would doubtless be better served by a more contextually appropriate constructor that reflects common use-cases. For example, this:
```
IDataTransform trans = new ConcatTransform(env, new ConcatTransform.Arguments()
{
  Column = new[] {
  new ConcatTransform.Column()
  {
      Name = "NumericalFeatures",
      Source = new[] { "SqftLiving", "SqftLot", "SqftAbove",   "SqftBasement",
          "Lat", "Long", "SqftLiving15", "SqftLot15" }
  }}
}, loader);
```
may become this:
```
IDataTransform trans = new ConcatTransform(env, loader, "NumericalFeatures",
  "SqftLiving", "SqftLot", "SqftAbove", "SqftBasement", "Lat", "Long",
  "SqftLiving15", "SqftLot15");
```
This can work both ways: if these objects are directly instantiated, the objects could provide richer information than merely being an IDataTransform, or what have you. Due to working via the command line, entry-points, or a GUI, it is considered almost useless for a component to have any purely programmatic access. So for example: we could have had the AffineNormalizer expose its slope and intercept, but we instead expose it by metadata instead. A direct accessor in ML.NET may be appropriate if we directly use these components.
Creating a transform and loader feels similar. However, creating a trainer, using it to provide a predictor, and then ultimately parameterizing a scorer transform with that predictor. Where possible we can try to harmonize the interfaces to make them seem more consistent. (Obviously not always possible since the underlying abstraction may in fact be genuinely different.)
Some parts of the current library introduce needless complexity: Train method on trainer is void, always followed by CreatePredictor. Other incidents of needless complexity may be less easy to resolve.
Some parts of the current library introduce needful complexity, but could probably be improved somehow. RoleMappedData creation and usage, while providing an essential service ("use this column for this purpose"), is incredibly difficult to use. When it was just an "internal" structure we just sort of dealt with it, but we would like to improve it. (In some cases we can hide its creation into auxillary helper methods, for example.)
Simple things like improving naming of things may just help a lot. For example: ScoreUtils.GetScorer returns a transform with the predictor's scores applied to data. ScoreUtils.GetScoredData or something may be a better name.
Our so-called "internal" methods do not always direct people towards pits of success. For example: some pipeline components should probably apply only during training (e.g., filtering, sampling, caching). Some distinction or other engineering nicety (e.g., have the utilities for saving models throw by default) may help warn people off this common misuse case.
Components of the existing API that deal with late-binding/dependency-injection stuff could potentially use delegates or something like entry-point style factory interfaces instead. This means among other things lifting out things like SubComponent from most code. Whether these delegates happen to be composed from the command line parser calling SubComponent.CreateInstance, or some entry-point "subgraph" generating a delegate out of its own graph, is the business of the command line parser and entry-point engine, not the component code itself. (Maybe the delegate just calls Run graph or something then binds the values.)

So for example what is currently this:
```
new Ova(env, new Ova.Argumnets() { Trainer = new SubComponent("sdcaR") );
```
might become this:
```
new Ova(env, host => new SdcaRegression(host));
```
When we think about transform chains and pipelines, both the existing and suggested systems have a need for an intermediate object capable of representing a pipeline before it is instantiated. That intermediate form must be something you can reason over, both to pre-verify pipelines, as well as for certain applications like suggested transforms/auto-ML. One example is issue #267.

Entry-points were an intermediate object, but being logically only JObjects you could not get rich information about what or how they would operate. (Given a pipeline in entry-points you could tell that something might be outputting a IDataView, for example, but have no information about what columns were actually in that output.)

This suggests that the API will want something like LearningPipeline, though I am quite confident LearningPipeline is an incorrect level of abstraction. (See the previous point about opaque abstractions, among other points.)

Note that many of these enhancements will serve not only users, but component authors (including us), and so improve the whole platform.

Miscellaneous Details

Note that C# code generation from entry-point graphs will still be possible: all entry-point invocations come down to (1) defining input objects, (2) calling a static method and (3) doing something with the output object. However it will probably not be possible to make it seem "natural" any more than an attempt to do code-generation from a mml command line would seem "natural."

When we decided to make the public facing API entry-points based, this necessarily required shifting related infrastructure (e.g., GraphRunner, JsonManifestUtils) into more central assemblies. Once that "idiom" is deconstructed, this infrastructure should resume its prior state of being in an isolated assembly.

Along similar lines of isolation, once we shift the components to not use SubComponent directly, we can "uplift" what is currently the command line parsing code out into a separate assembly.

justinormont commented 6 years ago

What would a longer Loader/Transforms/Learner pipeline look like?

For example, what would the, slightly longer, SentimentPredictionTests.cs look like in the new form?

Core of example:

var pipeline = new LearningPipeline();

pipeline.Add(new Data.TextLoader(dataPath)
{
    Arguments = new TextLoaderArguments
    {
        Separator = new[] { '\t' },
        HasHeader = true,
        Column = new[]
        {
            new TextLoaderColumn()
            {
                Name = "Label",
                Source = new [] { new TextLoaderRange(0) },
                Type = Data.DataKind.Num
            },

            new TextLoaderColumn()
            {
                Name = "SentimentText",
                Source = new [] { new TextLoaderRange(1) },
                Type = Data.DataKind.Text
            }
        }
    }
});

pipeline.Add(new TextFeaturizer("Features", "SentimentText")
{
    KeepDiacritics = false,
    KeepPunctuations = false,
    TextCase = TextNormalizerTransformCaseNormalizationMode.Lower,
    OutputTokens = true,
    StopWordsRemover = new PredefinedStopWordsRemover(),
    VectorNormalizer = TextTransformTextNormKind.L2,
    CharFeatureExtractor = new NGramNgramExtractor() { NgramLength = 3, AllLengths = false },
    WordFeatureExtractor = new NGramNgramExtractor() { NgramLength = 2, AllLengths = true }
});

pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });
pipeline.Add(new PredictedLabelColumnOriginalValueConverter() { PredictedLabelColumn = "PredictedLabel" });

var model = pipeline.Train<SentimentData, SentimentPrediction>();

TomFinley commented 6 years ago

Hi @justinormont . Here is an example "sketch" of code not exactly quite what you are talking about, but a different scenario. It probably gives you the idea though.

Crucially, there will be no LearningPipeline object -- transforms already by themselves form a pipeline, so the introduction of pipeline was a needless complication that serves no real purpose, except as an intermediate complicating layer that made lots of things awkward and impossible.

pkulikov commented 6 years ago

@TomFinley

Crucially, there will be no LearningPipeline object -- transforms already by themselves form a pipeline, so the introduction of pipeline was a needless complication that serves no real purpose

Does it help to think about it as dataflow blocks? Is it a good analogy? Though as I see from the example, separate transform steps are 'connected' at the construction time: the previous step is supplied through the constructor of a transform.

zeahmed commented 6 years ago

Thanks @TomFinley, the proposed changes seem reasonably good.

What would be the impact of new changes for end-users who are already using LearningPipeline object?

codemzs commented 6 years ago

@zeahmed There will not be a LearningPipeline object. The users will have to adapt to the new API.

codemzs commented 6 years ago

@TomFinley The example sketch looks great. Can't wait for the new API to be rolled out.

jwood803 commented 6 years ago

I'm liking how the example looks as well. I'm guessing this may be started on after version 0.3 gets released?

isaacabraham commented 6 years ago

This sounds good in principle - removing unnecessary abstractions always a good thing. I wonder how this will work alongside F#. cc @mathias-brandewinder

TomFinley commented 6 years ago

Hi @pkulikov -- maybe. LINQ is I feel maybe a little closer in intent and structure, but all three idioms have in common that the relation between items is explicit... i.e., in TPL Dataflow you have LinkTo between blocks, in LINQ you pass the source in as the this param of the extension method, and in these transforms in ML.NET they tend to be passed in via the constructor or static create method. In all cases, there's an explicit connection.

@zeahmed aside from the absence of this container "pipeline" idiom, my expectation is that some confusion will arise because when you call new the thing is actually trained, because it needs to create its schema.

@jwood803 yes, I think that was the idea. More specifically, I guess the idea is, I think I or whoever else is working on this would take the stuff in Microsoft.ML.Runtime, use it directly, then try our best to smooth over any awkward bits. Some parts are already pretty obvious, as seen in the list in this issue.

@isaacabraham hmmm. This is something I've heard before. Hopefully we'll have some review from an F# perspective. Some of the idioms around buffer sharing in particular seem very non-idiomatically F#.

KrzysztofCwalina commented 6 years ago

In general I LOVE the proposal. I do have two comments though:

Re "We are not getting rid of entry-point": It would be good to make it explicit that the entry-point like support will be in a separate dll.
I don't think we should totally get rid of the LearningPipeline. I totally agree that "Writing higher level abstractions to make things easier should be encouraged, however always with the aim of making them non-opaque. That is, in edge cases when the abstraction fails, integrating what can be done with the abstraction with the lower level explicit API should be possible." But this to me means the learning pipeline abstraction needs to be designed correctly, not necessarily removed. And here is how I would morph the abstraction: the LearningPipeline should be an API to infer and tweak a pipeline composed from the low level components. i.e. based on input schema it would create a pipeline and then let the developers tweak the piplene (from the default guess) to something better. This is similar how one of the old UI tools @glebuk showed me works.

codemzs commented 6 years ago

@KrzysztofCwalina I think you are referring to Machine Learning Recipes that auto creates a pipeline for the user by inferring the dataset schema. I wrote that piece of code and it creates the pipeline by using low level API as shown in @TomFinley 's example and by applying some heuristics to auto featurize the columns and then adding an appropriate learner in the end. You are right it will be a great convenience for the users to quickly get started with ML and then they can fine tune. This would be a layer on top of the low level APIs.

First we need to enable low level APIs by moving the constructors for the components under the correct namespace and as well creating convenience constructors. Then as a step two, we can think about enabling recipes. How does the plan sound?

KrzysztofCwalina commented 6 years ago

@codemzs sounds good to me.

TomFinley commented 6 years ago

Hi @KrzysztofCwalina thank you for taking the time to reply.

Regarding the first point I completely agree. See the second-to-last paragraph that begins "When we decided to make the public facing API entry-points based..." Despite perhaps minor difference in word choice (e.g., I say "an isolated assembly," you say "a separate DLL") I think our agreement on this point is total.

Regarding pipelines, by which I think you mean an intermediate form representing transforms prior to their instantiation, that's a bit more tricky. I strongly agree that something will need to be done. It's getting from there to a specific plan that gets me; I don't have clear ideas on what a good solution looks like. I'm quite certain the current LearningPipeline isn't it, and generally I'm incredibly suspicious of any solution that claims to solve the problem using code-gen. As you say, this would have to be designed correctly. It is something we must think about though I don't have as clear of a plan here as I do in some other places. I however had added an item to the list, the new one beginning, "When we think about transform chains and pipelines".

The recipe code I guess as mentioned by @codemzs has its own level of an abstraction of a "promise" of what it wants to do, but that level of abstraction is very specific to recipes specifically. That doesn't make it bad -- it works for its purpose -- but I wonder if something more universal is possible. The scenario I'd love to solve is #267, and ideally whatever we use to solve that would, I hope, just be statically typed so many failure scenarios simply do not happen, you get intellisense help for what you can do (hopefully), and stuff like that.

KrzysztofCwalina commented 6 years ago

Sounds good about the DLL separation.

I wonder if something more universal is possible

Let's chat about this when we finish the redesign described in this proposal. The universal helper would build on top of the low level APIs after all.

Zruty0 commented 6 years ago

@TomFinley, I think this issue might get superseded by #581 ?

As in, if we make estimators/transformers the lowest-level user-facing primitives, the need for convenience constructors for loaders, transforms and trainers will be folded into the need to separate estimators for them.

TomFinley commented 6 years ago

Hi @Zruty0 , certainly it is informed by it. I view #581 about dealing with a fundamental change to the infrastructure ("trainers and transforms superceded), whereas this is more like, whatever those fundamental structures are (whether they take their current form or the proposal in #581 or some other thing), they should form the basis of the public API, rather than being opaquely wrapped. Also central to the proposal is that these components should be easy to call. Those two points, the former philosophical, the latter user-facing practical, is not mentioned at all in #581.

TomFinley commented 5 years ago

Hi @Zruty0 (or others), just going over my issues again. I sort of view the key issue here (we ought to just use our components directly rather than working through some odd abstraction layer) well settled to the point where there appears to be no debate any longer, and in the form of the particulars of what those components are there has been enough refinement to the point where I feel like the specific code raised in the issue is no longer useful, since it predates estimators/transformers, and there is now plenty of code that shows that working anyway.

Zruty0 commented 5 years ago

I sort of view the key issue here (we ought to just use our components directly rather than working through some odd abstraction layer) well settled to the point where there appears to be no debate any longer, and in the form of the particulars of what those components are there has been enough refinement to the point where I feel like the specific code raised in the issue is no longer useful, since it predates estimators/transformers, and there is now plenty of code that shows that working anyway.

That was one heck of a long sentence :) I agree with @TomFinley

dotnet / machinelearning

Proposal for Major Change in API #371

Current State

Possible Alternate State

Changes for ML.NET

Examples of Potential Improvements in "Direct Access" API

Miscellaneous Details