Direct API: Static Typing of Data Pipelines #632

Closed TomFinley closed 6 years ago

TomFinley commented 6 years ago

Currently, in all iterations of the pipeline concept -- whether based on the v0.1 idiom of LearningPipeline, the #371 proposal where IDataView is directly created, the refinement of that in #581, the convenience constructors, or whatever -- there is always this idea of a pipeline being a runtime-checked thing, where each stage has some output schema with typed columns indexed by a string name, and all of this is known only at runtime. At compile time, all the compiler knows is that you have some estimator, or some data view, or something like that, but it has no idea what is in it.

This makes sense from a practical perspective, since there are many applications where you cannot know the schema until runtime. E.g., when loading a model from a file, or loading a Parquet file, you aren't going to know anything until the code actually runs. So we want the underlying system to remain dynamically typed to serve those scenarios, and I do not propose changing that. That said, there are some definite usability costs:

It's sort of like working with Dictionary<string, object> as your central data structure, and an API that just takes Dictionary<string, object> everywhere. In a way that's arbitrarily powerful, but the language itself can give you no help at all about what you should do with it, which is kind of a pity since we have this nice statically typed language we're working in.

So: a sufficiently powerful statically typed helper API on top of this would help increase confidence that if something compiles, it will run, and would also give you help in the form of proper IntelliSense about what you can do, while you are typing and before you've run anything. Properly structured, with strong typing at the columnar level, nearly everything you can do becomes automatically discoverable through IntelliSense. The documentation would correspondingly become a lot more focused.

The desire to have something like this is very old, but all prior attempts I recall ran into serious problems sooner or later. In this issue I discuss such an API that I've been kicking around for a little while; so far it doesn't seem to have any show-stopping problems, at least none that I've discovered in my initial implementations.

The following proposal is built on top of #581. (For those seeking actual code, the current exploratory work in progress is based out of this branch, which in turn is based off of @Zruty0's branch here.)

Simple Example

It may be that the easiest way to explain the proposal is to show a simple example, then explain it. The example trains a sentiment classifier, though I've simplified the text settings down to just the diacritics option.

// We load two columns, the boolean "label" and the textual "sentimentText".
var text = TextLoader.Create(
    c => (label: c.LoadBool(0), sentimentText: c.LoadText(1)),
    sep: '\t', header: true);

// We apply the text featurizer transform to "sentimentText" producing the column "features".
var transformation = text.CreateTransform(r =>
    (r.label, features: r.sentimentText.TextFeaturizer(keepDiacritics: true)));

// We apply a learner to learn "label" given "features", which will in turn produce
// float "score", float "probability", and boolean "predictedLabel".
var training = transformation.CreateTransform(r =>
    r.label.TrainLinearClassification(r.features));

An alternative is a continuous, non-segmented form, where the stages are all merged into a single thing:

var pipeline = TextLoader.Create(
    c => (label: c.LoadBool(0), sentimentText: c.LoadText(1)),
    sep: '\t', header: true)
    .ExtendWithTransform(r => (r.label, features: r.sentimentText.TextFeaturizer(keepDiacritics: true)))
    .ExtendWithTransform(r => r.label.TrainLinearClassification(r.features));

or even the following:

var pipeline = TextLoader.Create(c =>
    c.LoadBool(0).TrainLinearClassification(c.LoadText(1).TextFeaturizer(keepDiacritics: true)));

Developer Story

Here's how I imagine this playing out for someone, maybe someone like me. So: first we have this TextLoader.Create method. (Feel free to suggest better names.)

[screenshot: IntelliSense on the TextLoader.Create call]

Given that setup, there is, I think, only one thing here that cannot plausibly be discovered via IntelliSense, or the XML docs that pop up with it, and that is the fact that you would want to start the pipeline with something like TextLoader.Create. But I figure this will be so ubiquitous, even in "example 1", that we can get away with it. There's also the detail that training happens through a "label", and unless that column happens to have the right type (Scalar<bool>) the training method simply won't show up for them. But someone reading documentation on the linear classifier would surely see that extension method and figure out what to do with it.

More Details

Now we drill a little bit more into the shape and design of this API.

PipelineColumn and its subclasses

As we saw in the example, many transformations are indicated by the type of data. For this we have the abstract class PipelineColumn, which is manifested to the user through abstract subclasses such as Scalar<>, Vector<>, NormVector<>, and Key<>, the types that appear throughout the examples in this proposal.
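
To make the shape of that hierarchy concrete, here is a loose sketch. The subclasses named are the ones that appear elsewhere in this proposal; the members (and whether NormVector<> derives from Vector<>) are my assumptions, with the actual definitions living in the prototype branch.

public abstract class PipelineColumn { }

public abstract class Scalar<T> : PipelineColumn { }     // one value per row
public abstract class Vector<T> : PipelineColumn { }     // a fixed-size vector per row
public abstract class NormVector<T> : Vector<T> { }      // a vector known to be normalized
public abstract class Key<T, TVal> : PipelineColumn { }  // key (categorical) data with values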

ValueTuples of PipelineColumns

The pipeline columns are the smallest-granularity structures. Above that you have collections of these, representing the values present at any given point, upon which you can apply more transformations. That collection, as mentioned earlier, is a potentially nested value tuple. By potentially nested, what I mean is that you can nest ValueTuples as deeply as you want. So all of the following are fine, if we imagine that a, b, and c are each some sort of PipelineColumn:

(a, b)
(a, x: (b, c))
a

In the first case the actual underlying data view, when produced, would have two columns named a and b. In the second, there would be three columns: a, x.b, and x.c. In the last, since as far as I can tell there is no way to have a single-element named ValueTuple<>, I just picked the name Data for now. (Note that where value tuples are present, the names of the items become the names of the output columns in the data-view schema.)

The reason for supporting nesting is that some estimators produce multiple columns (notably, in the example, the binary classification trainer produces three columns), and as far as I can tell there is no way to "unpack" a returned value tuple into another value tuple. Nesting also provides a convenient way to bring along all the inputs, if we wanted to do so, by just assigning the input tuple itself as an item in the output tuple.
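
To illustrate, here is a small sketch reusing the hypothetical extension methods from the examples in this proposal; the nested tuple below would surface in the data-view schema as the columns label, txt.tokens, and txt.features.

var withNesting = text.CreateTransform(r => (
    r.label,
    txt: (tokens: r.sentimentText.Tokenize(),
          features: r.sentimentText.TextFeaturizer(keepDiacritics: true))));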

The Pipeline Components

At a higher level than the columns, and the (nested) tuples of columns, you have the objects that represent the pipeline components, which describe each step of what you are actually doing with these things. That is, those objects describe mappings into those value tuples, or between them. To return to the example with text, transformation, and training, these have the following types, in the sense that all of the following statements in code would be true:

text is DataReaderEstimator<IMultiStreamSource,
    (Scalar<bool> label, Scalar<string> sentimentText)>;

transformation is Estimator<
    (Scalar<bool> label, Scalar<string> sentimentText),
    (Scalar<bool> label, Scalar<float> features)>;

training is Estimator<
    (Scalar<bool> label, Scalar<float> features),
    (Scalar<float> score, Scalar<float> probability, Scalar<bool> predictedLabel)>;

and likewise for the "omnibus" equivalents:

pipeline is DataReaderEstimator<IMultiStreamSource,
    (Scalar<float> score, Scalar<float> probability, Scalar<bool> predictedLabel)>;

One may note that the statically typed API is strongly parallel to the structures proposed in #581. That is, for every core structure following the IEstimator idiom laid out in #581, I envision a strongly typed variant. In the current working code the objects actually implement those interfaces, but I might switch to having them wrap them instead.

Like the underlying dynamically typed objects, they can be combined in the usual way to form cohesive pipelines. So, for example, one could take a DataReaderEstimator<TIn, TA> and an Estimator<TA, TB> to produce a DataReaderEstimator<TIn, TB>. (That is what was happening above when I used ExtendWithTransform instead of CreateTransform.)
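
For instance, a sketch in terms of the running example; the combining method name Append is my own placeholder, since this proposal does not fix it.

// Reader estimator + estimator over its output shape = reader estimator over the
// estimator's output shape. ExtendWithTransform in the earlier example does this in one step.
var readerWithTransform = text.Append(transformation);
// readerWithTransform is DataReaderEstimator<IMultiStreamSource,
//     (Scalar<bool> label, Scalar<float> features)>
var fullPipeline = readerWithTransform.Append(training);
// fullPipeline is DataReaderEstimator<IMultiStreamSource,
//     (Scalar<float> score, Scalar<float> probability, Scalar<bool> predictedLabel)>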

This duality is deliberate. While the usage of the static estimators will necessarily not resemble the dynamically typed estimators, based as it is on actual .NET types and identifiers, the structure that is being built up is an estimator based pipeline, and so will resemble it structurally. This duality enables one to use static-typing for as long as is convenient, then when done drop back down to the dynamically typed one. But you could also go in reverse, start with something dynamically typed -- perhaps a model loaded from a file -- essentially assert that this dynamically typed thing has a certain shape (which of course could only be checked at runtime), and then from then on continue with the statically-typed pipe. So as soon as the static typing stops being useful, there's no cliff -- you can just stop using it at that point, and continue dynamically.

However, if you can stay in the statically typed world, that's fine. You can fit a strongly typed Estimator to get a strongly typed Transformer. You can then further get a strongly typed DataView out of a strongly typed Transformer. In the end this is still just a veneer, kind of like the PredictionEngine stuff, but it's a veneer that has a strong likelihood of working.
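
As a sketch of what that might look like, assuming the strongly typed surface mirrors the Fit/Transform idiom of #581 (the member names are assumptions, and featurizedData stands for a strongly typed DataView with the (label, features) shape obtained from earlier steps):

var model = training.Fit(featurizedData);      // strongly typed Transformer
var scored = model.Transform(featurizedData);  // strongly typed DataView exposing
                                               // score, probability, predictedLabel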

One or Two Implementation Details

The following is not something that most users will need to concern themselves with, and we won't go into too many details. However at least a loose idea of how the system works might help clear up some of the mystery.

The Scalar<>, Vector<>, etc. classes are abstract classes. The PipelineColumns that are created from the helper extension methods have actual concrete implementations, intended to be nested private classes in whatever estimator they're associated with. A user never sees those implementations. The component author is responsible for calling the protected constructor on those objects, so as to feed it the list of dependencies (the PipelineColumns that need to exist before this one can chain its own estimator), as well as a little factory object, for now called a "reconciler," that the analyzer can call once it has satisfied those dependencies.
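
A rough sketch of what such a hidden column might look like; the names and the base-constructor shape here are purely illustrative, with the real details living in the prototype branch.

private sealed class OutFeaturesColumn : Vector<float>
{
    public OutFeaturesColumn(Scalar<string> input, bool keepDiacritics)
        // The protected base constructor takes the reconciler (the factory the
        // analyzer will later call to build the actual TextTransform IEstimator)
        // and the columns this one depends on.
        : base(new TextFeaturizerReconciler(keepDiacritics), input)
    {
    }
}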

The analyzer itself takes the delegate. It constructs the input object, then pipes it through the delegate. In the case of an estimator, those inputs are not the columns returned from any prior delegate (indeed, there is no requirement that there be a prior delegate -- estimators can function as independent building blocks), but special instances made for that analysis task. The resulting output will be a value tuple of PipelineColumns, and by tracing back through their dependencies we recover the dependency graph.

The actual constructed inputs have no dependencies, and are assumed to just be there already. We then iteratively "resolve" dependencies -- we take all columns whose dependencies are resolved, and take some subset of those that all have the same "reconciler." That reconciler is responsible for returning the actual IEstimator. The columns it covers then count as resolved, which may unblock further columns, and so on.
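
In pseudocode-level C#, the loop described above might look something like this. Column and IReconciler are illustrative stand-ins for this sketch, not the actual prototype types; IEstimator<> and ITransformer are the #581 interfaces (the exact namespace may differ).

using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.Core.Data; // IEstimator<>, ITransformer per #581 (exact namespace may differ)

internal static class AnalyzerSketch
{
    internal sealed class Column
    {
        public IReconciler Reconciler;                 // null for the pre-existing inputs
        public readonly List<Column> Dependencies = new List<Column>();
    }

    internal interface IReconciler
    {
        IEstimator<ITransformer> Reconcile(IReadOnlyList<Column> columns);
    }

    internal static List<IEstimator<ITransformer>> ResolveAll(HashSet<Column> pending)
    {
        var estimators = new List<IEstimator<ITransformer>>();
        var resolved = new HashSet<Column>(pending.Where(c => c.Reconciler == null));
        pending.ExceptWith(resolved);                  // inputs are assumed to just be there

        while (pending.Count > 0)
        {
            // Take some subset of ready columns that all share the same reconciler...
            var group = pending
                .Where(c => c.Dependencies.All(resolved.Contains))
                .GroupBy(c => c.Reconciler)
                .First();
            // ...that reconciler returns the actual IEstimator covering them...
            estimators.Add(group.Key.Reconcile(group.ToList()));
            // ...and those columns now count as resolved, unblocking their dependents.
            foreach (var c in group.ToList()) { pending.Remove(c); resolved.Add(c); }
        }
        return estimators;
    }
}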

In this way, these delegates are declarative structures. Each extension method merely provides these PipelineColumn implementations as objects; it is the analyzer that figures out in what sequence the reconcilers will be called, with what names, etc.

It might be clearer if we look at the actual engine:

https://github.com/TomFinley/machinelearning/blob/8e0298f64f0a9f439bb83426b09e54967065793b/src/Microsoft.ML.Core/StrongPipe/BlockMaker.cs#L13

The system mostly has fake objects everywhere as stand-ins right now, just to validate the approach; so, for example, if I actually run the code in the first example, I get the following diagnostic output. (It should be relatively easy to trace the diagnostic output back to the code.)

Called CreateTransform !!!
Using input with name label
Using input with name sentimentText
Constructing TextTransform estimator!
    Will make 'features' out of 'sentimentText'
Exiting CreateTransform !!!

Called CreateTransform !!!
Using input with name label
Using input with name features
Constructing LinearBinaryClassification estimator!
    Will make 'score' out of 'label', 'features'
    Will make 'probability' out of 'label', 'features'
    Will make 'predictedLabel' out of 'label', 'features'
Exiting CreateTransform !!!

If I had another example, like this:

var text = TextLoader.Create(
    ctx => (
    label: ctx.LoadBool(0),
    text: ctx.LoadText(1),
    numericFeatures: ctx.LoadFloat(2, 9)
    ));

var transform = text.CreateTransform(r => (
    r.label,
    features: r.numericFeatures.ConcatWith(r.text.Tokenize().Dictionarize().BagVectorize())
    ));

var train = transform.CreateTransform(r =>
    r.label.TrainLinearClassification(r.features));

then the output looks a little something like this:

Called CreateTransform !!!
Using input with name label
Using input with name numericFeatures
Using input with name text
Constructing WordTokenize estimator!
    Will make '#Temp_0' out of 'text'
Constructing Term estimator!
    Will make '#Temp_1' out of '#Temp_0'
Constructing KeyToVector estimator!
    Will make '#Temp_2' out of '#Temp_1'
Constructing Concat estimator!
    Will make 'features' out of 'numericFeatures', '#Temp_2'
Exiting CreateTransform !!!

Called CreateTransform !!!
Using input with name label
Using input with name features
Constructing LinearBinaryClassification estimator!
    Will make 'score' out of 'label', 'features'
    Will make 'probability' out of 'label', 'features'
    Will make 'predictedLabel' out of 'label', 'features'
Exiting CreateTransform !!!

You can sort of trace through what the analyzer is doing as it resolves dependencies, constructs IEstimators, etc. (Obviously the real version won't have all those little console writelines everywhere.)

Stuff Not Covered

There's a lot of stuff I haven't yet talked about. We create these blocks -- how do we mix and match them? What does the strongly typed Transformer or DataView look like? We talked about the text loader; what about sources that come from actual .NET objects? These we might cover in future revisions of this issue, or in subsequent comments. But I think perhaps this writing has gone on long enough...

/cc @Zruty0 , @ericstj , @eerhardt , @terrajobst , @motus

Zruty0 commented 6 years ago

#639 has a mock-up of the 'getting started' example that takes advantage of this proposal.

TomFinley commented 6 years ago

There are a couple technical issues that have cropped up. I would like to expand on them, and suggest possible resolutions to them.

Operations only in context

One issue is that of context. We have these PipelineColumn objects, on which you can, through extension methods declared by the appropriate estimators, indicate that you want a certain operation. But we will be using the shape types in multiple contexts, and sometimes you should not be able to perform certain operations on them -- for example, once something is fit, you should not be able to apply more estimators to it, because it has already been fit.

It would be ideal if we could take the shape types, that is, these possibly recursive value tuples, and mechanically produce equivalent types with just the item types changed (so that, when fitting, something like (Scalar<int> a, Vector<float> b) would become (Fit<Scalar<int>> a, Fit<Vector<float>> b)), but this does not appear to be possible short of code generation. So let us suppose that the type-shape parameters for the typed estimators must be the same as those for the typed transformers we get out of them, and then for the typed data views we get out of the transformers.

The extension methods on, say, Scalar<int> for declaring the estimators we want to apply should be available only in the appropriate context, namely where we are declaring an estimator, as in the call to CreateTransform. If the above sort of "item-type mapping" were possible, we could simply do it, and the extension methods would not be declared for Fit<> or Data<> containers of columns, only for the columns themselves.

We could also solve the problem by having, as we do for the TextLoader example, some sort of "context" object available wherever it is possible to write estimators; all estimator-only extension methods would require that object, which would be furnished only where appropriate. This works within the existing type system, but it is verbose and obnoxious, since that argument serves no purpose whatsoever except to make calling the estimator extension methods possible. (This is as opposed to the context objects for the text loader, which serve an immediate and useful purpose.)
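
For concreteness, a sketch of that workaround; EstimatorContext and the method shape below are invented for illustration only.

// The ctx parameter exists only to restrict where the method can be called; only
// the framework can construct an EstimatorContext, and it is only handed out
// inside CreateTransform.
public sealed class EstimatorContext
{
    internal EstimatorContext() { }
}

public static class TextStaticExtensionsSketch
{
    public static Vector<float> TextFeaturizer(
        this Scalar<string> input, EstimatorContext ctx, bool keepDiacritics = false)
        => throw new System.NotImplementedException(); // body elided for the sketch
}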

My preferred solution at this moment might be to incorporate this checking into a Roslyn analyzer. We must already write a Roslyn analyzer to ensure that the shape type declared by the user is valid, assuming we want errors of that form to be caught at compile time (which is of course the point of this work), so perhaps it can catch this too.

Introspection

In #584 there is the issue of "introspective training," that is, people often want to see what was trained, as opposed to just treating models as black boxes. For example: the weights of a linear model, the slope and offset learned by an affine normalizer, or the topics and clusters discovered by LDA and clustering trainers.

The point is that models, far from being black boxes, often in the course of analyzing their input data acquire information people will want to know. Indeed, some components like LDA and clustering are almost useless except for their ability to provide this data.

For this reason, the code for estimators as described in #581 and as we see here, has a type parameter:

https://github.com/dotnet/machinelearning/blob/60ae981223e83b174ecaaf528bd51814a6b0835c/src/Microsoft.ML.Core/Data/IEstimator.cs#L211

This is all very well and good, but it does present a problem in this work. When the Fit method is called, we will have a strongly typed transformer, so we can then get the weights out of it, and so on. Yet what is this statically typed pipeline to do? We have a rather declarative syntax. If we say CreateTransform(r => (C: r.A.Op1(), D: r.B.Op2())), it is unclear whether the estimator that is actually applied corresponds to Op1, or to Op2... or to neither, because the CopyColumns estimator could also be applied. And even if it were possible to state this definitively, we probably would not want to, since this sort of declarative environment is precisely where people like @interesaaat can do good work. So even if users somehow knew what the underlying types were, it would not be safe for them to try to cast to those types! For that reason, the estimator returned from CreateTransform is an IEstimator<ITransformer>, nothing more specific!

Yet this is so perverse! The mechanism by which we deliver this static type checking of pipelines simultaneously makes it difficult for us to exploit the benefit we achieved by giving IEstimator a generic type parameter in the first place. Nor can we rely on the type returned from those extension methods -- the analysis step for various reasons requires that it be one of the base types of PipelineColumn (since it has to instantiate synthetic instances of it to feed whatever "next steps" may arise), and even if this restriction did not exist, the thing is only an estimator.

If we cannot rely on the type returned from these extension methods, and we cannot know anything definite about the pipeline itself since the estimators may be applied in any sensible order, then the only possible choice, I think, is to rely on something passed into those extension methods. The most obvious choice is a delegate, perhaps on a parameter we canonically name onFit, that is called whenever Fit is called on the estimator chain. In the case of an affine normalizer, it might be something like Action<(VBuffer<float> slope, VBuffer<float> offset)> onFit. (We'd probably actually declare a delegate type for this.) Obviously this must always be optional, for those cases where we do not care.
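
A sketch of how that might read at the call site, with a hypothetical Normalize extension carrying the onFit parameter; the method name and tuple layout are assumptions, and data stands for a strongly typed object with label and features columns.

VBuffer<float> slope = default, offset = default;
var est = data.CreateTransform(r => (
    r.label,
    features: r.features.Normalize(
        onFit: p => { slope = p.slope; offset = p.offset; })));
// After Fit is eventually called on the chain, slope and offset hold the learned
// affine parameters; if we don't care, we simply omit onFit.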

Normalization

When @Zruty0 was working on #716, it became clear that the scheme for having trainers auto-cache and auto-normalize had fairly deep implications for the type signatures of trainers, the result of training, and so forth, which severely complicated the API -- running afoul of the intent of this effort in the first place, which was to make things simpler, not harder, for users.

There is also a philosophical point that I subscribe to, that APIs that try to outsmart their users are inherently evil, and ultimately lead to more confusion, not less -- as Python's PEP 20 puts it, explicit is better than implicit. Yet agreement on this point is not yet total.

We have already discussed the need for a Roslyn analyzer to ensure the correct usage of the idioms in this proposal, and probably in ML.NET more broadly. The thinking, at least at the time I write this, is that trainers will on the whole accept Vector<> types, but they can also signal via an attribute that they want normalized data. But what can we do with that?

In the initial sketches of this, we have already established that metadata with the potential to affect types in the schema ought to have PipelineColumn subclasses of its own -- hence things like Key<T, TVal>. For this reason, we also have NormVector<T>. Trainers with parametric handling of their input values can accept Vector<T>, but if the instance passed in is anything other than a NormVector<T>, the library's Roslyn analyzer can produce a warning with suggested remediations.
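
A sketch of how that signal might look; the attribute and the names below are invented for illustration, and only the analyzer, not the runtime, would consume the attribute.

public sealed class WantsNormalizedInputAttribute : System.Attribute { }

public static class LinearClassificationStaticExtensionsSketch
{
    // The analyzer sees the attribute and warns if the features argument is a plain
    // Vector<float> rather than a NormVector<float>, suggesting a normalizer be added.
    [WantsNormalizedInput]
    public static (Scalar<float> score, Scalar<float> probability, Scalar<bool> predictedLabel)
        TrainLinearClassification(this Scalar<bool> label, Vector<float> features)
        => throw new System.NotImplementedException(); // body elided for the sketch
}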

Mappings into and out of typed structures

Key to the idioms in ML.NET is the idea that once you train a model, you can have a typed version of it.

It is of course essential that we be able to take some structure, like say an IEnumerable<T>, and somehow map it into a data view, in a way that is somehow statically typed. This is uncontroversial -- we already have a mechanism for this in the non-statically-typed world via the so-called PredictionEngine, which is generic, but relies on reflection-driven analysis of types.

The input problem, mapping in from an IEnumerable<T>, can be handled much like the text loader, except parameterized with delegates -- that is, the context can produce a Scalar<int> by, perhaps, being given a Func<T, int>.
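
A hypothetical sketch of that direction, mirroring the TextLoader pattern but keyed off per-column delegates instead of column indices; all of the names below (ObjectReader, Map) are invented, and SentimentData is assumed to be a POCO with bool Label and string Text members.

var reader = ObjectReader.Create<SentimentData>(
    c => (label: c.Map(x => x.Label),           // yields Scalar<bool>
          sentimentText: c.Map(x => x.Text)));  // yields Scalar<string>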

What is more controversial is the claim that we will want to use this scheme for the output of data as well, since some assume that we will mostly want that output when doing prediction, which most of the time involves loading models, which in turn means we cannot necessarily assume anything about the types of the objects we are loading. However, let us suppose for a moment that we will have to solve both problems, since the solutions to both resemble each other.

In addition to being lower priority, the "mapping out" scenario also happens to be considerably harder to achieve: conceptually, I am having difficulty imagining a graceful way to map the shape tuple into an output object. As yet I don't see a clean interface for this.

Wrapping vs. implementation

Under the current prototype code, the statically typed objects are instances of the dynamically typed interfaces; for example, Estimator<TTupleInShape, TTupleOutShape, TTransformer> is also an IEstimator<TTransformer>. The implication, though, is that the object must explicitly implement the interface, so as to avoid confusion, name collisions, and whatnot.

The alternative to this is that these statically typed objects do not implement the interface, but instead can return via a property the dynamically typed object they are wrapping, if someone needs to go to that level.

It may be necessary to go the "wrapping" rather than "implementing" route, because there are convenience extension methods on the interfaces themselves that would be confusing to people.
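
A sketch of what the wrapping approach might look like; the AsDynamic property name is my own placeholder.

public sealed class Estimator<TTupleInShape, TTupleOutShape, TTransformer>
    where TTransformer : class, ITransformer
{
    private readonly IEstimator<TTransformer> _est;

    internal Estimator(IEstimator<TTransformer> est) { _est = est; }

    // Escape hatch: hand back the dynamically typed estimator for anyone who needs
    // to drop down to that level, without the typed object itself implementing
    // (and therefore surfacing) the IEstimator<> interface.
    public IEstimator<TTransformer> AsDynamic => _est;
}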

Code name for this work

As a notice to prevent any possible confusion: Internally at Microsoft @Zruty0 and I jokingly came up with the descriptor "PIpelines with Generic Static TYpes," that is, PiGSTy, as a name for this work. In case the name "Pigsty" gets thrown around in other discussions, this broad effort is what it refers to.