dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.03k stars 1.89k forks source link

Expand PiGSTy's schema shape type support #1085

Closed TomFinley closed 6 years ago

TomFinley commented 6 years ago

PiGSTy's (#632) schema propagation system relies on a static type that is currently restricted to be a value-tuple. There is, however, no fundamental reason for it to be value-tuples, or only value-tuples; my motivation was merely just that if we consider the C# syntax for declaring value-tuples vs. anonymous types:

var x = (a: 1, b: 2);
var y = new { A = 1, B = 2 };

one clearly had more elegant syntax than the other. Yet there are reasons to also support classes with PipelineColumn properties. (This includes but is not limited to anonymous classes).

  1. Intellisense for value-tuples, especially in the context of generics and especially especially in the context of delegates, is currently not the best. This may be addressed, or it may not. At any rate even if addressed we would want our API to have a good experience in older versions of Visual Studio.

  2. My possibly imperfect understanding from @dsyme is that F# can work with anonymous types gracefully, but as far as I am aware cannot yet work with value-tuples at all. Considering that the intersection of F# developers and ML.NET potential users is high at least when compared to the general population of .NET developer, this may be viewed as a priority.

  3. For explicitly defined (i.e., non-anonymous) classes, it allows one to define something like DataView<MyData>, which has the advantage of being easily expressable as a return value and a parameter to a function. Could someone write something like DataView<(Scalar<bool> label, Vector<float> features)> as a return value or parameter? Yes, absolutely, I suppose, but it's not very pretty. Also one thing I learnt to my surprise is that, at least as of C# 7.3, one cannot merely do something like using Foo = (int bar, float biz), so even a shorthand based on type aliases even within a single file is not possible.

Currently the set of allowed "schema shape" types is the following:

  1. Any of the public subclasses of PipelineColumn.
  2. A value-tuple whose items are of an allowed "schema shape" type.

What I am proposing is that we expand this definition to include:

  1. Classes whose only publicly accessible members are properties that are one of the basic abstract subtypes of PipelineColumn and a single public constructor in one of the following two subtypes:

    • The properties are get-only, and the single constructor has parameters where a one-to-one correspondence with each of the properties, where the parameter and property have identical types, and this constructor sets the properties, and no other work.

    • The properties have get-set accessors, and the single construct has no arguments. (Note that this can be a default constructor.)

    In each case, whether the property is set via constructor parameter or just by assigning it, the property should always be exactly the object that was either passed in or assigned.

Note that due to the way the validation code works, and for several practical reasons, it is best and easiest to allow a mix of these types -- that is, just as you can have nested value-tuples, you could have, for example, a value-tuple nested in a class, nested in a value-tuple.

I am considering the wisdom of allowing fields. On the one hand, we allow fields on the classes for PredictionEngine. These of course cannot be the same object (one describes values of a column, whereas one describes the properties of the columns themselves and how they are meant to be configured), but it may be that allowing fields in one but not the other may lead to confusion. Not sure about this. For now I am thinking to only allow properties, and if someone complains we can consider adding it then.

Work

We might identify five things that should be done. The first three should probably be in one PR. The fourth can happen in a subsequent PR. The fifth is sort of a "nice to have" but strictly speaking unnecessary. One of those things to be done in someone's copious spare time.

  1. The methods of StaticPipeInternalUtils should be expanded to also accommodate the aforementioned types.

  2. The code analyzer that checks the type parameters with IsShapeAttribute should be similarly expanded.

  3. In several places throughout the code base the generic type parameter has Tuple in its name, e.g., TTupleShape, TTupleInShape, and so forth. These ought to be changed to things like TShape or TInShape and the like to make them less specific to tuples.

  4. Part of the reasons why we are doing this is to make the structure work better in F#. F# support for value tuples is limited. This suggests that those so-called "pigstentions" where we return a set of columns (e.g., binary classification), where we currently have a value-tuple (e.g., (Scalar<float> score, Scalar<float> probability, Scalar<bool> predictedLabel)) we may be better served by instead having a definite type, BinaryClassificationOutput looking something like this:

    public sealed BinaryClassificationOutput
    {
       public Scalar<float> Score { get; }
       public Scalar<float> Probability { get; }
       public Scalar<bool> PredictedLabel { get; }
    
       public BinaryClassificationOutput(...) { ... }
    }
  5. Having a user responsible for defining the types is fairly obnoxious in its own right. So, we might imagine a code action through a Roslyn code-fix provider that can take an Estimator<T> where T is some sort of inline type (either value-tuple or anonymous class), based on that class defines an explicit type (by default perhaps some private nested class), and then uses it to provide Estimator<T2> over this new hypothetical type T2. This would be relatively easy for the user to call, and then they could modify it as they like. (E.g., make it public, move it somewhere else, add XML docs, and so forth.)

dsyme commented 6 years ago

Cc @cartermp this is a rationale for anonymous types in F# 5.0

cartermp commented 6 years ago

@dsyme Anonymous types separately from anonymous records, or just the latter?

Also @TomFinley:

but as far as I am aware cannot yet work with value-tuples at all

F# can work with value tuples today - let (x, y) = struct (1, 2) - but I agree with your rationale for not just constraining the value tuples 👍

dsyme commented 6 years ago

@cartermp The issue is that F# tuples are positional, and don't used named meta data.

Yes, I meant anonymous records. This kind of strongly typed data scripting is one of the rationales. We should go ahead with the feature and validate it using this work

cartermp commented 6 years ago

Makes total sense. I'll link this issue in the RFC issue for Anonymous Records and see if there is anything to add in the motivation section of the RFC itself.