dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.02k stars 1.88k forks source link

Why do I need to add a property to my class to get Score? #1991

Closed pekspro closed 5 years ago

pekspro commented 5 years ago

System information

Issue

In the GitHubLabeler sample application the input data is defined like this:

internal class GitHubIssue
{
    public string ID;
    public string Area; // This is an issue label, for example "area-System.Threading"
    public string Title;
    public string Description;
}

And the class to get the prediction is defined like this:

internal class GitHubIssuePrediction
{
    [ColumnName("PredictedLabel")]
    public string Area;
}

I think it makes sense that you need to define the prediction class yourself since I guess it could contain any kinds of properties and types.

If you have the model loaded and a model prediction function created it’s easy to get a prediction:

GitHubIssue singleIssue = new GitHubIssue() { ID = "Any-ID", Title = "Entity Framework crashes", Description = "When connecting to the database, EF is crashing" };

//Predict label for single hard-coded issue
//Score
var prediction = _predFunction.Predict(singleIssue);

So far everything makes sense. But if you want to get scores of the prediction you need to change the prediction class to:

internal class GitHubIssuePrediction
{
    [ColumnName("PredictedLabel")]
    public string Area;

    [ColumnName("Score")]
    public float[] Score { get; set; }
}

Personally, I find this to be very unintuitive. Can’t PredictionFunction.Predict instead return an object like this instead?

public sealed class PredictionResult<TDst>
    where TDst : class, new()
{
    public float[] Score { get; set; }

    public TDst Prediction { get; set; }
}

Where TDst would be GitHubIssuePrediction in this specific case. Or maybe something more fancy like this:

public sealed class PredictionResult<TDst>
    where TDst : class, new()
    {
        public float PredictionScore { get; set; }

        public TDst Prediction { get; set; }

        public float[] ScoresForAllCategories { get; set; }
    }

As mentioned in #1881, it would be very useful to map a score to a specific category too.

Zruty0 commented 5 years ago

I think it makes sense that you need to define the prediction class yourself since I guess it could contain any kinds of properties and types.

This is completely correct, and, in fact, this is THE reason we made our prediction API this way.

Think of a model as a black box that takes a 'property bag' (the 'example'), and produces a different 'property bag' (same example, but with extra fields added that we assume are 'predictions').

If ML.NET was written in JavaScript, or Python, we would encode these above 'property bags' as 'objects'. But, because we are operating in a strongly typed language, we went further and encoded them as proper classes (and we use efficient codegen to elide type checks and populate the necessary fields lightning-fast).

But how can we assume the presence of float[] Scores in ANY output? This contradicts all that was said above! The existence of float[] Scores assumes an extremely specific kind of 'model': the "prediction" model that produces some form of float vector. If we adopted this kind of API, we would preclude a lot of scenarios that are possible right now:

Given that all of the above is possible using the existing API, I think the necessity to add public float[] Score to your output class (which you have to create ANYWAY) a small price to pay.

pekspro commented 5 years ago

Thanks for your reply. This is a new world for me so I’m not sure what is going on inside all of this so it’s really interesting :-)

I totally agree that it’s not a major thing to add a Score property to your class. But that assumes you know that should do this :-) I still see some room for improvements. Since I don’t add the Score property to my input it is a bit weird to add it to the output.

I totally understand that this is not possible to have a Score propery for every model. But wouldn’t it be possible to have a custom prediction function for each “learner”, or whatever thing that adds this property? Maybe some function called MakeMulticlassClassificationPredictionFunction which creates a function that unwraps "fixed" properties that is added by the learner.

I guess I’m mixing up different concepts here, but I hopes it’s clear what I’m asking for :-). This might also solve problems like #1881.

charlesroddie commented 5 years ago

@Zruty0 Is it possible to use ML.Net without the annotation/reflection approach? This would give more type safety and compiler helpfulness.

This would involve using ML.Net objects instead of own annotated objects (similarly to how manual serialization would convert an object to a JsonObject or a CBORObject prior to serialization).

rogancarr commented 5 years ago

@TomFinley or @eerhardt ?

eerhardt commented 5 years ago

@charlesroddie - yes! That is completely possible. It is a little more work, as the reflection approach is the "convenience", but you can easily step one level down and use the IDataView interface to interact with your data - both data going in and coming out of a pipeline.

In this example, we are using ML.NET with Unity. Unity doesn't support System.Reflection.Emit, which ML.NET uses to set the properties on your custom object. So instead, this code is using IDataView directly:

https://github.com/dotnet/machinelearning-samples/blob/2c0d44bcef5f8fe5eda629c979a38d1de8a7fa13/samples/csharp/end-to-end-apps/Unity-HelloMLNET/HelloMLNET/Assets/Scenes/HelloML.cs#L52-L67

Note that the crux of this code is:

var transformedIssues = loadedModel.Transform(testissues);

The Transform method takes in an IDataView, does the transformation (in this case turns the input text into a bool whether it is positive for negative) and then outputs an IDataView.

However, note, you still need to know which columns you are looking for - PredictedLabel and/or Score. But the advantage to the IDataView approach is that you can read its Schema and see all the columns available.

IDataView is the core Data type in ML.NET. You can read about it here: https://github.com/dotnet/machinelearning/blob/master/docs/code/MlNetHighLevelConcepts.md#list-of-high-level-concepts

TomFinley commented 5 years ago

@charlesroddie - yes! That is completely possible. It is a little more work, as the reflection approach is the "convenience", but you can easily step one level down and use the IDataView interface to interact with your data - both data going in and coming out of a pipeline.

This may be something for @sfilipi, @rogancarr, and @shmoradims to look for as as they do samples and documentation, is make sure there's sufficient mention of IDataView.

The question by @charlesroddie though seems to be different though. He's (correct me if I'm wrong @charlesroddie) asking for a non-reflection driven approach that offers more type safety and checks. What makes me think this is the words, "without the annotation/reflection approach? This would give more type safety and compiler helpfulness." IDataView provides less of that, not more -- it is not reflection driven per se, but it is no more type safe than Dictionary<string, object>.

Regarding more type safety, this was the intent behind the so-called "static API" affectionately named "PiGSTy", which we do have some samples for, but it is strongly experimental work that some people dislike strongly due to it using delegates and whatnot to capture operations on data. So for v1 that work will still be very experimental, until we decide how we can improve it, or else people change their minds and suppose chained delegates to represent a data flow isn't actually such a bad idea after all. 😄

codemzs commented 5 years ago

https://github.com/dotnet/machinelearning/tree/master/docs/samples/Microsoft.ML.Samples/Dynamic