pekspro closed this issue 5 years ago
I think it makes sense that you need to define the prediction class yourself since I guess it could contain any kinds of properties and types.
This is completely correct, and, in fact, this is THE reason we made our prediction API this way.
Think of a model as a black box that takes a 'property bag' (the 'example'), and produces a different 'property bag' (same example, but with extra fields added that we assume are 'predictions').
If ML.NET was written in JavaScript, or Python, we would encode these above 'property bags' as 'objects'. But, because we are operating in a strongly typed language, we went further and encoded them as proper classes (and we use efficient codegen to elide type checks and populate the necessary fields lightning-fast).
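As a rough illustration of that 'property bag' encoding (all class and property names here are invented, not taken from any sample):

```csharp
// The input 'property bag': just the fields the caller provides.
public class SentimentExample
{
    public string Text { get; set; }
}

// The output 'property bag': the same example, with predicted fields added.
public class SentimentPrediction
{
    public bool PredictedLabel { get; set; }
    public float Probability { get; set; }
}
```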
But how can we assume the presence of `float[] Scores` in ANY output? This contradicts all that was said above! The existence of `float[] Scores` assumes an extremely specific kind of 'model': the "prediction" model that produces some form of float vector. If we adopted this kind of API, we would preclude a lot of scenarios that are possible right now:
Given that all of the above is possible using the existing API, I think the necessity to add `public float[] Score` to your output class (which you have to create ANYWAY) is a small price to pay.
Thanks for your reply. This is a new world for me and I’m not sure what is going on inside all of this, so it’s really interesting :-)
I totally agree that it’s not a major thing to add a Score property to your class. But that assumes you know that you should do this :-) I still see some room for improvement. Since I don’t add the Score property to my input, it is a bit weird to add it to the output.
I totally understand that it’s not possible to have a Score property for every model. But wouldn’t it be possible to have a custom prediction function for each “learner”, or whatever thing it is that adds this property? Maybe some function called `MakeMulticlassClassificationPredictionFunction`, which creates a function that unwraps the "fixed" properties that are added by the learner.
I guess I’m mixing up different concepts here, but I hope it’s clear what I’m asking for :-). This might also solve problems like #1881.
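To make the request concrete, the kind of API being asked for might look something like this (purely hypothetical; none of these names exist in ML.NET):

```csharp
// Hypothetical: a task-specific factory that knows which 'fixed' columns
// the learner adds, so the caller never has to declare Score themselves.
var predictor = model.MakeMulticlassClassificationPredictionFunction<GitHubIssue>(mlContext);

var prediction = predictor.Predict(issue);
string predictedLabel = prediction.PredictedLabel; // surfaced by the API
float[] scores = prediction.Score;                 // likewise
```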
@Zruty0 Is it possible to use ML.Net without the annotation/reflection approach? This would give more type safety and compiler helpfulness.
This would involve using ML.Net objects instead of your own annotated objects (similarly to how manual serialization would convert an object to a JsonObject or a CBORObject prior to serialization).
@TomFinley or @eerhardt ?
@charlesroddie - yes! That is completely possible. It is a little more work, as the reflection approach is the "convenience", but you can easily step one level down and use the `IDataView` interface to interact with your data - both data going in and coming out of a pipeline.
In this example, we are using ML.NET with Unity. Unity doesn't support `System.Reflection.Emit`, which ML.NET uses to set the properties on your custom object. So instead, this code is using `IDataView` directly:
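A minimal sketch of that direct `IDataView` usage (illustrative only; the linked Unity example may differ in detail):

```csharp
// Sketch: assumes 'loadedModel' (an ITransformer) and 'testissues'
// (an IDataView) were created earlier, as in the linked example.
IDataView transformedIssues = loadedModel.Transform(testissues);

// Read a produced column without declaring any custom prediction class.
// GetColumn<T> is an extension method on IDataView in recent ML.NET versions.
foreach (bool predictedLabel in transformedIssues.GetColumn<bool>("PredictedLabel"))
{
    Console.WriteLine(predictedLabel);
}
```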
Note that the crux of this code is:
```csharp
var transformedIssues = loadedModel.Transform(testissues);
```
The `Transform` method takes in an `IDataView`, does the transformation (in this case, turns the input text into a bool indicating whether it is positive or negative), and then outputs an `IDataView`.
However, note that you still need to know which columns you are looking for - `PredictedLabel` and/or `Score`. But the advantage of the `IDataView` approach is that you can read its `Schema` and see all the columns available.
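For instance, discovering the available columns might look like this (a sketch; the schema is enumerable on any `IDataView`):

```csharp
// Print every column the pipeline produced, along with its type.
foreach (var column in transformedIssues.Schema)
{
    Console.WriteLine($"{column.Name}: {column.Type}");
}
```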
`IDataView` is the core data type in ML.NET. You can read about it here: https://github.com/dotnet/machinelearning/blob/master/docs/code/MlNetHighLevelConcepts.md#list-of-high-level-concepts
> @charlesroddie - yes! That is completely possible. It is a little more work, as the reflection approach is the "convenience", but you can easily step one level down and use the `IDataView` interface to interact with your data - both data going in and coming out of a pipeline.
This may be something for @sfilipi, @rogancarr, and @shmoradims to look at as they do samples and documentation: make sure there's sufficient mention of `IDataView`.
The question by @charlesroddie seems to be different, though. He's (correct me if I'm wrong, @charlesroddie) asking for a non-reflection-driven approach that offers more type safety and checks. What makes me think this is the words, "without the annotation/reflection approach? This would give more type safety and compiler helpfulness." `IDataView` provides less of that, not more -- it is not reflection driven per se, but it is no more type safe than `Dictionary<string, object>`.
Regarding more type safety: this was the intent behind the so-called "static API", affectionately named "PiGSTy", which we do have some samples for. But it is strongly experimental work that some people dislike strongly, due to its use of delegates and whatnot to capture operations on data. So for v1 that work will still be very experimental, until we decide how we can improve it, or else people change their minds and decide that chained delegates to represent a data flow isn't actually such a bad idea after all. 😄
Issue
In the GitHubLabeler sample application the input data is defined like this:
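(For context, the input class in the GitHubLabeler sample is roughly the following; reproduced from memory, so check the sample for the exact attributes and properties:)

```csharp
public class GitHubIssue
{
    [LoadColumn(0)] public string ID { get; set; }
    [LoadColumn(1)] public string Area { get; set; }        // the label to predict
    [LoadColumn(2)] public string Title { get; set; }
    [LoadColumn(3)] public string Description { get; set; }
}
```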
And the class to get the prediction is defined like this:
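(Roughly like this; again from memory, the actual sample may differ:)

```csharp
public class GitHubIssuePrediction
{
    // Maps the model's PredictedLabel output column onto a friendlier name.
    [ColumnName("PredictedLabel")]
    public string Area { get; set; }
}
```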
I think it makes sense that you need to define the prediction class yourself since I guess it could contain any kinds of properties and types.
If you have the model loaded and a model prediction function created it’s easy to get a prediction:
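(Along these lines, where `predictionFunction` stands in for whatever prediction function was created from the loaded model:)

```csharp
var issue = new GitHubIssue
{
    Title = "Crash when loading a model",
    Description = "The app throws an exception when the model file is loaded."
};

GitHubIssuePrediction prediction = predictionFunction.Predict(issue);
Console.WriteLine(prediction.Area); // the predicted label
```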
So far everything makes sense. But if you want to get scores of the prediction you need to change the prediction class to:
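(That is, something like the following; the exact class is elided above, but the key addition is the `Score` array:)

```csharp
public class GitHubIssuePrediction
{
    [ColumnName("PredictedLabel")]
    public string Area { get; set; }

    // Added solely to receive the per-class scores from the model.
    public float[] Score { get; set; }
}
```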
Personally, I find this to be very unintuitive. Can’t `PredictionFunction.Predict` instead return an object like this?

Where `TDst` would be `GitHubIssuePrediction` in this specific case. Or maybe something more fancy like this:

As mentioned in #1881, it would be very useful to map a score to a specific category too.
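To make the suggestion concrete, a hypothetical wrapper might look like this (all names invented for illustration, not an existing ML.NET type):

```csharp
// Hypothetical: Predict could return the user's class wrapped together
// with the 'fixed' columns the learner produced.
public class Prediction<TDst>
{
    public TDst Result { get; set; }    // e.g. GitHubIssuePrediction
    public float[] Score { get; set; }  // surfaced without declaring it on TDst
}
```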