dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

Predict expects the Label as input #3063

Open singlis opened 5 years ago

singlis commented 5 years ago

Issue

When calling Predict, our Predict method takes in the same input as is used for the training pipeline. This is a bit "odd", as we force the user to define a "Label" variable that serves no purpose and is not needed for the output.

Using the example from #3037, we have something like this:

    let predictor = mlContext.Model.CreatePredictionEngine(transformer)
    let prediction:Prediction = predictor.Predict({Area=0; Price = 209000})

Here Area is our "Label"; because it is required by the pipeline, we have to add it in as part of the input.
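
In C# terms, the shape the user is forced into looks roughly like this (a hypothetical type mirroring the F# record above, not taken from #3037):

    // Hypothetical prediction input: "Area" is the value being predicted, yet it must
    // be declared here because the pipeline's input schema requires a matching column.
    public class HouseInput
    {
        public float Area { get; set; }   // dummy value at prediction time, e.g. 0
        public float Price { get; set; }
    }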

Could our pipeline change to only consume the data that is needed to do the prediction? And ideally have something like this:

    let prediction:Prediction = predictor.Predict(209000)

cc @glebuk for any additional comments.

TomFinley commented 5 years ago

> When calling Predict, our Predict method takes in the same input as is used for the training pipeline. This is a bit "odd", as we force the user to define a "Label" variable that serves no purpose and is not needed for the output.

Presumably only if you have transformers that are consuming Label. The reason we have a composable, separable data pipeline is that you're free to drop whatever transformations you want from it at any time. (Of course, it is your responsibility to make sure the result works well together.)

In fact @Zruty0 even devised a convenience for this case. If you have such a transformer, then when building the pipeline with the .Append method, you change the scope argument from its default of Everything to TrainTest. Then, when you fit and transform, if you want the subset of the chain excluding whatever you marked as only relevant to training and testing, you call the GetModelFor(TransformerScope.Scoring) method on the resulting TransformerChain.

This however is not documented. The two key methods here have absolutely no XML documentation. Our samples have usage of TransformerScope.TrainTest, but no usage of the model filtering, and the cookbook makes no mention of it.

https://github.com/dotnet/machinelearning/blob/db4ecc0135baffc8930201a85fb3e9a101cdfa46/src/Microsoft.ML.Data/DataLoadSave/EstimatorChain.cs#L87

https://github.com/dotnet/machinelearning/blob/db4ecc0135baffc8930201a85fb3e9a101cdfa46/src/Microsoft.ML.Data/DataLoadSave/TransformerChain.cs#L145
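
For concreteness, a minimal sketch of the pattern (assuming a regression scenario over hypothetical Area/Price columns; the normalization step and SDCA trainer are illustrative, not taken from this issue):

    using Microsoft.ML;
    using Microsoft.ML.Data;

    public static class ScopeFilteringSketch
    {
        // Trains a hypothetical pipeline where "Area" plays the role of the label and
        // "Price" is the only feature, then strips the training-only step so the
        // returned model can score rows that do not carry an "Area" column.
        public static ITransformer TrainScoringModel(MLContext mlContext, IDataView trainingData)
        {
            var pipeline = mlContext.Transforms.Concatenate("Features", "Price")
                .Append(mlContext.Transforms.NormalizeMinMax("Features"))
                // Scope the label-producing step to TrainTest so it can be filtered out later.
                .Append(mlContext.Transforms.CopyColumns("Label", "Area"), TransformerScope.TrainTest)
                .Append(mlContext.Regression.Trainers.Sdca());

            // Every step, including the TrainTest-scoped one, still runs during Fit.
            var model = pipeline.Fit(trainingData);

            // Keep only the transformers relevant at scoring time; the CopyColumns step is
            // dropped, so a prediction engine over the result no longer demands a dummy "Area".
            return model.GetModelFor(TransformerScope.Scoring);
        }
    }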

Of course, now that we're approaching stabilization of the API, we will, I suppose, work on improving the documentation. This is obviously something that must be done in general, but this is an important enough situation that it definitely warrants special consideration, no doubt its own section in the documentation.

Assuming I understand the issue correctly, maybe we can change this from being an enhancement to documentation. But I'll leave that to you @singlis; possibly I've misunderstood the point.

singlis commented 5 years ago

Thanks @TomFinley - I agree with you that documentation would help, in addition to it having its own section. I created issue #3098 to help track the documentation efforts, but left this issue open for further discussion.

While that solution works, the user would need additional knowledge of our API to know what they are doing -- it is not something that would come naturally. We are setting a mask (scope filter) on each transformer that is appended in the chain. This mask defaults to Everything unless it is overridden in the Append call (which you show in the signature above). The user would have to be aware of the scope filter flag, know which transformer to set the filter on, and know what value to set it to.

Documentation would definitely help - but I do wonder how often this feature would be used rather than simply reusing the same input class with a default value for Label.

Could we determine which transforms generate the Label column and filter them out for the user? Or are there other scenarios that would be broken by a feature like this?

TomFinley commented 5 years ago

I look at it very differently. We have a very clear, explicit chain of being told what to do: we have deliberately added this thing to a chain when we didn't have to, then we've stated that we wanted a prediction engine over this chain. I have provided what I think are two very reasonable ways of resolving this "problem."

  1. Don't insert it into the pipeline in the first place, which is probably the simplest option (see the sketch after this list), or

  2. If we insist on inserting it into the pipeline, utilize the convenience to filter it out later when appropriate.
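
To illustrate option 1 under the same hypothetical Area/Price scenario: if the label column comes straight from the loaded data (e.g. via a ColumnName attribute) rather than from a transform inside the chain, nothing in the fitted chain requires it at scoring time. A minimal sketch, with illustrative names:

    using Microsoft.ML;
    using Microsoft.ML.Data;

    // Hypothetical training row: Area is mapped to the "Label" column at load time,
    // so no label-producing transform has to live in the pipeline itself.
    public class HouseTrainingRow
    {
        [LoadColumn(0), ColumnName("Label")]
        public float Area { get; set; }

        [LoadColumn(1)]
        public float Price { get; set; }
    }

    public static class NoLabelInPipelineSketch
    {
        public static ITransformer Train(MLContext mlContext, string trainingDataPath)
        {
            IDataView trainingData = mlContext.Data.LoadFromTextFile<HouseTrainingRow>(
                trainingDataPath, separatorChar: '\t');

            // The pipeline only touches feature columns; the trainer reads "Label"
            // during Fit, but the fitted chain does not need it when scoring.
            var pipeline = mlContext.Transforms.Concatenate("Features", "Price")
                .Append(mlContext.Regression.Trainers.Sdca());

            return pipeline.Fit(trainingData);
        }
    }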

You're proposing a third option, but I'm less excited about that one.

  3. Despite the explicit usage of our API to perform this step, rather than fail when we detect missing columns, we insert some logic that detects (through some undefined mechanism) whether the user was "kidding" when they told us to perform these steps, and effectively rewrites their code so that it does something different from what they asked us to do. (Of course, we'd want to still fail if we determine that the user was not "kidding.")

I've never actually seen an attempt in programming languages or APIs to "do what I mean, not what I said" work -- you can kind of get away with it in tools where the user can provide immediate feedback, but, when doing actual software development, I haven't seen it work. Well-intentioned attempts to do so have consistently failed in ML.NET, to the point where we eventually just threw up our hands and said, never again. Making the code comprehensible and readable becomes really, really hard when there's a lot of "magic" happening to the user's data that they can't see. Code that by first basic principles shouldn't work suddenly starting to work doesn't mean you've made things easier for a user; far from it.

I could potentially see a suggestion from the Roslyn code analyzer under appropriate circumstances, but that's about the extent of it. The message this hypothetical Roslyn analyzer might make is: "It looks like you're transforming a label column. Would you like help?" (Yes, I stole your joke Scott. :) ) But I can see many ways in which such a system would be imperfect at best.

nicolehaugen commented 5 years ago

@colbylwilliams and I also ran into this issue, since it is confusing that a prediction would require a label. For example, I am using the below code to do a batch prediction - if the SearchResultData does not contain a label, you get an exception complaining that the Label is missing:

    System.ArgumentOutOfRangeException: 'Could not find input column 'Label' Parameter name: inputSchema'

I wouldn't expect the Label column to be required, since it should only be needed for training/evaluating the data in supervised learning scenarios. Requiring it is confusing because the Label value is what is being predicted.


Example code:

    // Load test data and use the model to perform predictions on it.
    IDataView testData = mlContext.Data.LoadFromTextFile<SearchResultData>(testDatasetPath, separatorChar: '\t', hasHeader: false);

    // Load the model.
    DataViewSchema predictionPipelineSchema;
    ITransformer predictionPipeline = mlContext.Model.Load(modelPath, out predictionPipelineSchema);

    // Predict rankings.
    IDataView predictions = predictionPipeline.Transform(testData);
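
For what it's worth, the scope-filtering workaround discussed above can also be applied before the model is saved, so that the loaded pipeline never asks for "Label". This assumes the label-producing step was appended with TransformerScope.TrainTest, as in the earlier sketch; the names here are illustrative:

    using Microsoft.ML;
    using Microsoft.ML.Data;

    public static class SaveScoringModelSketch
    {
        // fittedChain: the TransformerChain produced by Fit on a pipeline whose
        // label-producing step was appended with TransformerScope.TrainTest.
        public static void SaveForScoring<TLast>(MLContext mlContext,
            TransformerChain<TLast> fittedChain, DataViewSchema inputSchema, string modelPath)
            where TLast : class, ITransformer
        {
            // Drop the training-only steps before persisting the model.
            ITransformer scoringModel = fittedChain.GetModelFor(TransformerScope.Scoring);

            // Because the training-only transformers are not part of the saved chain,
            // loading it and calling Transform on unlabeled test data should not throw
            // the "Could not find input column 'Label'" exception quoted above.
            mlContext.Model.Save(scoringModel, inputSchema, modelPath);
        }
    }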

@CESARDELATORRE - interested to know your thoughts on this.