dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.02k stars 1.88k forks source link

Unexpected behavior using multiple pipelines for training vs one #5033

Closed zeraphil closed 4 years ago

zeraphil commented 4 years ago

System information

Issue

-What did I do?

Source code / logs

This is what I mean by splitting the pipeline.

var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText("WordFeatures", "Transcript") .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"));

IDataView transformedTrainingData = dataProcessPipeline.Fit(trainingDataView).Transform(trainingDataView);

var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(labelColumnName:"Label", featureColumnName: "Features") .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));

ITransformer model = trainer.Fit(transformedTrainingData );

I understand this to be conceptually similar to

var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText("WordFeatures", "Transcript") .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"));

       var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(labelColumnName: "Label", featureColumnName: "Features")
                                  .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));
        var trainingPipeline = dataProcessPipeline.Append(trainer);`

ITransformer model =trainingPipeline .Fit(trainingDataView);

The first gives me ~50% accuracy on the test set, the second ~98.9%. Is my understanding incorrect? Is there a missing step on doing things this way? My goal is to create a single Transformer for featurizing data that can then be used in more than one model, without having to build multiple models that featurize the data in the exact same way.

Hopefully the example makes sense. I can clarify if not.

gvashishtha commented 4 years ago

I understand your problem, can you share some example code and sample data so I can attempt to reproduce this issue?

zeraphil commented 4 years ago

Thank you. I've reworked the issue using the Github multiclass classification problem from the tutorial. I can supply the code so that you can drop straight into that solution and see if there's anything glaring I'm missing.

Supplying said project here. (one moment while I post it)

mstfbl commented 4 years ago

Hey @Zeraphil , please provide sample code from your project, so that we can attempt to replicate the issue. Thanks!

zeraphil commented 4 years ago

TextClassification.zip

Hi, I've supplied the project with the code, but on closer inspection this performs fine with the test dataset. I can't share my source dataset yet I think, so the dataset is probably the reason, but it seems very bizarre that that would be the reason. I'm sharing the code anyways to see if there's anything incorrect or that I'm missing about setting up the pipelines as I've described.

harishsk commented 4 years ago

@Zeraphil It appears that are saying that it works okay so far and you don't have a repro. I am closing the issue. Please share the dataset and reopen the issue if you happen to reproduce the issue again.