Unexpected behavior using multiple pipelines for training vs one

zeraphil commented 4 years ago

System information

OS version/distro:
.NET Version (eg., dotnet --info):

Issue

-What did I do?

Instead of using one training pipeline, I split it: I fit the data-processing pipeline, transformed the data into a featurized IDataView, and used that to train a classifier.

-What happened?

Classifier accuracy was about 50%, versus about 98.9% when the training pipeline is one complete chain.

-What did I expect?

I expected a model trained with the full TransformerChain pipeline and a model trained on trainingData transformed by the same TransformerChain to have similar accuracy.

Source code / logs

This is what I mean by splitting the pipeline.

var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText("WordFeatures", "Transcript")
    .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"));

IDataView transformedTrainingData = dataProcessPipeline.Fit(trainingDataView).Transform(trainingDataView);

var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(labelColumnName: "Label", featureColumnName: "Features")
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));

ITransformer model = trainer.Fit(transformedTrainingData);

I understand this to be conceptually similar to

var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText("WordFeatures", "Transcript")
    .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"));

var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(labelColumnName: "Label", featureColumnName: "Features")
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));

var trainingPipeline = dataProcessPipeline.Append(trainer);

ITransformer model = trainingPipeline.Fit(trainingDataView);

The first gives me ~50% accuracy on the test set, the second ~98.9%. Is my understanding incorrect? Am I missing a step when doing things this way? My goal is to create a single Transformer for featurizing data that can then be used by more than one model, without having to build multiple models that each featurize the data in exactly the same way.
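For reference, the reuse goal described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the attached project and not a diagnosis of the accuracy gap: the `Row` class and the toy rows are hypothetical, `FeaturizeText` writes straight to a "Features" column (the original snippets use "WordFeatures"), and a `MapValueToKey` step is added so the string label becomes a key type. The point it shows is that the featurizer is fitted once, and that same fitted transformer is applied to both the training data and any test data before scoring.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

// Hypothetical input schema, used only for this sketch.
public class Row
{
    public string Label { get; set; }
    public string Transcript { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Toy data standing in for the real training and test sets (assumption).
        var rows = new List<Row>
        {
            new Row { Label = "bug", Transcript = "app crashes on start" },
            new Row { Label = "bug", Transcript = "crash when saving file" },
            new Row { Label = "feature", Transcript = "please add dark mode" },
            new Row { Label = "feature", Transcript = "add export to csv" },
        };
        IDataView trainingDataView = mlContext.Data.LoadFromEnumerable(rows);
        IDataView testDataView = mlContext.Data.LoadFromEnumerable(rows);

        // Fit the featurizer once, on the training data only.
        var featurizer = mlContext.Transforms.Conversion.MapValueToKey("Label")
            .Append(mlContext.Transforms.Text.FeaturizeText("Features", "Transcript"));
        ITransformer fittedFeaturizer = featurizer.Fit(trainingDataView);
        IDataView featurizedTrain = fittedFeaturizer.Transform(trainingDataView);

        // Train more than one model on the same featurized view.
        ITransformer sdcaModel = mlContext.MulticlassClassification.Trainers
            .SdcaMaximumEntropy(labelColumnName: "Label", featureColumnName: "Features")
            .Fit(featurizedTrain);
        ITransformer lbfgsModel = mlContext.MulticlassClassification.Trainers
            .LbfgsMaximumEntropy(labelColumnName: "Label", featureColumnName: "Features")
            .Fit(featurizedTrain);

        // Test data goes through the SAME fitted featurizer (not a re-fitted one)
        // before either model scores it.
        IDataView featurizedTest = fittedFeaturizer.Transform(testDataView);
        var sdcaMetrics = mlContext.MulticlassClassification.Evaluate(sdcaModel.Transform(featurizedTest));
        var lbfgsMetrics = mlContext.MulticlassClassification.Evaluate(lbfgsModel.Transform(featurizedTest));

        Console.WriteLine($"SDCA macro accuracy:   {sdcaMetrics.MacroAccuracy}");
        Console.WriteLine($"L-BFGS macro accuracy: {lbfgsMetrics.MacroAccuracy}");
    }
}
```

When comparing the split and chained setups, one thing worth checking is that evaluation in the split case transforms the test set with the featurizer fitted on the training data, since the trainer-only model expects a "Features" column produced by that exact transformer.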

Hopefully the example makes sense. I can clarify if not.

gvashishtha commented 4 years ago

I understand your problem. Can you share some example code and sample data so I can attempt to reproduce this issue?

zeraphil commented 4 years ago

Thank you. I've reworked the issue using the GitHub multiclass classification problem from the tutorial. I can supply the code so that you can drop it straight into that solution and see if there's anything glaring I'm missing.

Supplying said project here. (one moment while I post it)

mstfbl commented 4 years ago

Hey @Zeraphil , please provide sample code from your project, so that we can attempt to replicate the issue. Thanks!

zeraphil commented 4 years ago

TextClassification.zip

Hi, I've supplied the project with the code, but on closer inspection it performs fine with the test dataset. I don't think I can share my source dataset yet, so the dataset is probably the cause, though it seems very bizarre that it would be. I'm sharing the code anyway in case there's anything incorrect or missing in how I've set up the pipelines as described.

harishsk commented 4 years ago

@Zeraphil It appears that you are saying it works okay so far and you don't have a repro, so I am closing the issue. Please share the dataset and reopen the issue if you manage to reproduce it again.

dotnet / machinelearning