Closed zeraphil closed 4 years ago
I understand your problem. Can you share some example code and sample data so I can attempt to reproduce this issue?
Thank you. I've reworked the issue using the GitHub multiclass classification problem from the tutorial. I can supply the code so that you can drop it straight into that solution and see if there's anything glaring I'm missing.
Supplying said project here. (one moment while I post it)
Hey @Zeraphil, please provide sample code from your project so that we can attempt to replicate the issue. Thanks!
Hi, I've supplied the project with the code, but on closer inspection it performs fine with the test dataset. I don't think I can share my source dataset yet, so the dataset is probably the cause, though it seems very bizarre that it would be. I'm sharing the code anyway to see if there's anything incorrect, or anything I'm missing, in how I set up the pipelines as described.
@Zeraphil It appears that you are saying it works okay so far and you don't have a repro, so I am closing the issue. Please share the dataset and reopen the issue if you manage to reproduce the problem again.
System information
OS version/distro:
.NET Version (eg., dotnet --info):
Issue
-What did I do?
Source code / logs
This is what I mean by splitting the pipeline.
var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText("WordFeatures", "Transcript")
    .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"));
IDataView transformedTrainingData = dataProcessPipeline.Fit(trainingDataView).Transform(trainingDataView);
var trainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(labelColumnName: "Label", featureColumnName: "Features")
    .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));
ITransformer model = trainer.Fit(transformedTrainingData);
I understand this to be conceptually similar to:
var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText("WordFeatures", "Transcript")
    .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"));
// trainer is the same estimator chain as in the first snippet
var trainingPipeline = dataProcessPipeline.Append(trainer);
ITransformer model = trainingPipeline.Fit(trainingDataView);
The first gives me ~50% accuracy on the test set, the second ~98.9%. Is my understanding incorrect? Is there a step missing when doing things this way? My goal is to create a single transformer for featurizing data that can then be used by more than one model, without having to build multiple models that each featurize the data in exactly the same way.
Hopefully the example makes sense. I can clarify if not.
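For reference, here is a minimal sketch of the goal described above: one featurizer fitted once and shared by two models. It is not a diagnosis of the accuracy gap, only an illustration under assumptions. The `Row` and `Prediction` classes, the `Area` label column, the `SharedFeaturizerDemo` name, the tiny in-memory dataset, and the choice of `LbfgsMaximumEntropy` as a second trainer are all made up for the example and are not from my project. The one thing it tries to make explicit is that the same fitted featurizer (not a newly fitted one) is applied to both the training and the test data, and that the label-to-key mapping lives in that shared featurizer.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical row types; "Area" stands in for whatever the raw label column is.
public class Row
{
    public string Transcript { get; set; }
    public string Area { get; set; }
}

public class Prediction
{
    public string PredictedLabel { get; set; }
}

public static class SharedFeaturizerDemo
{
    public static (string[] Sdca, string[] Lbfgs) Run()
    {
        var mlContext = new MLContext(seed: 0);

        // Tiny made-up dataset, only so the sketch is self-contained.
        var trainRows = new List<Row>
        {
            new Row { Transcript = "cannot connect to the database server", Area = "data" },
            new Row { Transcript = "database query returns wrong rows", Area = "data" },
            new Row { Transcript = "sql connection times out", Area = "data" },
            new Row { Transcript = "button is misaligned on the page", Area = "ui" },
            new Row { Transcript = "page layout breaks on small screens", Area = "ui" },
            new Row { Transcript = "dialog button text is cut off", Area = "ui" },
        };
        var testRows = new List<Row>
        {
            new Row { Transcript = "database connection fails", Area = "data" },
            new Row { Transcript = "button overlaps the page layout", Area = "ui" },
        };

        IDataView trainingDataView = mlContext.Data.LoadFromEnumerable(trainRows);
        IDataView testDataView = mlContext.Data.LoadFromEnumerable(testRows);

        // Featurization only: label-to-key mapping, text features, normalization.
        var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Area")
            .Append(mlContext.Transforms.Text.FeaturizeText("WordFeatures", "Transcript"))
            .Append(mlContext.Transforms.NormalizeMinMax("Features", "WordFeatures"));

        // Fit the featurizer ONCE and keep the fitted transformer around.
        ITransformer featurizer = dataProcessPipeline.Fit(trainingDataView);
        IDataView transformedTrainingData = featurizer.Transform(trainingDataView);
        // Reuse the same fitted featurizer on the test data; do not call Fit again.
        IDataView transformedTestData = featurizer.Transform(testDataView);

        // Two trainers that both consume the shared "Label" and "Features" columns.
        var sdca = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features")
            .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));
        var lbfgs = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features")
            .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

        ITransformer sdcaModel = sdca.Fit(transformedTrainingData);
        ITransformer lbfgsModel = lbfgs.Fit(transformedTrainingData);

        return (Predict(mlContext, sdcaModel, transformedTestData),
                Predict(mlContext, lbfgsModel, transformedTestData));
    }

    // Scores already-featurized data and reads back the predicted label strings.
    private static string[] Predict(MLContext mlContext, ITransformer model, IDataView featurizedData) =>
        mlContext.Data.CreateEnumerable<Prediction>(model.Transform(featurizedData), reuseRowObject: false)
            .Select(p => p.PredictedLabel)
            .ToArray();
}
```

Calling `SharedFeaturizerDemo.Run()` returns one predicted label per test row for each model. If a single object is needed for saving or for scoring raw rows, `featurizer.Append(sdcaModel)` should give a chained transformer that applies both steps in order.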