Closed DFMERA closed 4 years ago
Thanks for providing the information. I have run part of your code (the only difference is that I used the multiclass classification dataset from the ML.NET samples) but was not able to reproduce the error. It would help if you could share how you extract the data from the file, if you are comfortable doing so. Either way, I'll spend time reproducing the error with the dataset you uploaded.
The error appears to be related to HashingEstimator, whose core functionality changed in release 1.5. As a possible workaround, try downgrading to release 1.5 preview 2 and see if that solves the problem.
Thank you for the reply. I'm using this code to extract the data from the file:
var tmpPath = GetAbsolutePath(TRAIN_DATA_FILEPATH);
IDataView trainingDataView = mlContext.Data.LoadFromTextFile(
    path: tmpPath,
    hasHeader: true,
    separatorChar: '\t',
    allowQuoting: true,
    allowSparse: false);
I think the problem is with big datasets like the one I attached. Thanks.
Which ML.NET version are you using? The code you provided does not seem to match our current release.
I just updated to ML.NET 16.0 and .NET Core 3.1. I'm going to create a small, simple project with just the code you need to reproduce the error. I've tried different datasets and different experiment options, and the error is the same in the multiclass experiment.
Update: I truncated your dataset to a few hundred rows and your code works fine, so you may be right that the problem is due to the large dataset. I am now running it on the original dataset and waiting for the results.
CORRECTION: I'm using these versions of ML.NET: "Microsoft.ML" Version="1.5.0" and "Microsoft.ML.AutoML" Version="0.17.0", with .NET Core 3.1.
Update: Here is a small project with just the code you need to reproduce the error. I tried it with a small dataset and the error is the same. MLMulticlassExperiment.zip
Thanks for providing the example and version information:) So right now I can reproduce the error. I'll take a look into this.
The reason for the overflow is as follows:
The reservation_status_date column has a fairly large number of unique values, which triggers AutoML to map each unique value to a vector of size 65536. AutoML also assumes that there could be up to 65536 unique values, so a vector of overall size 65536*65536 is generated to store the mapped values. That size exceeds a threshold, which is why you see an overflow. I think this is not a bug but something that can be improved in the future, as I do not expect the classification module in ML.NET to handle classification with millions of categories. (I'll discuss this with my team.)
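To make the arithmetic concrete, here is a rough check. It assumes the vector size is computed as a signed 32-bit integer, which is an inference from the `Int32[] dimensions` parameter in the stack trace rather than something confirmed in this thread. Under that assumption, the product of the two dimensions is just past the largest representable value:

```latex
65536 \times 65536 \;=\; 2^{16} \cdot 2^{16} \;=\; 2^{32} \;=\; 4{,}294{,}967{,}296 \;>\; 2^{31} - 1 \;=\; 2{,}147{,}483{,}647
```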
As a workaround (and perhaps a better approach), you can pre-process the reservation_status_date column before using AutoML (which is not automatic enough in this case). For example, a value like 6/23/2015 12:00:00 AM can be parsed into three float columns: 6 (reservation_status_month), 23 (reservation_status_date), and 2015 (reservation_status_year). The 12:00:00 AM part seems to stay the same throughout the dataset, so you can safely ignore it. This approach not only avoids the overflow but also provides a more useful feature with clearer information.
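A minimal sketch of that pre-processing step in plain C#, assuming every value follows the `M/d/yyyy h:mm:ss tt` pattern of the example above. The `ReservationDateSplitter` class and its `Split` method are hypothetical names used only for illustration. They are not part of ML.NET, and how the resulting values are written back into the dataset is left out.

```csharp
using System;
using System.Globalization;

public static class ReservationDateSplitter
{
    // Hypothetical helper: splits a raw value such as "6/23/2015 12:00:00 AM"
    // into month, day, and year as floats. The time of day is parsed but discarded.
    public static (float Month, float Day, float Year) Split(string raw)
    {
        // Assumption: the column always uses this exact format.
        // ParseExact throws a FormatException for values that do not match.
        DateTime parsed = DateTime.ParseExact(
            raw.Trim(),
            "M/d/yyyy h:mm:ss tt",
            CultureInfo.InvariantCulture,
            DateTimeStyles.None);

        return (parsed.Month, parsed.Day, parsed.Year);
    }
}
```

Calling `ReservationDateSplitter.Split("6/23/2015 12:00:00 AM")` would yield the month, day, and year components, which could then replace the original text column before the data is passed to AutoML.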
Feel free to reach out if you still have any questions.
Thank you. I made the change you recommended and it solved the problem. If I may make a suggestion: the name of the column with the problematic values could appear in the inner exception, so that anyone can see what to correct in the data.
Glad to know that the problem was solved. And yes I think it's better to improve the exception message by adding the column name. I'll have a PR out soon for this.
Do I have to close the issue, or do you close it when the PR is done?
Hi @DFMERA , the issue will close when PR #5232 is merged. You don't need to do anything else!
Hi @DFMERA ,
I took a look at the AutoML samples in our codebase and found that they use Evaluate() instead of CrossValidate() (https://github.com/dotnet/machinelearning/blob/master/docs/samples/Microsoft.ML.AutoML.Samples/MulticlassClassificationExperiment.cs#L40):
// STEP 4: Evaluate test data
IDataView testDataViewWithBestScore = bestRun.Model.Transform(testDataView);
MulticlassClassificationMetrics testMetrics = mlContext.MulticlassClassification.Evaluate(testDataViewWithBestScore, labelColumnName: LabelColumnName);
CrossValidate() and CreateMulticlassClassificationExperiment(...).Execute(...) are both training methods, while Evaluate() is for evaluation (inference). Is Evaluate() what you initially wanted to use?
Hi. I think that part of the code is fine because it is for evaluating the model with test data that was not included in the training process, and that code is based on the ML.NET documentation on this page: https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.automl.multiclassclassificationexperiment?view=ml-dotnet-preview
Although, I did see an error in the test IDataView (testDataView) since it contained the entire dataset including the data that was used in the training process. But I fixed that after excluding the column with the problem that you told me, so I don't know if that could also have solved the problem.
Hi @DFMERA ,
The code in the link in your reply uses Evaluate() in step 4, but in your source code CrossValidate() is used. And please note that CrossValidate() is not for evaluating the data but training the data.
To solve the problem in summary:
1. You may want to change this line from
var testMetrics = mlContext.MulticlassClassification.CrossValidate(testDataViewWithBestScore, bestRun.Estimator, numberOfFolds: 5, labelColumnName: "reservation_status");
to
MulticlassClassificationMetrics testMetrics = mlContext.MulticlassClassification.Evaluate(testDataViewWithBestScore, labelColumnName: LabelColumnName);
following the documentation at https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.automl.multiclassclassificationexperiment?view=ml-dotnet-preview
2. Do something with the reservation_status_date column: either exclude it or split it. This is optional, because if you have done step 1 there should not be any exceptions.
System information
Issue
What did you do? I am creating a multiclass classification experiment. After the best model is selected, I try to evaluate the model, but this method throws an exception:
var testMetrics = mlContext.MulticlassClassification.CrossValidate(testDataViewWithBestScore, bestRun.Estimator, numberOfFolds: 5, labelColumnName: "reservation_status");
What happened? The mlContext.MulticlassClassification.CrossValidate throws an exception
What did you expect? To recover the metrics of the model on test data
Source code / logs
CODE
var tmpPath = GetAbsolutePath(TRAIN_DATA_FILEPATH);
IDataView trainingDataView = mlContext.Data.LoadFromTextFile(
    path: tmpPath,
    hasHeader: true,
    separatorChar: '\t',
    allowQuoting: true,
    allowSparse: false);

// STEP 2: Run AutoML experiment
Console.WriteLine($"Running AutoML Multiclass classification experiment for {ExperimentTime} seconds...");
ExperimentResult experimentResult = mlContext.Auto()
    .CreateMulticlassClassificationExperiment(ExperimentTime)
    .Execute(trainingDataView, labelColumnName: "reservation_status");
EXCEPTION
Unhandled Exception: System.OverflowException: Arithmetic operation resulted in an overflow.
   at Microsoft.ML.Data.VectorDataViewType.ComputeSize(ImmutableArray`1 dims)
   at Microsoft.ML.Data.VectorDataViewType..ctor(PrimitiveDataViewType itemType, Int32[] dimensions)
   at Microsoft.ML.Transforms.KeyToVectorMappingTransformer.Mapper..ctor(KeyToVectorMappingTransformer parent, DataViewSchema inputSchema)
   at Microsoft.ML.Transforms.KeyToVectorMappingTransformer.MakeRowMapper(DataViewSchema schema)
   at Microsoft.ML.Data.RowToRowTransformerBase.GetOutputSchema(DataViewSchema inputSchema)
   at Microsoft.ML.Data.TrivialEstimator`1.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.Transforms.OneHotHashEncodingTransformer..ctor(HashingEstimator hash, IEstimator`1 keyToVector, IDataView input)
   at Microsoft.ML.Transforms.OneHotHashEncodingEstimator.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.TrainCatalogBase.CrossValidateTrain(IDataView data, IEstimator`1 estimator, Int32 numFolds, String samplingKeyColumn, Nullable`1 seed)
   at Microsoft.ML.MulticlassClassificationCatalog.CrossValidate(IDataView data, IEstimator`1 estimator, Int32 numberOfFolds, String labelColumnName, String samplingKeyColumnName, Nullable`1 seed)
   at ConsoleAppML2ML.ConsoleApp.ModelBuilder.CreateExperiment() in C:\repos\Curso ML\Bootcamp-Handytec\Clasificacion_multiclase\ConsoleAppML2\ConsoleAppML2ML.ConsoleApp\ModelBuilder.cs:line 77
   at ConsoleAppML2ML.ConsoleApp.Program.Main(String[] args) in C:\repos\Curso ML\Bootcamp-Handytec\Clasificacion_multiclase\ConsoleAppML2\ConsoleAppML2ML.ConsoleApp\Program.cs:line 20
HotelBookings.tsv.zip