dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

MulticlassClassification.CrossValidate Arithmetic operation resulted in an overflow #5211

Closed DFMERA closed 4 years ago

DFMERA commented 4 years ago

System information

Issue

Source code / logs

CODE

        var tmpPath = GetAbsolutePath(TRAIN_DATA_FILEPATH);
        IDataView trainingDataView = mlContext.Data.LoadFromTextFile(
            path: tmpPath,
            hasHeader: true,
            separatorChar: '\t',
            allowQuoting: true,
            allowSparse: false);

        IDataView testDataView = mlContext.Data.BootstrapSample(trainingDataView);

        // STEP 2: Run AutoML experiment
        Console.WriteLine($"Running AutoML Multiclass classification experiment for {ExperimentTime} seconds...");
        ExperimentResult experimentResult = mlContext.Auto()
            .CreateMulticlassClassificationExperiment(ExperimentTime)
            .Execute(trainingDataView, labelColumnName: "reservation_status");

        // STEP 3: Print metric from the best model
        RunDetail<MulticlassClassificationMetrics> bestRun = experimentResult.BestRun;
        Console.WriteLine($"Total models produced: {experimentResult.RunDetails.Count()}");
        Console.WriteLine($"Best model's trainer: {bestRun.TrainerName}");
        Console.WriteLine($"Metrics of best model from validation data --");
        PrintMulticlassClassificationMetrics(bestRun.ValidationMetrics);

        // STEP 4: Evaluate test data
        IDataView testDataViewWithBestScore = bestRun.Model.Transform(testDataView);
        var testMetrics = mlContext.MulticlassClassification.CrossValidate(testDataViewWithBestScore, bestRun.Estimator, numberOfFolds: 5, labelColumnName: "reservation_status");

EXCEPTION

Unhandled Exception: System.OverflowException: Arithmetic operation resulted in an overflow.
   at Microsoft.ML.Data.VectorDataViewType.ComputeSize(ImmutableArray`1 dims)
   at Microsoft.ML.Data.VectorDataViewType..ctor(PrimitiveDataViewType itemType, Int32[] dimensions)
   at Microsoft.ML.Transforms.KeyToVectorMappingTransformer.Mapper..ctor(KeyToVectorMappingTransformer parent, DataViewSchema inputSchema)
   at Microsoft.ML.Transforms.KeyToVectorMappingTransformer.MakeRowMapper(DataViewSchema schema)
   at Microsoft.ML.Data.RowToRowTransformerBase.GetOutputSchema(DataViewSchema inputSchema)
   at Microsoft.ML.Data.TrivialEstimator`1.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.Transforms.OneHotHashEncodingTransformer..ctor(HashingEstimator hash, IEstimator`1 keyToVector, IDataView input)
   at Microsoft.ML.Transforms.OneHotHashEncodingEstimator.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.TrainCatalogBase.CrossValidateTrain(IDataView data, IEstimator`1 estimator, Int32 numFolds, String samplingKeyColumn, Nullable`1 seed)
   at Microsoft.ML.MulticlassClassificationCatalog.CrossValidate(IDataView data, IEstimator`1 estimator, Int32 numberOfFolds, String labelColumnName, String samplingKeyColumnName, Nullable`1 seed)
   at ConsoleAppML2ML.ConsoleApp.ModelBuilder.CreateExperiment() in C:\repos\Curso ML\Bootcamp-Handytec\Clasificacion_multiclase\ConsoleAppML2\ConsoleAppML2ML.ConsoleApp\ModelBuilder.cs:line 77
   at ConsoleAppML2ML.ConsoleApp.Program.Main(String[] args) in C:\repos\Curso ML\Bootcamp-Handytec\Clasificacion_multiclase\ConsoleAppML2\ConsoleAppML2ML.ConsoleApp\Program.cs:line 20

HotelBookings.tsv.zip

wangyems commented 4 years ago

Thanks for providing the information. I ran part of your code (the only difference is that I used the multiclass classification dataset from the ML.NET samples) but was not able to reproduce the error. It would help if you could share how you load the data from the file, if you feel comfortable doing so. Either way, I'll spend some time trying to reproduce the error with the dataset you uploaded.

The error seems to be related to HashingEstimator, whose core functionality changed in release 1.5. As a possible workaround, try downgrading to release 1.5 preview 2 and see if that solves the problem.

DFMERA commented 4 years ago

Thank you for the reply. I'm using this code to load the data from the file:

        var tmpPath = GetAbsolutePath(TRAIN_DATA_FILEPATH);
        IDataView trainingDataView = mlContext.Data.LoadFromTextFile(
            path: tmpPath,
            hasHeader: true,
            separatorChar: '\t',
            allowQuoting: true,
            allowSparse: false);

I think the problem is with big datasets like the one I attached. Thanks.

wangyems commented 4 years ago

Which ML.NET version are you using? The code you provided above does not quite fit our current release.

DFMERA commented 4 years ago

I just updated to ML.NET 16.0 and .NET Core 3.1. I'm gonna create a small, simple project with just the code you need to reproduce the error. I've tried different datasets and different experiment options, and the error is the same in the multiclass experiment.

wangyems commented 4 years ago

Update: I truncated your dataset to a few hundred rows and your code works fine, so you may be right that the problem is due to the large dataset. I am now running it on the original dataset and waiting for the results.

DFMERA commented 4 years ago

CORRECTION. I'm using these versions of ML.NET: "Microsoft.ML" Version="1.5.0" and "Microsoft.ML.AutoML" Version="0.17.0", with .NET Core 3.1.

DFMERA commented 4 years ago

Update: Here is a small project with just the code you need to reproduce the error. I tried with a small dataset and the error is the same.

MLMulticlassExperiment.zip

wangyems commented 4 years ago

Thanks for providing the example and version information:) So right now I can reproduce the error. I'll take a look into this.

wangyems commented 4 years ago

The reason for throwing overflow is as follows:

The reservation_status_date column has a large number of unique values, which triggers AutoML to map each of those unique values to a vector of size 65536. AutoML also assumes that there could be up to 65536 unique values, so a vector with an overall size of 65536*65536 is generated to store those mapped values. That size is bigger than some threshold, which is why you see an overflow. I think this is not a bug, but it is something that can be improved in the future, as I expect this classification module in ML.NET cannot deal with classification over millions of categories. (I'll discuss this with my team.)
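To make the arithmetic concrete, here is a minimal sketch. It assumes the size is computed as a checked 32-bit product, which the stack trace suggests (`Int32[] dimensions`, `OverflowException`) but this thread does not confirm. `VectorSizeCheck` and `CheckedProduct` are hypothetical names for illustration, not ML.NET APIs.

```csharp
using System;

static class VectorSizeCheck
{
    // Hypothetical stand-in for a vector size computation: multiplies two
    // dimensions as 32-bit integers and throws if the result does not fit.
    public static int CheckedProduct(int rows, int columns)
    {
        return checked(rows * columns);
    }

    static void Main()
    {
        int hashSlots = 65536; // 2^16

        // The exact product needs 64 bits: 2^16 * 2^16 = 2^32.
        long exact = (long)hashSlots * hashSlots;
        Console.WriteLine($"Exact size: {exact}, Int32.MaxValue: {int.MaxValue}");

        try
        {
            CheckedProduct(hashSlots, hashSlots);
        }
        catch (OverflowException ex)
        {
            // Same exception type as the one reported in this issue.
            Console.WriteLine(ex.Message);
        }
    }
}
```

The exact product is 4,294,967,296, about twice `Int32.MaxValue` (2,147,483,647), so any 32-bit size computation over these two dimensions has to fail.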

As a workaround (and perhaps a better approach overall), you can pre-process the reservation_status_date column before using AutoML (which is not automatic enough in this case). For example, a value like 6/23/2015 12:00:00 AM can be parsed into three float columns: 6 (reservation_status_month), 23 (reservation_status_date), and 2015 (reservation_status_year). The 12:00:00 AM part seems to stay the same throughout the dataset, so you can safely ignore it. This approach not only avoids the overflow but also provides a more useful feature with clearer information.
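The date split described above could be sketched as follows. This is plain C# that runs before the data reaches ML.NET. `ReservationDateFeatures` and `Split` are hypothetical names, and the sketch assumes every value uses the month/day/year format from the example.

```csharp
using System;
using System.Globalization;

static class ReservationDateFeatures
{
    // Parses a timestamp such as "6/23/2015 12:00:00 AM" and returns the
    // month, day, and year as floats. The time of day is dropped.
    public static (float Month, float Day, float Year) Split(string raw)
    {
        // InvariantCulture reads the value as month/day/year regardless of
        // the machine's regional settings.
        DateTime parsed = DateTime.Parse(raw, CultureInfo.InvariantCulture);
        return ((float)parsed.Month, (float)parsed.Day, (float)parsed.Year);
    }

    static void Main()
    {
        var (month, day, year) = Split("6/23/2015 12:00:00 AM");
        Console.WriteLine($"month={month} day={day} year={year}");
    }
}
```

The three values can then be written out as separate numeric columns in place of the original text column, so the column no longer needs to be hash-encoded as a high-cardinality categorical.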

Feel free to reach out if you still have any questions.

DFMERA commented 4 years ago

Thank you. I made the change you recommended and it solved the problem. If I'm allowed to make a suggestion, maybe the name of the column with the problematic values could appear in the inner exception, so anyone can know what to correct in the data.

wangyems commented 4 years ago

Glad to know that the problem was solved. And yes I think it's better to improve the exception message by adding the column name. I'll have a PR out soon for this.

DFMERA commented 4 years ago

Do I have to close the issue, or do you close it when the PR is done?

mstfbl commented 4 years ago

Hi @DFMERA , the issue will close when PR #5232 is merged. You don't need to do anything else!

wangyems commented 4 years ago

Hi @DFMERA, I took a look at the AutoML samples in our codebase and found that they use Evaluate() instead of CrossValidate() (https://github.com/dotnet/machinelearning/blob/master/docs/samples/Microsoft.ML.AutoML.Samples/MulticlassClassificationExperiment.cs#L40):

        // STEP 4: Evaluate test data
        IDataView testDataViewWithBestScore = bestRun.Model.Transform(testDataView);
        MulticlassClassificationMetrics testMetrics = mlContext.MulticlassClassification.Evaluate(testDataViewWithBestScore, labelColumnName: LabelColumnName);

CrossValidate() and CreateMulticlassClassificationExperiment(...).Execute(...) are both training methods, while Evaluate() is for inference (evaluation). Is Evaluate() the method you initially wanted to use?

DFMERA commented 4 years ago

Hi. I think that part of the code is fine because it is for evaluating the model with test data that was not included in the training process, and that code is based on the ML.NET documentation on this page: https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.automl.multiclassclassificationexperiment?view=ml-dotnet-preview

Although, I did see an error in the test IDataView (testDataView) since it contained the entire dataset including the data that was used in the training process. But I fixed that after excluding the column with the problem that you told me, so I don't know if that could also have solved the problem.
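One way to keep training rows out of the test set is a holdout split. This sketch assumes the Microsoft.ML NuGet package is referenced. `Row`, `HoldoutSplitSketch`, and `SplitCounts` are hypothetical names, and the synthetic rows stand in for the real dataset. It is not necessarily how the test set was fixed in this thread.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical single-column row type used only for this illustration.
class Row
{
    public float Value { get; set; }
}

static class HoldoutSplitSketch
{
    // Splits rowCount synthetic rows and returns how many landed in each set.
    public static (int Train, int Test) SplitCounts(int rowCount, double testFraction)
    {
        var mlContext = new MLContext(seed: 0);
        IDataView all = mlContext.Data.LoadFromEnumerable(
            Enumerable.Range(0, rowCount).Select(i => new Row { Value = i }));

        // TrainTestSplit assigns each row to exactly one of the two sets.
        var split = mlContext.Data.TrainTestSplit(all, testFraction: testFraction);

        int train = mlContext.Data.CreateEnumerable<Row>(split.TrainSet, reuseRowObject: false).Count();
        int test = mlContext.Data.CreateEnumerable<Row>(split.TestSet, reuseRowObject: false).Count();
        return (train, test);
    }

    static void Main()
    {
        var (train, test) = SplitCounts(100, 0.2);
        Console.WriteLine($"train={train} test={test}");
    }
}
```

The test fraction is approximate, so the counts need not be exactly 80 and 20, but the two sets never share a row. The training set would go to the experiment's Execute(...) call and the test set to Model.Transform(...).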

wangyems commented 4 years ago

Hi @DFMERA, the code in the link in your reply uses Evaluate() in step 4, but your source code uses CrossValidate(). Please note that CrossValidate() does not evaluate an existing model on the data; it trains on the data. To summarize the solution:

1. You may want to change this line from

        var testMetrics = mlContext.MulticlassClassification.CrossValidate(testDataViewWithBestScore, bestRun.Estimator, numberOfFolds: 5, labelColumnName: "reservation_status");

   to

        MulticlassClassificationMetrics testMetrics = mlContext.MulticlassClassification.Evaluate(testDataViewWithBestScore, labelColumnName: LabelColumnName);

   following https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.automl.multiclassclassificationexperiment?view=ml-dotnet-preview

2. Do something with the column reservation_status_date: either exclude it or split it. This is optional, because if you have done step 1 there should not be any exceptions.