dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

ML.Net: System.OutOfMemoryException: 'Exception of type 'System.OutOfMemoryException' was thrown.' on small dataset #7040

Open harry-hathorn opened 7 months ago

harry-hathorn commented 7 months ago

System Information (please complete the following information):

Describe the bug: Training a model throws an out-of-memory exception, even though the PC never uses more than 20% of its memory. The project is built for Any CPU.

To Reproduce: Steps to reproduce the behavior:

  1. Load a text-based dataset with 700,000 rows (60 MB) and 2 columns (feature and label)
  2. Run Transforms.Conversion.MapValueToKey on the label column
  3. Run Transforms.Text.FeaturizeText on the feature column
  4. Append a mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features") trainer
  5. Attempt to Fit the model
  6. Receive an out-of-memory exception on trainingPipeline.Fit(trainData):
    System.OutOfMemoryException
    HResult=0x8007000E
    Message=Exception of type 'System.OutOfMemoryException' was thrown.
    Source=Microsoft.ML.Core
    StackTrace:
    at Microsoft.ML.Internal.Utilities.VBufferUtils.CreateDense[T](Int32 length)
    at Microsoft.ML.Trainers.SdcaTrainerBase`3.TrainCore(IChannel ch, RoleMappedData data, LinearModelParameters predictor, Int32 weightSetCount)
    at Microsoft.ML.Trainers.StochasticTrainerBase`2.TrainModelCore(TrainContext context)
    at Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor)
    at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
    at Program.<Main>$(String[] args) in C:\Ml.Product.2\Ml.Product.2\Program.cs:line 29

Expected behavior: I have a 60 MB CSV with 700,000 rows, which IMO is not a huge amount. My machine has 32 GB of memory, and usage stays below 20% while I watch performance during training. I also tried a 64-bit release build and still ran into the out-of-memory exception. Could someone please advise me on what I am doing wrong, or is this a bug? Eventually I want to train on much larger data sets; surely ML.NET should be able to do that?
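The stack trace fails inside VBufferUtils.CreateDense, called from the SDCA trainer's TrainCore with a weightSetCount argument, so the failing allocation is a dense buffer rather than the 60 MB input file. The sketch below is a rough sanity check of how large such buffers could get. It assumes the trainer keeps one dense float vector of feature-vector length per class, which is an assumption about the trainer's internals and not something confirmed in this report. The feature count, class count, and number of copies are hypothetical placeholders, not measurements from my data.

```csharp
using System;

// Hypothetical placeholders: replace with the real values for the dataset.
long featureCount = 1_000_000; // assumed length of the "Features" vector from FeaturizeText
long classCount = 2_000;       // assumed number of distinct CategoryName values
long denseCopies = 1;          // assumed number of dense weight sets kept per class

// Assumption: one dense float (4-byte) vector of featureCount entries per class and copy.
long bytes = featureCount * classCount * denseCopies * sizeof(float);
double gib = bytes / (1024.0 * 1024.0 * 1024.0);

Console.WriteLine($"Estimated dense weight memory: {gib:F2} GiB");
```

If the estimate comes out close to or above the available memory, the size of the feature vector or the number of categories would be the first things to look at, rather than the number of rows.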

Screenshots, Code, Sample Projects

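using Microsoft.ML;
using Microsoft.ML.Data;
using static Microsoft.ML.DataOperationsCatalog; // brings TrainTestData into scope

MLContext _mlContext;
PredictionEngine<MlProduct, MlProductPrediction> _predictionEngine; // not used in this repro
ITransformer _trainedModel;
IDataView _trainingDataView;

_mlContext = new MLContext();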

_trainingDataView = LoadDataFromCSV();

TrainTestData dataSplit = _mlContext.Data.TrainTestSplit(_trainingDataView, testFraction: 0.2);
IDataView trainData = dataSplit.TrainSet;
IDataView testData = dataSplit.TestSet;

var pipeline = _mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "CategoryName", outputColumnName: "Label")
           .Append(_mlContext.Transforms.Text.FeaturizeText("Features", "ProductName"));

var trainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
       .Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

_trainedModel = trainingPipeline.Fit(trainData);

IDataView transformTest = _trainedModel.Transform(testData);

public class MlProduct
{

    [LoadColumn(0)]
    [ColumnName("ProductName")]
    public string ProductName { get; set; }
    [LoadColumn(1)]
    [ColumnName("CategoryName")]
    public string CategoryName { get; set; }
}

public class MlProductPrediction
{
    [ColumnName("PredictedLabel")]
    public string CategoryName;

    [ColumnName("PredictionScore")]
    public float Score { get; set; }
}
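
The LoadDataFromCSV() helper called above is not included in the snippet. A minimal sketch of what such a helper might look like, assuming a comma-separated file with a header row and quoted fields. The file name "products.csv", the separator, the header and quoting settings, and the two-parameter signature are all hypothetical and differ from the parameterless call in the original code.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// "products.csv" is a hypothetical path used only for illustration.
IDataView data = LoadDataFromCSV(mlContext, "products.csv");

// Hypothetical stand-in for the helper that the report does not show.
static IDataView LoadDataFromCSV(MLContext mlContext, string path)
{
    // Assumptions: comma separator, one header row, quoted fields allowed
    // so that product names containing commas stay in a single column.
    return mlContext.Data.LoadFromTextFile<MlProduct>(
        path,
        separatorChar: ',',
        hasHeader: true,
        allowQuoting: true);
}

// Same shape as the MlProduct class in the report.
public class MlProduct
{
    [LoadColumn(0)]
    [ColumnName("ProductName")]
    public string ProductName { get; set; }

    [LoadColumn(1)]
    [ColumnName("CategoryName")]
    public string CategoryName { get; set; }
}
```

The returned IDataView is loaded lazily, so reading the 700,000 rows happens when the pipeline is fitted rather than when this helper returns.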