dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

LightGBM bad allocation crash #6817

Open Ceeeeed opened 10 months ago

Ceeeeed commented 10 months ago

Hello, during an AutoML regression training session, after more than three hours of training and 5 successfully trained models, [LightGBM] [Warning] bad allocation warnings start showing up in the console, and after a while the program crashes.

About my dataset:

111 275 columns and 872 rows (including the label column). It contains only floats in the range -1 to 1.
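
For a sense of scale, here is a rough back-of-the-envelope calculation of the raw size of such a dataset. It is only a sketch: it reads "111 275" as 111,275 columns, assumes every value is stored densely as a 4-byte float, and says nothing about what LightGBM or ML.NET allocate internally during training.

```csharp
using System;

// Dimensions taken from the dataset description above
// (assumption: "111 275" means 111,275 columns).
const long columns = 111_275;
const long rows = 872;

long cells = columns * rows;

// Assumption: one 4-byte float (System.Single) per cell, stored densely.
long bytesAsFloat = cells * sizeof(float);
double mebibytes = bytesAsFloat / (1024.0 * 1024.0);

Console.WriteLine($"{cells} cells, about {mebibytes:F0} MiB as dense float32");
```

Under these assumptions the raw values come to roughly 97 million cells, or about 370 MiB, which is well below the RAM figures reported in the logs further down.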

Code responsible for training the model:

public async Task<IEnumerable<TrialResult>> TrainModel(DataOperationsCatalog.TrainTestData trainValidationData)
{
    Logger.Log("Running the experiment...");

    AutoMLExperiment experiment = MLContext.Auto().CreateExperiment();

    // Configure the experiment: pipeline to sweep, metric to optimize, time budget, and data.
    experiment
        .SetPipeline(pipeline)
        .SetRegressionMetric(RegressionMetric.MeanAbsoluteError)
        .SetTrainingTimeInSeconds(maxTrainingTime)
        .SetDataset(trainValidationData);

    CancellationTokenSource cts = new();

    // Custom monitor that logs each trial and keeps the completed ones.
    AutoMLMonitor monitor = new(pipeline, maxTrainingTime, maxTrainingIterations, cts);
    experiment.SetMonitor(monitor);

    await experiment.RunAsync(cts.Token);
    return monitor.GetCompletedTrials();
}

My pipeline is simple:

pipeline = MLContext.Auto().Regression(useFastForest: false, useFastTree: false, useLbfgs: false, useLgbm: true, useSdca: false);

Logs:

(In the timestamp, the first number is the application running time (hh:mm) and the second number is the local time (hh:mm:ss))

[0:00 - 06:59:59] Loading data set...
[0:01 - 07:01:09] Creating data view...
[0:01 - 07:01:11] Running the experiment...
[0:01 - 07:01:11] Model 1 started training using LightGbmRegression algorithm
[0:01 - 07:01:20] 10s/80000s (0.01 %) - Model finished training in 9.44s using LightGbmRegression algorithm (CPU: 95.51 %, RAM: 1828.41) - result: 15.08 %
[0:01 - 07:01:20] ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲ New best result! (15.08 %) ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲
[0:01 - 07:01:20] Model 2 started training using LightGbmRegression algorithm
[0:03 - 07:02:42] 91s/80000s (0.11 %) - Model finished training in 81.75s using LightGbmRegression algorithm (CPU: 102.15 %, RAM: 1971.35) - result: 8.46 %
[0:03 - 07:02:42] ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲ New best result! (8.46 %) ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲
[0:03 - 07:02:42] Model 3 started training using LightGbmRegression algorithm
[0:10 - 07:09:56] 526s/80000s (0.66 %) - Model finished training in 434.11s using LightGbmRegression algorithm (CPU: 101.56 %, RAM: 2008.12) - result: 6.49 %
[0:10 - 07:09:56] ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲ New best result! (6.49 %) ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲
[0:10 - 07:09:56] Model 4 started training using LightGbmRegression algorithm
[3:09 - 10:08:58] 11267s/80000s (14.08 %) - Model finished training in 10741.15s using LightGbmRegression algorithm (CPU: 104.30 %, RAM: 5345.96) - result: 5.58 %
[3:09 - 10:08:58] ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲ New best result! (5.58 %) ▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲▲
[3:09 - 10:08:58] Model 5 started training using LightGbmRegression algorithm
[3:21 - 10:20:58] 11987s/80000s (14.98 %) - Model finished training in 720.38s using LightGbmRegression algorithm (CPU: 102.34 %, RAM: 2526.46) - result: 8.69 %
[3:21 - 10:20:58] Model 6 started training using LightGbmRegression algorithm
[LightGBM] [Warning] bad allocation
[LightGBM] [Warning] bad allocation
[LightGBM] [Warning] [LightGBM] [Warning] bad allocation
bad allocation
[LightGBM] [Warning] bad allocation
[LightGBM] [Warning] bad allocation
[LightGBM] [Warning] bad allocation
(...) more of those

full log.txt
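
In case it helps anyone reading the logs, the "finished training" lines can be parsed to track training time and the RAM reading per model. This is only a sketch: the regular expression is written against the line format shown above, the variable names are made up for illustration, and the unit of the RAM value is not stated in the log.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Two "finished training" lines copied from the log above.
string[] lines =
{
    "[0:01 - 07:01:20] 10s/80000s (0.01 %) - Model finished training in 9.44s using LightGbmRegression algorithm (CPU: 95.51 %, RAM: 1828.41) - result: 15.08 %",
    "[3:09 - 10:08:58] 11267s/80000s (14.08 %) - Model finished training in 10741.15s using LightGbmRegression algorithm (CPU: 104.30 %, RAM: 5345.96) - result: 5.58 %",
};

// Captures: application running time, local time, training seconds, RAM reading.
var pattern = new Regex(
    @"^\[(?<run>\d+:\d{2}) - (?<local>\d{2}:\d{2}:\d{2})\].*finished training in (?<secs>[\d.]+)s.*RAM: (?<ram>[\d.]+)\)");

var parsed = new List<(string Run, string Local, double Seconds, double Ram)>();

foreach (string line in lines)
{
    Match m = pattern.Match(line);
    if (!m.Success)
    {
        continue; // skip other lines, e.g. the "New best result!" banners
    }

    string run = m.Groups["run"].Value;
    string local = m.Groups["local"].Value;
    double seconds = double.Parse(m.Groups["secs"].Value, CultureInfo.InvariantCulture);
    double ram = double.Parse(m.Groups["ram"].Value, CultureInfo.InvariantCulture);

    parsed.Add((run, local, seconds, ram));
}

foreach (var entry in parsed)
{
    Console.WriteLine($"run={entry.Run} local={entry.Local} train={entry.Seconds}s ram={entry.Ram}");
}
```

Running this over the full log would give one entry per finished model, which makes it easier to see how training time and the RAM reading change from trial to trial.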