dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

AutoML Binary Classification Experiment run for few second and finished without models #6882

Open 80LevelElf opened 7 months ago

80LevelElf commented 7 months ago

System Information (please complete the following information):

Describe the bug

At the moment we use ML.NET 2, but because of the bug fix in https://github.com/dotnet/machinelearning/pull/6571 we have to switch to ML.NET 3 to train our binary classification models (we need the PositiveRecall optimization metric).

But it looks like the binary classification experiment is somehow broken in ML.NET 3:

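        var settings = new BinaryExperimentSettings
        {
            MaxExperimentTimeInSeconds = 30 * 60,
            //MaxModels = 10,
            OptimizingMetric = BinaryClassificationMetric.PositiveRecall,
            MaximumMemoryUsageInMegaByte = 7500,
            UseAutoZeroTuner = false
        };

        // Assumed step, not shown in the original snippet: create the experiment from the settings.
        // `mlContext` is a hypothetical MLContext instance.
        var experiment = mlContext.Auto().CreateBinaryClassificationExperiment(settings);

        // Label column, then LearningGroup as the sampling key column.
        ExperimentResult<BinaryClassificationMetrics> experimentResult = experiment
            .Execute(trainDataView, nameof(MlModelRow.Label), nameof(MlModelRow.LearningGroup));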

We use only the FastForest and LightGBM trainers. On my local PC (Windows 10) it works fine, but in the production Docker image (Alpine Linux) training finishes after 10-30 seconds with:

Training time finished without completing a successful trial. Either no trial completed or the metric for all completed trials are NaN or Infinity

I have tried to:

  1. Use MaxModels = 10 together with MaxExperimentTimeInSeconds
  2. Use MaxModels = 10 instead of MaxExperimentTimeInSeconds
  3. Set UseAutoZeroTuner to true

But nothing works. An important point: MLNET_BACKEND is not set, so we are not using OneDAL in the production or test environment.

80LevelElf commented 7 months ago

I have just tried switching to OneDAL mode in production. It doesn't help (

80LevelElf commented 7 months ago

Are there maybe any temporary workarounds?

I think it's a really big problem, given that ML.NET 3 should be released this month.

80LevelElf commented 7 months ago

I have tried it with the new ML.NET 3.

Same behavior.

80LevelElf commented 7 months ago

@LittleLittleCloud @luisquintanilla

Hi friends! Is there maybe any workaround, or anything we can check on our side?

80LevelElf commented 6 months ago

So I have found the problem: it is caused by MaximumMemoryUsageInMegaByte = 7500.

Right after the start, used memory exceeds 7500 MB and training is canceled.

At first glance this is understandable behavior, but it doesn't seem very useful. In fact, ML.NET doesn't manage memory consumption in our case. We have to choose between:

But can't ML.NET control the number of models trained at one time based on the memory limit? For example, if the limit is 7500 MB and one model needs 2500 MB to train, it could start 3 models.
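
For anyone hitting the same error, here is a minimal sketch of the settings from the original report with the memory cap simply left out, so trials are not canceled for crossing the 7500 MB threshold. It assumes that leaving MaximumMemoryUsageInMegaByte unset means no cap is enforced. It also does nothing to reduce actual memory use, so the container still needs enough RAM for training.

```csharp
using Microsoft.ML.AutoML;

// Same settings as in the original report, minus MaximumMemoryUsageInMegaByte.
// Assumption: when the property is not set, AutoML does not cancel trials
// based on memory usage.
var settings = new BinaryExperimentSettings
{
    MaxExperimentTimeInSeconds = 30 * 60,
    OptimizingMetric = BinaryClassificationMetric.PositiveRecall,
    UseAutoZeroTuner = false
};
```

Another option would be to keep the cap but raise it above the memory the process already uses at startup, as long as the container limit allows it.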