dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

AutoML 2 is way worse than 1.7.1 (for me) #6552

Open TT-Dev1 opened 1 year ago

TT-Dev1 commented 1 year ago

Win10 / ML.NET 1.7.1 vs. 2.0.0 / .NET Framework 4.8

AutoML 2.0 is way worse for me than the previous 1.7.1 release. I tried using the Featurizer, or even removing it completely and doing it all by hand -- in 2 days of fiddling I cannot create a model that is anywhere close to the one created with the old CreateRegressionExperiment() version of the previous release.

[image]

To Reproduce

Steps to reproduce the behavior: For 2.0 (where the problem is) I used the same code as this sample (but with my objects): https://github.com/dotnet/machinelearning-samples/tree/main/samples/csharp/getting-started/MLNET2/AutoMLAdvanced

//Define pipeline
SweepablePipeline pipeline =
    ctx.Auto().Featurizer(data, columnInformation: columnInference.ColumnInformation)
        .Append(ctx.Auto().Regression(labelColumnName: columnInference.ColumnInformation.LabelColumnName, useLgbm: false));

// Create AutoML experiment
AutoMLExperiment experiment = ctx.Auto().CreateExperiment();

// Configure experiment
experiment
    .SetPipeline(pipeline)
    .SetRegressionMetric(RegressionMetric.RSquared, labelColumn: columnInference.ColumnInformation.LabelColumnName)
    .SetTrainingTimeInSeconds(60)
    .SetGridSearchTuner()
    .SetDataset(trainValidationData);

// Run experiment
var cts = new CancellationTokenSource();
TrialResult experimentResults = await experiment.RunAsync(cts.Token);

I also unwound the featurizer and did all the same steps by hand and they worked with 1.7.1.

Expected behavior

To be able to train a model that works as well as the last version.

Additional context

NOTE: I had all kinds of different versions on my machine and completely uninstalled Visual Studio, deleted the directory, etc.

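Maybe relevant? Now, after re-installing VS and adding ML.net, I no longer have the ability to edit notebooks (.ipynb).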

ANY IDEAS WHERE I CAN DEBUG MORE? OR TELL ME WHAT YOU WOULD LIKE ME TO SHARE SO THAT I CAN BE MORE HELPFUL.

TT-Dev1 commented 1 year ago

More information...

I was able to get rid of the tails that I circled in red by adding a binary OneHotEncoding (Binary) to my pipeline:

mlContext.Transforms.Categorical.OneHotEncoding(@"blah", @"blah", outputKind: OneHotEncodingEstimator.OutputKind.Binary)

[image]

NOTE: the tail is still there with Indicator

NOTE2: the tail goes away when I drop the categorical field altogether, but I must also add .Append(mlContext.Transforms.NormalizeMinMax("Features", fixZero: false))

NOTE3: having Transforms.ReplaceMissingValues in my pipeline also causes the tail to appear

But even with this the R^2 and training results are still much worse than what I got with the 1.7.1 version.
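
For reference, the difference between the Indicator and Binary output kinds mentioned in the notes can be sketched without ML.NET. This is a conceptual illustration only: `OneHotSketch`, `Indicator`, and `Binary` are hypothetical names, and the exact slot layout that ML.NET produces for OutputKind.Binary may differ. The idea is that Indicator gives every category its own slot, while Binary writes the category index in base 2, so several categories share the same slots.

```csharp
using System;

public static class OneHotSketch
{
    // Indicator: one slot per category, with a single 1 at the category's index.
    public static float[] Indicator(int categoryIndex, int categoryCount)
    {
        var slots = new float[categoryCount];
        slots[categoryIndex] = 1f;
        return slots;
    }

    // Binary: the category index written in base 2, most significant bit first.
    // Assumption: the minimal number of bits is used; ML.NET's layout may differ.
    public static float[] Binary(int categoryIndex, int categoryCount)
    {
        int bits = 1;
        while ((1 << bits) < categoryCount)
        {
            bits++;
        }

        var slots = new float[bits];
        for (int i = 0; i < bits; i++)
        {
            int shift = bits - 1 - i;
            slots[i] = (categoryIndex >> shift) & 1;
        }
        return slots;
    }
}
```

With four categories, `OneHotSketch.Indicator(2, 4)` gives `[0, 0, 1, 0]` and `OneHotSketch.Binary(2, 4)` gives `[1, 0]`. A tree or linear trainer sees quite different inputs in the two cases, which may be part of why the fitted range changes with the output kind.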

IS THERE A WAY TO SEE WHAT PIPELINE WAS CREATED IN 1.7.1 WITH mlContext.Auto().CreateRegressionExperiment()?

LittleLittleCloud commented 1 year ago

Looks like you're using GridSearch for hyperparameter optimization and have disabled LightGbm as well? Can you try using the default tuner (by removing SetGridSearchTuner) instead?

In the meantime, you can still use the AutoML v1.0 API in AutoML v2.0, which basically inherits the configuration of AutoML 1.7.1 in featurizer and trainers. Can you also give it a try and see if performance improves?

> Now, after re-installing VS and adding ML.net, I no longer have the ability to edit notebooks (.ipynb).

@JakeRadMSFT will know better.

TT-Dev1 commented 1 year ago

Hello Jake and thank you for your answer.

> .SetGridSearchTuner()

GOOD EYE! Yes, that additional call does break everything and I was too quick in pasting the sample code.

I had already removed the call to SetGridSearchTuner() because nothing works with it. So I'm still without an answer.

NOTE: leaving in lightgbm causes lots of errors in my log...

"failed with exception Unable to load DLL 'lib_lightgbm': The specified module could not be found. (Exception from HRESULT: 0x8007007E)"

I read this old post, "Null reference exception when training #6470", about the dll but was not able to resolve that. Maybe it's a sign that something isn't set up correctly?

However, the group is still not as tight, and something still binds / limits the predicted range.

[image]

TT-Dev1 commented 1 year ago

> In the meantime, you can still use AutoML v1.0 API in AutoML v2.0, which basically inherit the configuration of AutoML 1.7.1 in featurizer and trainers. Can you also give it a try and see if performance improves?

When I just tried to use my 1.7.1 code with 2.0.1 I get an exception:

Exception thrown: 'System.AggregateException' in mscorlib.dll
System.AggregateException: One or more errors occurred. ---> System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.ML.AutoML.AutoMLExperiment.<RunAsync>d__26.MoveNext()
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Microsoft.ML.AutoML.AutoMLExperiment.Run()
   at Microsoft.ML.AutoML.RegressionExperiment.Execute(IDataView trainData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
   at TestML.BuildTrainEvaluateAndSaveModel(MLContext mlContext, String trainField, String dataInFname, String modelOutFname, String htmlChartFname, String logFname, UInt32 trainingSeconds, Boolean openResults, Boolean testPfi) in ..\TestML.cs:line 269
---> (Inner Exception #0) System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.ML.AutoML.AutoMLExperiment.<RunAsync>d__26.MoveNext()<---

My code is pretty straightforward but I do create the column information structure by hand. And it works in 1.7.1.

But I see no difference in the structure that is created automatically from a csv.

[image]

Here is the code / pseudo-code.

ColumnInformation columnInformation = new ColumnInformation();
columnInformation.TextColumnNames.Clear();
columnInformation.CategoricalColumnNames.Clear();

columnInformation.LabelColumnName = trainField;
columnInformation.ItemIdColumnName = "UID";

RemoveList(columnInformation, UNUSED);
// The function is essentially this...
// foreach (string colName in removeList)
// {
//     columnInformation.IgnoredColumnNames.Add(colName);
//     columnInformation.NumericColumnNames.Remove(colName);
// }

AddNumericalList(columnInformation, USED);
// The function is essentially this...
// foreach (string s in addList)
//     columnInformation.NumericColumnNames.Add(s);

var experimentSettings = new RegressionExperimentSettings();
experimentSettings.MaxExperimentTimeInSeconds = trainingSeconds;
experimentSettings.CacheDirectoryName = null;   // keep models in memory
experimentSettings.OptimizingMetric = RegressionMetric.RSquared;

// Create an experiment
RegressionExperiment experiment = mlContext.Auto().CreateRegressionExperiment(experimentSettings);

// Run the experiment -- THIS IS WHERE IT FAILS
ExperimentResult<RegressionMetrics> experimentResult = experiment.Execute(trainingDataView, columnInformation);

LittleLittleCloud commented 1 year ago

@TT-Dev1 Thanks for the reply. I'm definitely willing to help you figure out what's not working here, especially why it's not better than the old AutoML.

copy code from automl 1.7.1 not working

Looks like a dup of #6446 (https://github.com/dotnet/machinelearning/issues/6446). This issue has been fixed, but the fix hasn't been released to NuGet yet. You can try the nightly build though.

> NOTE: leaving in lightgbm causes lots of errors in my log... "failed with exception Unable to load DLL 'lib_lightgbm': The specified module could not be found. (Exception from HRESULT: 0x8007007E)"

Are you running on a linux/osx arm64 device? If so LightGbm won't be available on those platforms.

A few more questions:

Is your experiment running on AutoML 1.7.1 the same platform of experiment running AutoML 0.20.1?

TT-Dev1 commented 1 year ago

> Looks like a dup of #6446. This issue has been fixed but haven't released to nuget yet. You can try nightly build though.

Thanks VERY MUCH!!!! I will try and report back.

I can verify (like you said) that this bug has not been fixed in the Dec. 22, 2022 release.

[image]

> Are you running on a linux/osx arm64 device? If so LightGbm won't be available on those platforms.

No, Win10, Intel x64.

EDIT: is there a way to force the install or is there a place that I can look to find the .dll?

> Is your experiment running on AutoML 1.7.1 the same platform of experiment running AutoML 0.20.1?

Yes, all on the same box.

TT-Dev1 commented 1 year ago

> You can try nightly build though.

OK -- seems like I'm getting somewhere now. THANK YOU.

Trying the AutoML v1.0 API in AutoML v2.0 causes a new (or more specific) error with the current (3.0.0-dev.23110.1 / 0.21.0-dev.23110.1) build.

// AutoMLExperiment.cs, line 246 is the source of the null reference exception -- "tuner can't be null"
public async Task<TrialResult> RunAsync(CancellationToken ct = default)
            var tuner = serviceProvider.GetService<ITuner>();
            Contracts.Assert(tuner != null, "tuner can't be null");

            var parameter = tuner.Propose(trialSettings);   // <<< line 246

Now that I have the libraries, I can be much more efficient at debugging this. I should have done that from the beginning. ;)

EDIT: I can also test the v2.0 methods to see if the results have improved.

EDIT2: The v2.0 api still fails / skips LightGbm...

Exception thrown: 'System.DllNotFoundException' in Microsoft.ML.LightGbm.dll
An exception of type 'System.DllNotFoundException' occurred in Microsoft.ML.LightGbm.dll but was not handled in user code
Unable to load DLL 'lib_lightgbm': The specified module could not be found. (Exception from HRESULT: 0x8007007E)

I haven't yet found where lib_lightgbm comes from -- I built Microsoft.ML.LightGbm just fine as far as I can tell.

Can there be something strange in my environment causing BOTH of my issues?
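
One way to narrow down the DLL question is to check, from inside the failing process, whether the native library actually sits next to the executable and whether the process is 64-bit. HRESULT 0x8007007E is the Win32 "module not found" error, which is raised when the DLL itself or one of its own native dependencies cannot be located. The sketch below is a minimal diagnostic under assumptions: `NativeLibCheck` is a hypothetical helper, and the file name `lib_lightgbm.dll` is taken from the error message. It only looks in the application base directory, which is one of several places Windows searches.

```csharp
using System;
using System.IO;

public static class NativeLibCheck
{
    // Returns the full path of fileName if it exists directly in directory, otherwise null.
    public static string FindInDirectory(string directory, string fileName)
    {
        string candidate = Path.Combine(directory, fileName);
        return File.Exists(candidate) ? candidate : null;
    }

    // Prints the process bitness and whether lib_lightgbm.dll is in the base directory.
    public static void Report()
    {
        string baseDir = AppDomain.CurrentDomain.BaseDirectory;
        Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);
        Console.WriteLine("Base directory: " + baseDir);

        string found = FindInDirectory(baseDir, "lib_lightgbm.dll");
        Console.WriteLine(found ?? "lib_lightgbm.dll not found next to the executable");
    }
}
```

Calling `NativeLibCheck.Report()` just before the experiment starts shows whether the file was copied to the output folder at all. If it is present and the error persists, a missing dependency of the DLL would be the next thing to rule out.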

LittleLittleCloud commented 1 year ago

Hmmm are you sure you are on the latest nightly build? The most recent version should be 3.0.0-preview.23109.1

from this feed https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet-libraries/NuGet/Microsoft.ML/overview/3.0.0-preview.23109.1

TT-Dev1 commented 1 year ago

> Hmmm are you sure you are on the latest nightly build? The most recent version should be 3.0.0-preview.23109.1

Yes, I was one build ahead because I built the current source on that date -- but there were no code changes for a few days so we were on the same thing.

But I still have the problem.

So back to the project of determining....

ISSUE#1: Why can't I configure AutoML 2.0 to work as well as 1.7.1?

ISSUE#2: Why can't I run the 1.0 API with 2.0?

Some observations...

AutoML 1.7.1 -- 1 error in the log

|7 OnlineGradientDescentRegression -12.0358 19.26 1213.01 22.22 0.6
'IBVConnector.exe' (CLR v4.0.30319: IBVConnector.exe): Loaded 'C:\mltest\bin\Debug\Microsoft.ML.Mkl.Components.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
Exception during AutoML iteration: System.ArgumentOutOfRangeException: Input matrix was not positive-definite. Try using a larger L2 regularization weight.

But it works and comes up with a tight model.

TT-Dev1 commented 1 year ago

AutoML 3.0.0-dev.23124.1

(current Git @ 2023-02-24 / 8am)

The GOOD NEWS is that my ML1 code is now running to completion but still with the worse results, fewer trainings and some exceptions logged.

|1 Unknown=>ReplaceMissingValues=>OneHotEncoding=>Concatenate=>FastForestRegression 0.6575 3.83 24.61 4.96 1.0

|2 Unknown=>ReplaceMissingValues=>OneHotEncoding=>Concatenate=>FastForestRegression 0.8989 2.95 15.12 3.89 0.9

|3 Unknown=>ReplaceMissingValues=>OneHotHashEncoding=>Concatenate=>FastTreeRegression -135.2384 117.29 13818.06 117.55 1.2
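
For reading the rows above, it helps to recall how R² behaves. Assuming the first numeric column is R² (the optimizing metric in this experiment) and that the third and fourth columns are squared loss and its square root (the numbers are consistent with that, but the log has no header), a value like -135.2384 means the trial did far worse than simply predicting the mean label. The sketch below is a plain C# illustration of the formula R² = 1 - SS_res / SS_tot; `RSquaredSketch` is a hypothetical helper, not ML.NET's evaluator.

```csharp
using System;
using System.Linq;

public static class RSquaredSketch
{
    // R^2 = 1 - (sum of squared residuals) / (sum of squared deviations from the mean label).
    // Note: if all actual values are identical, the denominator is zero and the result is not finite.
    public static double RSquared(double[] actual, double[] predicted)
    {
        double mean = actual.Average();
        double ssRes = 0.0;
        double ssTot = 0.0;
        for (int i = 0; i < actual.Length; i++)
        {
            double residual = actual[i] - predicted[i];
            double deviation = actual[i] - mean;
            ssRes += residual * residual;
            ssTot += deviation * deviation;
        }
        return 1.0 - ssRes / ssTot;
    }
}
```

With labels `{1, 2, 3}`, predicting the mean `{2, 2, 2}` gives R² = 0, a perfect fit gives 1, and predictions that are each off by 10 (`{11, 12, 13}`) give 1 - 300/2 = -149. There is no lower bound, so a single badly scaled trial can produce a large negative value.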


**NOTE: if I add OneHotEncoding to my preFeaturizer then it takes a very long time.**

//mlContext.Transforms.Categorical.OneHotEncoding(@"PrMorph", @"PrMorph", outputKind: OneHotEncodingEstimator.OutputKind.Binary) takes a very long time to complete!


I believe that other tests were running but they were cancelled because of time.

Exception thrown: 'System.OperationCanceledException' in Microsoft.ML.Core.dll
An exception of type 'System.OperationCanceledException' occurred in Microsoft.ML.Core.dll but was not handled in user code
Operation was canceled.

PermutationFeatureImportance now fails with code that worked w/ 1.7.1

System.ArgumentNullException: The model provided does not have a compatible predictor
Parameter name: lastTransformer
   at Microsoft.ML.Runtime.Contracts.CheckValue[T](IExceptionContext ctx, T val, String paramName, String msg)
   at Microsoft.ML.PermutationFeatureImportanceExtensions.PermutationFeatureImportance[TMetric,TResult](IHostEnvironment env, ITransformer model, IDataView data, Func`1 resultInitializer, Func`2 evaluationFunc, Func`3 deltaFunc, Int32 permutationCount, Boolean useFeatureWeightFilter, Nullable`1 numberOfExamplesToUse)
   at Microsoft.ML.PermutationFeatureImportanceExtensions.PermutationFeatureImportance(RegressionCatalog catalog, ITransformer model, IDataView data, String labelColumnName, Boolean useFeatureWeightFilter, Nullable`1 numberOfExamplesToUse, Int32 permutationCount)

ImmutableDictionary<string, RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext.Regression
        .PermutationFeatureImportance(
            model,
            data,
            labelColumnName: trainField,
            useFeatureWeightFilter: false,
            numberOfExamplesToUse: null,
            permutationCount: 1);

An exception of type 'System.ArgumentOutOfRangeException' occurred in Microsoft.ML.Core.dll but was not handled in user code
__Features__ column 'Feature' not found
The thread 0x7864 has exited with code 0 (0x0).
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.ArgumentOutOfRangeException: __Features__ column 'Feature' not found
Parameter name: schema
   at Microsoft.ML.Data.RoleMappedSchema.MapFromNames(DataViewSchema schema, IEnumerable`1 roles, Boolean opt)

System.InvalidOperationException: Can't bind the IDataView column 'PrMorph' of type 'Vector<Single, 4>' to field or property 'PrMorph' of type 'System.String'.
   at Microsoft.ML.Data.TypedCursorable`1..ctor(IHostEnvironment env, IDataView data, Boolean ignoreMissingColumns, InternalSchemaDefinition schemaDefn)

So, I removed this column from the training and removed it from my preFeaturizer.

REMOVED: .Append(mlContext.Transforms.Categorical.OneHotEncoding(@"PrMorphINT", @"PrMorph", outputKind: OneHotEncodingEstimator.OutputKind.Bag)); // .Bag = BEST; .Indicator = clipped range; .Binary = loose

Still had the exception when trying to run the

Parameter name: schema
   at Microsoft.ML.Data.RoleMappedSchema.MapFromNames(DataViewSchema schema, IEnumerable`1 roles, Boolean opt)
   at Microsoft.ML.Data.RoleMappedSchema..ctor(DataViewSchema schema, IEnumerable`1 roles, Boolean opt)
   at Microsoft.ML.Data.GenericScorer.Bindings.Create(IHostEnvironment env, ISchemaBindableMapper bindable, DataViewSchema input, IEnumerable`1 roles, String suffix, Boolean user)
   at Microsoft.ML.Data.GenericScorer.Bindings.ApplyToSchema(IHostEnvironment env, DataViewSchema input)
   at Microsoft.ML.Data.GenericScorer..ctor(IHostEnvironment env, GenericScorer transform, IDataView data)
   at Microsoft.ML.Data.GenericScorer.ApplyToDataCore(IHostEnvironment env, IDataView newSource)
   at Microsoft.ML.Data.RowToRowScorerBase.ApplyToData(IHostEnvironment env, IDataView newSource)
   at Microsoft.ML.Data.PredictionTransformerBase`1.Transform(IDataView input)
   at Microsoft.ML.Transforms.PermutationFeatureImportance`3.GetImportanceMetricsMatrix(IHostEnvironment env, IPredictionTransformer`1 model, IDataView data, Func`1 resultInitializer, Func`2 evaluationFunc, Func`3 deltaFunc, String features, Int32 permutationCount, Boolean useFeatureWeightFilter, Nullable`1 topExamples)
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
   at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at System.Reflection.MethodBase.Invoke(Object obj, Object[] parameters)
   at Microsoft.ML.PermutationFeatureImportanceExtensions.PermutationFeatureImportance[TMetric,TResult](IHostEnvironment env, ITransformer model, IDataView data, Func`1 resultInitializer, Func`2 evaluationFunc, Func`3 deltaFunc, Int32 permutationCount, Boolean useFeatureWeightFilter, Nullable`1 numberOfExamplesToUse)
   at Microsoft.ML.PermutationFeatureImportanceExtensions.PermutationFeatureImportance(RegressionCatalog catalog, ITransformer model, IDataView data, String labelColumnName, Boolean useFeatureWeightFilter, Nullable`1 numberOfExamplesToUse, Int32 permutationCount)

PredictionEngine<ModelInput, ModelOutput> pe = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(trainedModel, trainingDataView.Schema);
ModelInput rec2Test = mlContext.Data.CreateEnumerable<ModelInput>(trainingDataView, reuseRowObject: false).First<ModelInput>();
ModelOutput mo = pe.Predict(rec2Test);
Debug.WriteLine($"==== results: {(float)rec2Test[trainField]} ||| {mo.Score}");

Hopefully, something that I posted here is helpful to point me in the right direction.

LittleLittleCloud commented 1 year ago

@TT-Dev1 Good to see that you can now use the AutoML v1.x API in AutoML2.*. There are a lot of possible reasons why it gives a worse result than AutoML 1.7, though. Some possibilities:

> The GOOD NEWS is that my ML1 code is now running to completion but still with the worse results, fewer trainings and some exceptions logged.

This might be because we use a larger search space in AutoML2.0, which has both pros and cons. A larger search space can give a better result if the budget is sufficient, but it also increases the risk of getting stuck in time-consuming configurations (for example, numberOfTree=32468 for fast forest takes a long time to train but doesn't necessarily give a better result). We are hoping to eliminate that effect with #6577. You can also provide a smaller search space using the AutoML2.0 API to work around that problem.

> if I add OneHotEncoding to my preFeaturizer then it takes a very long time. //mlContext.Transforms.Categorical.OneHotEncoding(@"PrMorph", @"PrMorph", outputKind: OneHotEncodingEstimator.OutputKind.Binary) takes a very long time to complete!

What is PrMorph? Is that a text column? One thing to note is that the AutoML1.* API also applies a featurizer to your dataset. In most cases, OneHotEncoding is not time-consuming, but TextFeaturizer is. So if PrMorph is a text column and it's inferred as text instead of category, it's very likely to add a lot of training time.
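
A rough sense of why a column inferred as text can cost more than a categorical one: one-hot encoding yields one slot per distinct value, while n-gram based text featurization yields a slot per distinct n-gram. The sketch below counts both for a toy column. `FeatureCountSketch` and its methods are hypothetical helpers, and character trigrams are only an assumed stand-in for what a text featurizer extracts; the actual TextFeaturizer configuration used by AutoML is not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FeatureCountSketch
{
    // One-hot encoding: one slot per distinct value in the column.
    public static int OneHotSlots(IEnumerable<string> column)
    {
        return column.Distinct().Count();
    }

    // Character trigrams: one slot per distinct 3-character substring across the column.
    public static int CharTrigramSlots(IEnumerable<string> column)
    {
        var grams = new HashSet<string>();
        foreach (string value in column)
        {
            for (int i = 0; i + 3 <= value.Length; i++)
            {
                grams.Add(value.Substring(i, 3));
            }
        }
        return grams.Count;
    }
}
```

For the column `{"alpha", "beta", "alpha", "gamma"}` this gives 3 one-hot slots but 8 trigram slots, and the gap widens quickly with longer or more varied strings. More slots means wider feature vectors for every trainer in every trial.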

> PermutationFeatureImportance now fails with code that worked w/ 1.7.1

The error indicates that it fails to find a trainer (one of fasttree|sdca|lbfgs|lgbm) in your model, which is strange. Can you share around 100 lines of your dataset so I can try to reproduce the error?

LittleLittleCloud commented 1 year ago

BTW, if you are also on Discord, feel free to ping me (BigMiao#1789). I'm happy to see what I can do to help you improve training performance.

random-namespace commented 1 year ago

Hey, I've experienced the same issue, though I stopped maintaining my ML code from the days when it used to attempt to predict tails instead of this. Basically, ML.NET regression gives up on being ML as soon as it hits a training boundary. But this isn't practical, as any time-based, geometric, biological, or compounding model necessarily lives on a boundary.
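
The boundary behaviour can be reproduced with a toy example. Tree-based regressors (the logs in this thread show FastForestRegression and FastTreeRegression) predict from values stored in their leaves, so they cannot follow a trend past the training range, whereas a linear model can. The sketch below is a hand-rolled illustration under assumptions: `BoundarySketch`, `FitStump`, and `FitLine` are hypothetical helpers, and a single-split stump stands in for a full tree ensemble. It is not how ML.NET trains its models.

```csharp
using System;
using System.Linq;

public static class BoundarySketch
{
    // Depth-1 regression tree: split at threshold, predict the mean label on each side.
    public static Func<double, double> FitStump(double[] x, double[] y, double threshold)
    {
        double left = y.Where((label, i) => x[i] <= threshold).Average();
        double right = y.Where((label, i) => x[i] > threshold).Average();
        return v => v <= threshold ? left : right;
    }

    // Ordinary least squares line y = a + b * x.
    public static Func<double, double> FitLine(double[] x, double[] y)
    {
        double meanX = x.Average();
        double meanY = y.Average();
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < x.Length; i++)
        {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double b = sxy / sxx;
        double a = meanY - b * meanX;
        return v => a + b * v;
    }
}
```

Trained on x = `{1, 2, 3, 4}` and y = `{2, 4, 6, 8}` with a split at 2.5, the stump predicts 7 for x = 100 (the mean of the right-hand labels), while the fitted line predicts 200. The stump's output never leaves the range of the training labels, which matches the clipped look of the plots described above.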

Please don't take the conversation to Discord; I've been following it.