System Information (please complete the following information):
OS & Version: Windows 10
ML.NET Version: Microsoft.ML, 2.0.1
Microsoft.ML.AutoML, 0.20.1
Also reproduces on the latest prerelease from dotnet-libraries.
.NET Version: 7.0
Describe the bug
ColumnInformation.CategoricalColumnNames is supposed to instruct AutoML to treat specific columns as categorical, which in turn should improve the training score.
A couple of months ago I tested adding columns to CategoricalColumnNames with a prerelease version from dotnet-libraries: the training result was actually better and the training time was much longer.
Today I re-tested this, and the training time and score are exactly the same as when CategoricalColumnNames is not set.
To Reproduce
Download the code, ColumnInformationDoesNotWork.txt (change the extension to .ipynb).
Open it with VS Code.
Download train.csv from here.
Run the notebook.
Observe that the scores are similar both with and without CI.
Expected behavior
When CI is set, training should handle the data differently and produce a better score, with a longer training time.
Additional context
I might be missing something in the parameter initialization of the process; if so, please tell me exactly what to set.
Code Snippets
var set = new RegressionExperimentSettings();
set.MaxExperimentTimeInSeconds = 1; // Maxmodels bypass
set.Trainers.Clear();
set.Trainers.Add(RegressionTrainer.FastForest);
RegressionExperiment experiment = mlContext.Auto().CreateRegressionExperiment(set);
ColumnInformation CI = new ColumnInformation();
CI.CategoricalColumnNames.Add("family");
CI.CategoricalColumnNames.Add("store_nbr");
var x1 = experiment.Execute(train,CI);
var score1 = x1.BestRun.ValidationMetrics.RSquared;
Console.WriteLine("Result with categoricals definitions: " + score1);
var no = new RegressionExperimentSettings();
no.MaxExperimentTimeInSeconds = 1; // Maxmodels bypass
no.Trainers.Clear();
no.Trainers.Add(RegressionTrainer.FastForest);
CI.CategoricalColumnNames.Clear();
RegressionExperiment experiment2 = mlContext.Auto().CreateRegressionExperiment(no);
var x2 = experiment2.Execute(train,CI);
var score2 = x2.BestRun.ValidationMetrics.RSquared;
Console.WriteLine("Result without categoricals definitions: " + score2);
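To compare the two runs on more than RSquared alone, a small reporting helper could print the trainer name and runtime next to the score. This is only a sketch: Report is a hypothetical helper that is not in the attached notebook, it assumes x1 and x2 from the snippet above, and the exact text AutoML puts in TrainerName may differ between versions.

```csharp
using System;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

// Hypothetical helper (not part of the original notebook): prints the
// fields of a run that should differ if the categorical hints had any effect.
static void Report(string label, RunDetail<RegressionMetrics> run)
{
    Console.WriteLine(
        $"{label}: trainer={run.TrainerName}, " +
        $"RSquared={run.ValidationMetrics.RSquared:F4}, " +
        $"runtime={run.RuntimeInSeconds:F2}s");
}

// Usage, assuming x1 and x2 are the experiment results from the snippet above:
// Report("With categoricals", x1.BestRun);
// Report("Without categoricals", x2.BestRun);
```

If both lines print the same trainer, score, and roughly the same runtime, that would support the observation that the CategoricalColumnNames setting is not changing the pipeline.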
Thanks!