dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Setting CategoricalColumnNames is not Actually Doing Anything #6797

Open superichmann opened 11 months ago

superichmann commented 11 months ago

Describe the bug: ColumnInformation.CategoricalColumnNames is supposed to instruct AutoML to treat specific columns as categorical, which in turn should improve the training score.

A couple of months ago I ran a test that added columns to CategoricalColumnNames, using some prerelease version from dotnet-libraries. The training result was actually better and the training time was much longer.

Today I re-tested this, and the training time and score are exactly the same as when CategoricalColumnNames is not set.

To Reproduce:
1. Download the code, ColumnInformationDoesNotWork.txt, and change its extension to .ipynb.
2. Open it with VS Code.
3. Download train.csv from here.
4. Run the notebook.
5. See that the scores are similar both with and without CI.

Expected behavior: When CI is added, training should handle the data differently and produce a better score, with a longer training time.

Screenshots, Code, Sample Projects: ColumnInformationDoesNotWork.txt

Additional context: I might be missing something in the parameter initialization of the process; if so, please tell me exactly what to set.

Code Snippets

using Microsoft.ML;
using Microsoft.ML.AutoML;

// Uses mlContext and train, which are defined elsewhere in the notebook.

// Run 1: FastForest only, with "family" and "store_nbr" marked as categorical.
var set = new RegressionExperimentSettings();
set.MaxExperimentTimeInSeconds = 1; // Maxmodels bypass
set.Trainers.Clear();
set.Trainers.Add(RegressionTrainer.FastForest);
RegressionExperiment experiment = mlContext.Auto().CreateRegressionExperiment(set);
ColumnInformation CI = new ColumnInformation();
CI.CategoricalColumnNames.Add("family");
CI.CategoricalColumnNames.Add("store_nbr");
var x1 = experiment.Execute(train, CI);
var score1 = x1.BestRun.ValidationMetrics.RSquared;
Console.WriteLine("Result with categoricals definitions: " + score1);

// Run 2: identical settings, but with the categorical column list cleared.
var no = new RegressionExperimentSettings();
no.MaxExperimentTimeInSeconds = 1; // Maxmodels bypass
no.Trainers.Clear();
no.Trainers.Add(RegressionTrainer.FastForest);
CI.CategoricalColumnNames.Clear();
RegressionExperiment experiment2 = mlContext.Auto().CreateRegressionExperiment(no);
var x2 = experiment2.Execute(train, CI);
var score2 = x2.BestRun.ValidationMetrics.RSquared;
Console.WriteLine("Result without categoricals definitions: " + score2);

Thanks!

superichmann commented 10 months ago

[image] Microsoft just writes that it's the best ML product and then ignores any issues with it... nice