dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International

Categorical prediction handled as non-categorical. #2763

Open Elantonio opened 1 year ago

Elantonio commented 1 year ago

Scenario: Value Prediction

Data with a categorical column to predict (values -1, 0, 1).

Evaluate gives a prediction of -0.01!? (image)

The prediction should be -1, 0 or 1!

LittleLittleCloud commented 1 year ago

Can you try the "Data Classification" scenario? You are using a regression model, which is why you get a numeric result.

Elantonio commented 1 year ago

> Can you try "Data Classification" scenario? You are playing with Regression model and that's why you get a numeric result

Since it is possible to mark the Label column as Categorical, I assumed that had some use. So I marked the Label column as categorical, assuming it would then only choose from the 3 available values (taking the one closest to the regression result). I could round the result and get the category that way. But then what is the use of being able to mark the Label column as Categorical?
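
The rounding workaround mentioned above can be sketched roughly as follows. This is only an illustration of post-processing a regression score, not something Model Builder or ML.NET is confirmed to do; `LabelSnapper` and `SnapToNearestLabel` are hypothetical names, and the label set -1, 0, 1 is taken from the issue description.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of ML.NET): maps a continuous regression
// score onto the closest value from a fixed set of allowed labels.
public static class LabelSnapper
{
    // Returns the entry of `labels` closest to `score`.
    // On a tie, the label that appears first in `labels` wins.
    public static double SnapToNearestLabel(double score, IReadOnlyList<double> labels)
    {
        if (labels == null || labels.Count == 0)
            throw new ArgumentException("At least one label is required.", nameof(labels));

        double best = labels[0];
        foreach (double label in labels)
        {
            // Strict comparison keeps the earlier label when distances are equal.
            if (Math.Abs(label - score) < Math.Abs(best - score))
                best = label;
        }
        return best;
    }
}
```

With the labels from this issue, `LabelSnapper.SnapToNearestLabel(-0.01, new double[] { -1, 0, 1 })` returns 0. Snapping treats the labels as ordered numbers, which may not suit every categorical target; the Data Classification scenario suggested in this thread avoids that assumption.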

LittleLittleCloud commented 1 year ago

Sorry, maybe I didn't make myself clear. What I mean is that you can choose the Data Classification card on the first Scenario tab (see the picture below). Based on the screenshot you shared with us, it looks like you picked the Value Prediction card, which uses a regression model to fit a numeric label; that is why you are getting a non-categorical prediction value.

image

Elantonio commented 1 year ago

Hi,

You said it very clearly. The point I was making is that, by my logic, in regression a value marked as categorical and having only 3 values should be predicted as categorical, and thus as one of the 3 values (taking the values available in the training set as the categories). Otherwise there is no use/effect in declaring the predicted column as a category. And if it is of no use/effect, it should not be possible to set the predicted column as a category.

Thanks for your effort and thinking with me!

Best Rgds, Ton

Elantonio commented 1 year ago

Sorry to bother you, but do you know where I can report a bug?

BUG:

    at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
    at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
    at System.Threading.Tasks.Task`1.get_Result()
    at Microsoft.ML.AutoML.AutoMLExperiment.Run()
    at Microsoft.ML.AutoML.RegressionExperiment.Execute(IDataView trainData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
    at Microsoft.ML.AutoML.RegressionExperiment.Execute(IDataView trainData, String labelColumnName, String samplingKeyColumn, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
    at cAlgo.Predictor.<>c__DisplayClass12_0.b1() in C:\Users\Ton\Documents\MLPredict\MLPredict\Predictor.cs:line 94
    at System.Threading.Tasks.Task`1.InnerInvoke()
    at System.Threading.Tasks.Task.<>c.<.cctor>b271_0(Object obj)
    at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)

Source:

    public async Task TrainAsync()
    {
        IsTrained = false;
        // Extract header row
        try
        {
            var FileSizeMB = (double)(new FileInfo(TrainData.DataPath).Length) / 1024 / 1024;
            var TrainTimeInSeconds = (uint)Math.Min(10, (FileSizeMB * FileSizeMB * TrainTimeFactor / 100 * 60).RoundUp(0));

            ColumnInferenceResults columnInference =
                MyContext.Auto().InferColumns(TrainData.DataPath, labelColumnName: LabelName, groupColumns: false);
            foreach (var name in TrainData.CategoricalColumnNames)
            {
                columnInference.ColumnInformation.NumericColumnNames.Remove(name);
                columnInference.ColumnInformation.CategoricalColumnNames.Add(name);
            }

            // Create IDataView from data
            TextLoader loader = MyContext.Data.CreateTextLoader(columnInference.TextLoaderOptions);
            IDataView DataView = loader.Load(TrainData.DataPath);
            DataView = MyContext.Data.ShuffleRows(DataView);

            // Define experiment settings
            var experimentSettings = new RegressionExperimentSettings
            {
                MaxExperimentTimeInSeconds = TrainTimeInSeconds,
                OptimizingMetric = RegressionMetric.RSquared,
                CacheBeforeTrainer = CacheBeforeTrainer.Auto
            };

            // Create experiment
            var experiment = MyContext.Auto().CreateRegressionExperiment(experimentSettings);
            var progressHandler = new Progress<RunDetail<RegressionMetrics>>(p =>
            {
                DebugWrite($"Current result - TrainerName: {p.TrainerName}, RuntimeInSeconds: {p.RuntimeInSeconds}, ValidationMetrics: {p.ValidationMetrics}");
            });

            // Run experiment
            // ERRORLINE (Predictor.cs line 94):
            var result = Task.Run(() => experiment.Execute(DataView, labelColumnName: LabelName, progressHandler: progressHandler));

            // Get best model
            var model = result.Result.BestRun.Model;
            RSquared = result.Result.BestRun.ValidationMetrics.RSquared;

            // Create prediction engine
            PredictionEngine = MyContext.Model.CreatePredictionEngine<dynamic, ModelOutput>(model);
            IsTrained = true;
        }
        catch (Exception ex) { System.Diagnostics.Debug.WriteLine(ex.Message + " " + ex.StackTrace.ToString()); }
    }

LittleLittleCloud commented 1 year ago

You can report it here. What's the exception message you get? Is it related to a non-numeric label value for regression trainers?
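
The stack trace posted above passes through `Task`1.get_Result()`, and the posted `TrainAsync` also reads `result.Result`. `Task.Result` reports failures wrapped in an `AggregateException`, so a handler that logs only the outer exception may not name the root cause clearly. A minimal sketch of surfacing the innermost exception, assuming a hypothetical helper class `ExceptionUnwrapper` that is not part of the original code:

```csharp
using System;

// Hypothetical helper (not from the original project): digs through
// AggregateException wrappers to reach the underlying exception.
public static class ExceptionUnwrapper
{
    // Follows InnerException while the current exception is an
    // AggregateException, and returns the first non-aggregate exception.
    public static Exception Unwrap(Exception ex)
    {
        while (ex is AggregateException aggregate && aggregate.InnerException != null)
        {
            ex = aggregate.InnerException;
        }
        return ex;
    }

    // Formats the root cause as "TypeName: message" for logging.
    public static string Describe(Exception ex)
    {
        Exception root = Unwrap(ex);
        return $"{root.GetType().Name}: {root.Message}";
    }
}
```

In a catch block like the one in `TrainAsync`, logging `ExceptionUnwrapper.Describe(ex)` together with the stack trace would put the underlying exception type and message in the log in one line.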

Elantonio commented 1 year ago

No, it is not related to a non-numeric label value for regression trainers; I would not have perceived that as a bug.

Sorry, I forgot to mention the exception (stupid of me). Here it is: NullReferenceException: Object reference not set to an instance of an object.

Best rgds, Ton

LittleLittleCloud commented 1 year ago

That bug sounds so familiar to me. Could it be similar to this issue? https://github.com/dotnet/machinelearning/issues/6558

Elantonio commented 1 year ago

There are similarities:

In my case, too, running the experiment from Model Builder on the CSV works OK.

There are differences:

I’m running ML 2.0.1 and AutoML 0.20.1.

Nevertheless, I tried the fix (specifying TrainSet & TestSet), and now...

IT WORKS!!

Thanks a lot! Although it remains a bug ;-)

Best rgds,

Ton

LittleLittleCloud commented 1 year ago

Cool, glad to see you figured this out!

Elantonio commented 1 year ago

Hi Xiaoyun,

I was glad it worked too, but a little later it went haywire again!

    System.NullReferenceException
    HResult=0x80004003
    Message=Object reference not set to an instance of an object.
    Source=Microsoft.ML.AutoML
    StackTrace:
    at Microsoft.ML.AutoML.SweepablePipeline..ctor(Dictionary`2 estimators, Entity schema, String currentSchema)
    at Microsoft.ML.AutoML.SweepablePipeline.AppendEntity(Boolean allowSkip, Entity entity)
    at Microsoft.ML.AutoML.RegressionExperiment.CreateRegressionPipeline(IDataView trainData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer)
    at Microsoft.ML.AutoML.RegressionExperiment.Execute(IDataView trainData, IDataView validationData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
    at Microsoft.ML.AutoML.RegressionExperiment.Execute(IDataView trainData, IDataView validationData, String labelColumnName, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
    at cAlgo.Predictor.d__16.MoveNext() in C:\Users\Ton\Documents\Sources\MLPredict\MLPredict\Predictor.cs:line 175

on the following code:

    public async Task<(double Rsquared, bool IsTrained, PredictionEngine<ExpandoObject, ModelOutput> PredictionEngine)> Train2Async()
    {
        try
        {
            // Extract header row
            var FileSizeMB = (double)(new FileInfo(TrainData.DataPath).Length) / 1024 / 1024;
            var TrainTimeInSeconds = (uint)Math.Max(10, Math.Sqrt(FileSizeMB) * TrainTimeFactor * 100).RoundUp(0);
            var MyContext = new MLContext();

            ColumnInferenceResults columnInference =
                MyContext.Auto().InferColumns(TrainData.DataPath, labelColumnName: LabelName, groupColumns: false);
            foreach (var name in TrainData.CategoricalColumnNames)
            {
                columnInference.ColumnInformation.NumericColumnNames.Remove(name);
                columnInference.ColumnInformation.CategoricalColumnNames.Add(name);
            }

            // Load data
            var data = MakeExpando();

            // Convert data to IDataView
            var dataView = MyContext.Data.LoadFromEnumerable(data);

            // Split data into training and test sets
            var trainTestSplit = MyContext.Data.TrainTestSplit(dataView);

            // Define experiment settings
            var experimentSettings = new RegressionExperimentSettings
            {
                MaxExperimentTimeInSeconds = 60,
                OptimizingMetric = RegressionMetric.RSquared,
                CacheBeforeTrainer = CacheBeforeTrainer.Auto
            };

            // Create experiment
            var experiment = MyContext.Auto().CreateRegressionExperiment(experimentSettings);

            // Run experiment
            // ERRORLINE (Predictor.cs line 175):
            var result = experiment.Execute(trainTestSplit.TrainSet, trainTestSplit.TestSet);

            // Get best model
            var model = result.BestRun.Model;
            RSquared = result.BestRun.ValidationMetrics.RSquared;//.result.Result.BestRun.ValidationMetrics.RSquared;

            // Get feature importance
            var featureImportance = MyContext.Regression.PermutationFeatureImportance(model, trainTestSplit.TestSet);
            var featureImportanceValues = featureImportance.Select(x => x.Value.RSquared.Mean).ToArray();
            var featureNames = data.First().Select(x => x.Key).ToArray();

            // Print feature importance
            for (int i = 0; i < featureNames.Length; i++)
            {
                Console.WriteLine($"{featureNames[i]}: {featureImportanceValues[i]:0.00}");
            }

            // Create prediction engine
            var predictionEngine = MyContext.Model.CreatePredictionEngine<ExpandoObject, ModelOutput>(model);
            return (RSquared, IsTrained, PredictionEngine);
        }
        catch (Exception ex) { Lib.LogPrint(ThisRobot, ex.Message + " " + ex.StackTrace.ToString()); }
        return (double.NaN, false, null);
    }

    List<ExpandoObject> MakeExpando()
    {
        var data = new List<ExpandoObject>();

        // Read header row
        var header = TrainData.HeaderCollection;

        // Read data rows
        foreach (var row in TrainData.Rows)
        {
            var line = row.ToArray();
            dynamic dataPoint = new ExpandoObject();
            for (int i = 0; i < header.Length; i++)
            {
                ((IDictionary<string, object>)dataPoint)[header[i]] = (float)line[i];
            }
            data.Add(dataPoint);
        }
        return data;
    }

What am I doing wrong here? I just want to do an AutoML run on a dataset that has no defined columns at instantiation.

Only after loading the CSV are the columns and their headers known (similar to mbconfig).

Best Rgds,

Ton

LittleLittleCloud commented 1 year ago

@Elantonio Would you still get the error if you remove the following lines?

    foreach (var name in TrainData.CategoricalColumnNames)
    {
        columnInference.ColumnInformation.NumericColumnNames.Remove(name);
        columnInference.ColumnInformation.CategoricalColumnNames.Add(name);
    }

Elantonio commented 1 year ago

Hi Xiaoyun,

Sorry, to no avail; the error is still the same:

    System.NullReferenceException
    HResult=0x80004003
    Message=Object reference not set to an instance of an object.
    Source=Microsoft.ML.AutoML
    StackTrace:
    at Microsoft.ML.AutoML.SweepablePipeline..ctor(Dictionary`2 estimators, Entity schema, String currentSchema)
    at Microsoft.ML.AutoML.SweepablePipeline.AppendEntity(Boolean allowSkip, Entity entity)
    at Microsoft.ML.AutoML.RegressionExperiment.CreateRegressionPipeline(IDataView trainData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer)
    at Microsoft.ML.AutoML.RegressionExperiment.Execute(IDataView trainData, IDataView validationData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
    at Microsoft.ML.AutoML.RegressionExperiment.Execute(IDataView trainData, IDataView validationData, String labelColumnName, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)

Best rgds, Ton.

PS: I feel I’m missing a clue on how to train on dynamic input, reuse its PredictionEngine, and get column-importance info. Do you know of a sample where this has been done before?
