dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

Incorrect metrics when the order of labels does not correspond to the indices in multiclassification #3773

Closed: drake7707 closed this issue 4 years ago

drake7707 commented 5 years ago

System information

Issue

Confusion matrix when I add the samples with labels in ascending order (0 -> n), e.g. 0,1,2,3,4,0,1,2,3,4,... This is the correct evaluation:

          ||========================================================================================
PREDICTED ||     0 |     1 |     2 |     3 |     4 |     5 |     6 |     7 |     8 |     9 |    10 | Recall
TRUTH     ||========================================================================================
        0 ||    58 |     0 |     1 |    19 |     0 |     0 |     1 |     1 |     2 |     0 |     2 | 0,6905
        1 ||     0 |    79 |     0 |     0 |     0 |     5 |     0 |     0 |     0 |     0 |     0 | 0,9405
        2 ||     0 |     0 |    79 |     3 |     0 |     0 |     0 |     0 |     2 |     0 |     0 | 0,9405
        3 ||     0 |     0 |     0 |    84 |     0 |     0 |     0 |     0 |     0 |     0 |     0 | 1,0000
        4 ||     0 |     0 |     0 |     0 |    81 |     0 |     0 |     0 |     3 |     0 |     0 | 0,9643
        5 ||     0 |     8 |     0 |     0 |     0 |    71 |     5 |     0 |     0 |     0 |     0 | 0,8452
        6 ||     0 |     0 |     0 |     0 |     0 |     0 |    84 |     0 |     0 |     0 |     0 | 1,0000
        7 ||     0 |     0 |     0 |     0 |     0 |     2 |     0 |    82 |     0 |     0 |     0 | 0,9762
        8 ||     0 |     0 |     0 |     8 |     0 |     0 |     0 |     1 |    72 |     0 |     3 | 0,8571
        9 ||     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |    84 |     0 | 1,0000
       10 ||     0 |     0 |     0 |     0 |     0 |     2 |     1 |     0 |     0 |     0 |    81 | 0,9643
          ||========================================================================================
Precision ||1,0000 |0,9080 |0,9875 |0,7368 |1,0000 |0,8875 |0,9231 |0,9762 |0,9114 |1,0000 |0,9419 |

************************************************************

Now when I do OrderByDescending to reverse the labels and run it again, I get:

Confusion matrix when I reverse the labels (n -> 0) of the samples (e.g. 4,3,2,1,0,4,3,2,1,0,...):

          ||========================================================================================
PREDICTED ||     0 |     1 |     2 |     3 |     4 |     5 |     6 |     7 |     8 |     9 |    10 | Recall
TRUTH     ||========================================================================================
        0 ||     3 |     0 |     3 |     1 |     0 |     6 |     0 |    14 |     0 |     0 |    57 | 0,0357
        1 ||     0 |     0 |     0 |     0 |     0 |     2 |     0 |     0 |     0 |    82 |     0 | 0,0000
        2 ||     0 |     0 |     3 |     0 |     0 |     0 |     0 |     8 |    73 |     0 |     0 | 0,0357
        3 ||     0 |     0 |     0 |     0 |     0 |     0 |     0 |    84 |     0 |     0 |     0 | 0,0000
        4 ||     0 |     0 |     2 |     0 |     0 |     0 |    82 |     0 |     0 |     0 |     0 | 0,0000
        5 ||     0 |     0 |     0 |     0 |     3 |    74 |     0 |     0 |     0 |     7 |     0 | 0,8810
        6 ||     0 |     0 |     0 |     0 |    83 |     0 |     0 |     0 |     0 |     0 |     1 | 0,0000
        7 ||     0 |     0 |     0 |    84 |     0 |     0 |     0 |     0 |     0 |     0 |     0 | 0,0000
        8 ||     2 |     0 |    76 |     0 |     0 |     0 |     0 |     6 |     0 |     0 |     0 | 0,0000
        9 ||     0 |    84 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 | 0,0000
       10 ||    84 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 |     0 | 0,0000
          ||========================================================================================
Precision ||0,0337 |0,0000 |0,0357 |0,0000 |0,0000 |0,9024 |0,0000 |0,0000 |0,0000 |0,0000 |0,0000 |

I think there is an expectation somewhere that the label == the label index.
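
A minimal, self-contained sketch of what that expectation looks like in practice (the Row type and data are hypothetical). With the default ByOccurrence key ordinality, MapValueToKey assigns key 1 to whichever label value it sees first, so reversing the row order silently remaps every label:

    using System;
    using Microsoft.ML;

    public class Row
    {
        public float Label { get; set; }
    }

    public static class KeyOrderDemo
    {
        public static void Main()
        {
            var ml = new MLContext();

            // Same label values, opposite row order.
            var ascending = ml.Data.LoadFromEnumerable(new[]
            {
                new Row { Label = 0 }, new Row { Label = 1 }, new Row { Label = 2 }
            });
            var descending = ml.Data.LoadFromEnumerable(new[]
            {
                new Row { Label = 2 }, new Row { Label = 1 }, new Row { Label = 0 }
            });

            // Default keyOrdinality is ByOccurrence: key 1 is the first value seen.
            var pipeline = ml.Transforms.Conversion.MapValueToKey("Key", "Label");

            foreach (var view in new[] { ascending, descending })
            {
                var mapped = pipeline.Fit(view).Transform(view);
                using var cursor = mapped.GetRowCursor(mapped.Schema);
                var getter = cursor.GetGetter<uint>(mapped.Schema["Key"]);
                uint key = 0;
                while (cursor.MoveNext())
                {
                    getter(ref key);
                    Console.Write(key + " ");
                }
                Console.WriteLine();
            }
            // Both views print "1 2 3": key 1 means label 0 in the first view but
            // label 2 in the second, which is exactly the shift visible in the
            // confusion matrices above.
        }
    }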

Source code / logs

    var trainingDataView = mlContext.Data.LoadFromEnumerable(trainingDataArray, schemaDef);
    var testDataView = mlContext.Data.LoadFromEnumerable(testDataArray, schemaDef);

    var featureNames = typeof(RecordFeatures).GetProperties()
        .Where(p => p.Name != nameof(RecordFeatures.Label))
        .Select(p => p.Name)
        .ToArray();

    // Label is mapped to a key column with the default key ordinality
    // (ByOccurrence), so key indices follow the order the labels appear in.
    var dataProcessPipeline = mlContext.Transforms.Conversion
        .MapValueToKey(outputColumnName: "KeyColumn", inputColumnName: nameof(RecordFeatures.Label))
        .Append(mlContext.Transforms.Concatenate("Features", featureNames))
        .AppendCacheCheckpoint(mlContext);

    var trainer = mlContext.MulticlassClassification.Trainers.LightGbm(
        labelColumnName: "KeyColumn", featureColumnName: "Features");

    var trainingPipeline = dataProcessPipeline.Append(trainer);

    Console.WriteLine("=============== Training the model ===============");
    var trainedModel = trainingPipeline.Fit(trainingDataView);

    Console.WriteLine("===== Evaluating Model's accuracy with Test data =====");
    var predictions = trainedModel.Transform(testDataView);
    // Note: evaluation is done against the raw "Label" column here.
    var metrics = mlContext.MulticlassClassification.Evaluate(predictions, "Label", "Score");

    PrintMultiClassClassificationMetrics(trainer.ToString(), metrics);

PeterPann23 commented 5 years ago

Hi Drake,

The key mapping follows the order in which the classes are read into memory. When you configure the pipeline, try:

    var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey(
        outputColumnName: "KeyColumn",
        inputColumnName: nameof(RecordFeatures.Label),
        keyOrdinality: ValueToKeyMappingEstimator.KeyOrdinality.ByValue);

Have a look at ValueToKeyMappingEstimator; its KeyOrdinality has two options, ByOccurrence (the default) and ByValue.

This is related to issue #3769
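
A fuller sketch of the suggested fix, continuing from the variables in the original post (mlContext, featureNames, trainingDataView, testDataView):

    // Requires: using Microsoft.ML; using Microsoft.ML.Transforms;

    // ByValue sorts the key mapping by label value instead of first occurrence,
    // so the key indices no longer depend on the order of the rows.
    var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey(
            outputColumnName: "KeyColumn",
            inputColumnName: nameof(RecordFeatures.Label),
            keyOrdinality: ValueToKeyMappingEstimator.KeyOrdinality.ByValue)
        .Append(mlContext.Transforms.Concatenate("Features", featureNames))
        .AppendCacheCheckpoint(mlContext);

    var trainer = mlContext.MulticlassClassification.Trainers.LightGbm(
        labelColumnName: "KeyColumn", featureColumnName: "Features");
    var model = dataProcessPipeline.Append(trainer).Fit(trainingDataView);

    // Evaluating against the key-typed column the trainer was given, rather than
    // the raw "Label" column, keeps the metrics consistent with the key mapping.
    var metrics = mlContext.MulticlassClassification.Evaluate(
        model.Transform(testDataView),
        labelColumnName: "KeyColumn",
        scoreColumnName: "Score");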

wschin commented 5 years ago

The predicted label of a multi-class classifier is the ordinal of your label's index, not the original label value. For example, if it produces 0, it means the first label.
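
A self-contained sketch of what that ordinal means (the Item type and data are hypothetical): every key column carries a KeyValues annotation that maps the ordinal back to the original value, and MapKeyToValue applies the same lookup as a transform:

    using System;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    public class Item
    {
        public string Label { get; set; }
    }

    public static class KeyValuesDemo
    {
        public static void Main()
        {
            var ml = new MLContext();
            var data = ml.Data.LoadFromEnumerable(new[]
            {
                new Item { Label = "cat" }, new Item { Label = "dog" }, new Item { Label = "bird" }
            });

            var mapped = ml.Transforms.Conversion.MapValueToKey("Key", "Label")
                .Fit(data).Transform(data);

            // Read the KeyValues annotation: ordinal -> original label value.
            VBuffer<ReadOnlyMemory<char>> names = default;
            mapped.Schema["Key"].GetKeyValues(ref names);
            for (int i = 0; i < names.Length; i++)
                Console.WriteLine($"key {i + 1} -> {names.GetItemOrDefault(i)}");
            // With the default ByOccurrence ordering this prints:
            // key 1 -> cat, key 2 -> dog, key 3 -> bird
        }
    }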

PeterPann23 commented 5 years ago

Well, not really; it depends on the sort order provided, and that sort needs to be looked at. Option 1 sorts based on the order the data was loaded; option 2 sorts based on the ASCII alphabet.

Neither is that helpful if the label is an annotation whose values have meaning, like a rating, since a miss is a miss on a scale like "Disaster", "Really bad", "Bad", "Okay", "Good", "Really good", "Awesome" (see the sketch after this comment).

When looking at the confusion matrix, it would not be so bad if the scale only tips to a neighbor, but it is really bad if there is no visual pattern in the predictions. Note that the confusion matrix is all about projecting patterns, and they are hard to see if the classes are in a "random" order.
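
For an ordered scale like that, one option (a sketch; the column name and values are just the ones from the example above) is to hand MapValueToKey an explicit keyData view, so the key order follows the scale rather than occurrence or alphabetical order:

    // Requires: using Microsoft.ML;
    public class LabelValue
    {
        public string Label { get; set; }
    }

    // Single-column data view that fixes the key order explicitly.
    var scale = mlContext.Data.LoadFromEnumerable(new[]
    {
        new LabelValue { Label = "Disaster" },
        new LabelValue { Label = "Really bad" },
        new LabelValue { Label = "Bad" },
        new LabelValue { Label = "Okay" },
        new LabelValue { Label = "Good" },
        new LabelValue { Label = "Really good" },
        new LabelValue { Label = "Awesome" },
    });

    var toKey = mlContext.Transforms.Conversion.MapValueToKey(
        outputColumnName: "KeyColumn",
        inputColumnName: "Label",
        keyData: scale); // keys 1..7 follow the scale order above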

drake7707 commented 5 years ago

Thanks for the explanation, it's still pretty confusing (no pun intended) though.

If I understand it correctly, the Accuracy and LogLoss metrics are completely dependent on the order of the samples. I would expect that the order of the labels doesn't matter, and that the metrics, which after all indicate the performance of the model, would always be the same regardless of the order of the samples in the test set.

When I changed my training/test set to get a better balance between the classes, it looked as if the model just didn't train at all, while in fact it performed better; it was just the metrics and the confusion matrix that made it look that way. I got lucky and happened to notice the shift in the pattern in my confusion matrix and adjusted my set accordingly, but this behaviour isn't at all intuitive to someone new to ML.NET, which is why I opened this issue.

PeterPann23 commented 5 years ago

Yes, it's a bit voodoo at the moment; it was worse in the past, and I opened quite a few tickets myself. I found that executing the model with the predict function and computing my own statistics helps me understand the quality of my model much better (see the sketch below). I can't reproduce the metrics the Evaluate function returns by scoring the rows myself; same data, different results.
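
A sketch of that do-it-yourself evaluation, continuing from trainedModel, testDataView, and mlContext in the original post. The Prediction class and the key arithmetic are assumptions; the +1 comparison only holds if the keys were assigned ByValue over contiguous labels 0..n:

    // Hypothetical output type mapping the scored columns to plain fields.
    public class Prediction
    {
        public float Label { get; set; }          // ground truth from the test set
        public uint PredictedLabel { get; set; }  // key-typed prediction (1-based)
    }

    var scored = trainedModel.Transform(testDataView);
    var rows = mlContext.Data.CreateEnumerable<Prediction>(scored, reuseRowObject: false);

    int total = 0, correct = 0;
    foreach (var row in rows)
    {
        total++;
        // Keys are 1-based, so with ByValue ordering over labels 0..n the
        // prediction matches the truth when PredictedLabel == Label + 1.
        if (row.PredictedLabel == (uint)row.Label + 1)
            correct++;
    }
    Console.WriteLine($"Manual accuracy: {(double)correct / total:0.0000}");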

I also noticed that the "predicted" multi-class label does not always correspond to the best-scoring prediction, and that numeric predictions are sometimes off by a margin; I run a second model against the prediction just to get the result "normalized".

When you look, you'll notice that the code works well on the test samples, as if it were tuned to pass unit tests rather than to run against live data. I have not seen as good a score with production data using this framework. I also noted that it can't really deal with large datasets; as soon as you go above, say, a few hundred GB, it just loops until you run out of patience.

You will find some nice samples in the source code showing how to implement the models, and the documentation is getting better as well. Read through the posts here and you will find lots of solutions to the same issues you will run into.

Have a look at AutoML: give it your dataset, and it will train several models and generate the code after "tuning" them. At the moment it doesn't always pick the best model, but that bug is already being worked on. I find that AutoML is a nice "bootstrap" and it definitely gets you started.
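
A minimal sketch of that AutoML experiment API (assumes the Microsoft.ML.AutoML package and the trainingDataView from the original post; the time budget is arbitrary):

    // Requires: using Microsoft.ML; using Microsoft.ML.AutoML;
    var settings = new MulticlassExperimentSettings
    {
        MaxExperimentTimeInSeconds = 600 // arbitrary budget for this sketch
    };
    var experiment = mlContext.Auto().CreateMulticlassClassificationExperiment(settings);
    var result = experiment.Execute(trainingDataView, labelColumnName: "Label");

    Console.WriteLine($"Best trainer: {result.BestRun.TrainerName}");
    Console.WriteLine($"MicroAccuracy: {result.BestRun.ValidationMetrics.MicroAccuracy:0.0000}");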

harishsk commented 4 years ago

@drake7707 I am assuming your query has been answered. I am closing the issue. Please reopen if necessary.