Closed antoniovs1029 closed 4 years ago
I believe the core of this problem is in the ImageClassificationTrainer, on this line:
Since labelCount holds the number of labels found in the dataset, this line hardcodes the handling of a dataset with only 1 label as if it had 2 classes. I am still unsure why it was hardcoded this way, since later on (after training) an exception is thrown in the KeyToValueTransformer that is supposed to be added at the end of the pipeline, in here:
It's thrown because it can't find the KeyValues annotation inside the PredictedLabel column of the output schema that results after training the ImageClassificationTrainer.
So, for instance, if I remove the KeyToValue transformer at the end of the pipeline in my source code, then, for the dataset with only 1 label, the pipeline trains without problem, and the PredictedLabel column has the following annotations:
Whereas, if I train it using a dataset with more than 1 label, then the annotations are as follows (notice the KeyValues annotation):
The exact reason why the KeyValues annotation isn't there when the dataset has only 1 label is still not 100% clear to me. It seems to me it has to do with how annotations are supposed to be propagated in multiclass trainers (see issue #3090 and PR #3101).
In particular, I also noticed that the annotations of the Score column change depending on whether the dataset had only 1 label. With more than 1 label, the Score column has annotations called SlotNames and TrainingLabelValues; with only 1 label, those annotations are not included in the schema. For the Score column, I did figure out why those annotations were missing, and the reason is in the MulticlassClassificationScorer:
If CanWrapTrainingLabels and CanWrapSlotNames return true, then the TrainingLabelValues and SlotNames annotations, respectively, get added to the Score column. For a dataset with 1 label, those methods return false, essentially because there's a mismatch between the size of the Score vector (which is 2, because the ImageClassificationTrainer was trained as if there were 2 classes) and the number of labels found by the ValueToKeyTransformer (which is only 1). Generally this mismatch wouldn't happen; it only happens here because of how the 1-label case is handled in ImageClassificationTrainer.
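To make the mismatch concrete, here is a minimal sketch of the kind of size check described above. It is a simplified model and not the ML.NET source: the class ScoreAnnotationCheck and the method CanWrapLabels are hypothetical names, and the only assumption taken from the discussion is that label names are attached to the Score column only when there is one name per score slot.

```csharp
using System;

// Two labels and two score slots: the sizes agree, so the names can be attached.
Console.WriteLine(ScoreAnnotationCheck.CanWrapLabels(2, new[] { "cat", "dog" })); // True

// One label, but the trainer produced a score vector of size 2.
Console.WriteLine(ScoreAnnotationCheck.CanWrapLabels(2, new[] { "dog" })); // False

// Hypothetical stand-in for the check; not the actual ML.NET implementation.
static class ScoreAnnotationCheck
{
    // Returns true only when there is exactly one label name per score slot.
    public static bool CanWrapLabels(int scoreVectorSize, string[] labelNames) =>
        labelNames != null
        && labelNames.Length > 0
        && labelNames.Length == scoreVectorSize;
}
```

In the second call the check fails, which corresponds to the 1-label case where the SlotNames and TrainingLabelValues annotations are missing from the Score column.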
And even though I haven't fully figured out how the missing annotations in the Score column affect the PredictedLabel annotations, I still think it's better to simply throw an exception in ImageClassificationTrainer when labelCount is 1, as is done in the other multiclass trainers I linked to in the first post.
So I've just talked with @codemzs and we decided the best option is to simply throw an exception in here if labelCount is 1:
https://github.com/dotnet/machinelearning/blob/c4e4263188dccf16903b8f3fea7e65213a69c6e3/src/Microsoft.ML.Vision/ImageClassificationTrainer.cs#L606
This avoids making all the different changes required to support the corner case of having only 1 class represented in the dataset.
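A rough sketch of what such a guard could look like follows. The helper name LabelGuard.EnsureMulticlass, the exception type, and the message text are all assumptions for illustration; the actual change in ImageClassificationTrainer may differ.

```csharp
using System;

try
{
    // Simulates the case from this issue: only the 'dog' label was found.
    LabelGuard.EnsureMulticlass(1);
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message);
}

// Hypothetical helper; not part of ML.NET.
static class LabelGuard
{
    // Fails fast, before any training work, when fewer than 2 classes are present.
    public static void EnsureMulticlass(int labelCount)
    {
        if (labelCount < 2)
        {
            throw new InvalidOperationException(
                $"Found {labelCount} distinct label(s) in the training data; " +
                "at least 2 are required for multiclass training.");
        }
    }
}
```

Failing before training starts means the user does not spend training time only to hit an unrelated-looking error afterwards.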
Issue
What did you do? I tried to train a model that uses the ImageClassificationTrainer, using a dataset that only contains images labeled as 'dog'.
What happened? I got a System.InvalidOperationException: 'Metadata KeyValues does not exist' while fitting the model, which isn't very informative about the problem. If I didn't know my dataset only contained one label, or if I were to split the dataset in a way that left the training set with only one label, then getting that exception wouldn't help me fix the problem.
Also notice that the exception is thrown after training is done, so a user would have already invested time in it before noticing that something is wrong.
"System.InvalidOperationException: 'LightGBM Error, code is -1, error message is 'Number of classes should be specified and greater than 1 for multiclass training'.'", or the LinearMulticlassModelParametersBase, which throws a System.ArgumentOutOfRangeException: 'Must be at least 2. Parameter name: numClasses' in here.
If I change my dataset to include other labels, then the exception is gone and it works as expected.
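Until the trainer reports this itself, a check on the caller's side can catch a single-label training split before fitting. The sketch below uses only LINQ and does not touch ML.NET; the class LabelPrecheck and the sample labels are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Labels of a hypothetical training split that happens to contain only one class.
var trainLabels = new[] { "dog", "dog", "dog" };

int distinct = LabelPrecheck.CountDistinct(trainLabels);
Console.WriteLine(distinct); // 1

if (distinct < 2)
{
    Console.WriteLine("Training split has fewer than 2 labels; skip training.");
}

// Hypothetical helper; not part of ML.NET.
static class LabelPrecheck
{
    // Counts distinct label values using ordinal (case-sensitive) comparison.
    public static int CountDistinct(IEnumerable<string> labels) =>
        labels.Distinct(StringComparer.Ordinal).Count();
}
```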
Source code / logs
dataset.zip
Nugets used: