Open · gagy3798 opened this issue 4 years ago
Hi @gagy3798,
I have been able to reproduce your results. I believe the way you are acquiring the model parameters with the following line is incorrect, and you are not correctly casting to `MaximumEntropyModelParameters`:

```csharp
var originalModelParameters = (dataPrepPipeline as TransformerChain<ITransformer>).LastTransformer
    as MulticlassPredictionTransformer<MaximumEntropyModelParameters>;
```

When I ran your code in debug mode and inspected `originalModelParameters.Model`, I saw that the `MaximumEntropyModelParameters` in `originalModelParameters` is not loaded correctly. Here, `Model` has its values under Non-Public members, whereas they should be accessible normally. This is why you're receiving the "No valid training instances found" exception.
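For contrast, a minimal sketch of casting the *trained* model chain rather than the data-prep pipeline (the names `mlContext` and `modelPath` are assumptions, not from the issue):

```csharp
// Sketch: extract the trained predictor's parameters from the loaded model
// (not from the data-prep pipeline) so they can be passed back to Fit().
// `mlContext` and `modelPath` are assumed to exist in the surrounding code.
ITransformer trainedModel = mlContext.Model.Load(modelPath, out DataViewSchema inputSchema);

var predictionTransformer =
    (trainedModel as TransformerChain<ITransformer>)?.LastTransformer
    as MulticlassPredictionTransformer<MaximumEntropyModelParameters>;

if (predictionTransformer == null)
    throw new InvalidOperationException(
        "The loaded model's last transformer is not the expected multiclass predictor.");

MaximumEntropyModelParameters originalModelParameters = predictionTransformer.Model;
```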
Hi @mstfbl, thank you for inspecting. I use ML.NET 1.5.0 and I don't see these values as Non-Public members, but as public members.
Hi @gagy3798,
My teammate Antonio @antoniovs1029 and I debugged your code and error in detail. Thank you Antonio for your help. :D

We realized that you are training your `_trainedModel` model (without the `MapKeyToValue("Area", "PredictedLabel")`) to make predictions on the labels of your first dataset (which range between the values 11-13), and then attempting to retrain your original model with `LbfgsMaximumEntropy` on the labels of your second dataset (which range between 14-17). This is a problem, as your pre-processing pipeline has a ValueToKey mapping transformer on the `Label` column. When trained on the 1st dataset, it learns to map the Label value "11" to "1", "12" to "2", and "13" to "3". However, when you use this same trained transformer on the 2nd dataset, the Label values 14-17 all map to 0, as the transformer has never encountered these values before. These 0 values are interpreted as missing labels, hence the exact error. I've confirmed in my reproduction that these Label values of 14-17 indeed map to 0.

Put simply, you're asking `MulticlassClassification.Trainers.LbfgsMaximumEntropy` to map to values it has never seen before, because the same ValueToKey mapping transformer is used.
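The behavior described above can be seen in isolation with a small sketch (the `Row` class and the label values here are illustrative assumptions, not taken from the issue's data):

```csharp
// Sketch: a ValueToKey mapping fitted on labels 11-13 maps the unseen
// label "14" to the missing key (0), which trainers treat as "no label".
using Microsoft.ML;

class Row { public string Label { get; set; } }

class Demo
{
    static void Main()
    {
        var mlContext = new MLContext();

        var firstBatch = mlContext.Data.LoadFromEnumerable(new[]
        {
            new Row { Label = "11" }, new Row { Label = "12" }, new Row { Label = "13" }
        });
        var secondBatch = mlContext.Data.LoadFromEnumerable(new[]
        {
            new Row { Label = "14" } // never seen during Fit
        });

        var keyMapper = mlContext.Transforms.Conversion
            .MapValueToKey("LabelKey", "Label")
            .Fit(firstBatch);

        // "14" is absent from the learned dictionary, so its key comes out
        // as 0 (the missing-key value) when the fitted mapper is applied.
        var mapped = keyMapper.Transform(secondBatch);
    }
}
```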
There are two ways you can fix this:

1. Make sure the first dataset you train on already contains every label value that can appear later, so the ValueToKey mapping learns all of them up front.
2. Remove the MapKeyToValue from your pre-processing pipeline, and use a new pipeline which consists of your pre-processing pipeline plus a new MapKeyToValue transformer, both before you train and before you retrain your models.

However, this issue is a clear sign that ML.NET does not warn the user that values it does not know how to map are by default mapped to 0, which should not be the case.
Hi @mstfbl
OK, fix 1 works, but it means I can't have a new category (label) when retraining the model. I made changes to data1.csv so it contains all categories. After retraining I get no exceptions, but the new model has 0 accuracy, and every prediction is reported as correct, even for categories which are absolutely not in the training data (category 999). Something is wrong.

Fix 2: maybe I don't understand it. I already have a pipeline both without and with MapKeyToValue, but I'm not able to get it working.

I updated the code:
[data1.zip](https://github.com/dotnet/machinelearning/files/4757484/data1.zip)
```csharp
using GitHubIssueClassification;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApp1
{
    public class GitHubIssueClassification
    {
        static List<GitHubIssueTransformed> testDatas = new List<GitHubIssueTransformed>()
        {
            new GitHubIssueTransformed() { Area = "11", Title = "WHIRLPOOL AWE 50610", Description = "" },
            new GitHubIssueTransformed() { Area = "13", Title = "GORENJE K5151WH", Description = "" },
            new GitHubIssueTransformed() { Area = "13", Title = "sporák", Description = "" },
            new GitHubIssueTransformed() { Area = "14", Title = "FAGOR 4CC-140", Description = "" },
            new GitHubIssueTransformed() { Area = "19", Title = "AEG T8DFE68SC", Description = "" },
            new GitHubIssueTransformed() { Area = "999", Title = "TEST 999", Description = "" }
        };

        private static string _appPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
        private static string _mainDataPath1 => Path.Combine(_appPath, "..", "..", "..", "Data", "data1.csv");
        private static string _mainDataPath2 => Path.Combine(_appPath, "..", "..", "..", "Data", "data2.csv");
        private static string _modelPath => Path.Combine(_appPath, "..", "..", "..", "Models", "trainedModel.zip");
        private static string _keyToValueModelPath => Path.Combine(_appPath, "..", "..", "..", "Models", "keyToValueModel.zip");

        static DataOperationsCatalog.TrainTestData splittedData;
        private static MLContext _mlContext;
        private static PredictionEngine<GitHubIssueTransformed, IssuePrediction> _predEngine;
        private static ITransformer _trainedModel { get; set; }
        private static ITransformer _keyToValueModel { get; set; }
        static IDataView _trainingDataView;

        public static void Run()
        {
            _mlContext = new MLContext(seed: 0);
            var allData = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_mainDataPath1, hasHeader: true);
            splittedData = _mlContext.Data.TrainTestSplit(allData, testFraction: 0.2);
            _trainingDataView = splittedData.TrainSet;
            Console.WriteLine($"=============== Loading Dataset data1.csv (initial data) ===============");
            var transformedData = BuildAndTrainModel(_trainingDataView);
            SecondLap(_mlContext);
        }

        public static IDataView BuildAndTrainModel(IDataView trainingDataView)
        {
            var pipeline = _mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
                .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
                .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))
                .Append(_mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))
                .AppendCacheCheckpoint(_mlContext);
            var trainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features"));
            var keyToValuePipeline = trainingPipeline.Append(_mlContext.Transforms.Conversion.MapKeyToValue("Area", "PredictedLabel"));
            Console.WriteLine($"=============== Training the model ===============");
            _trainedModel = trainingPipeline.Fit(trainingDataView);
            var transformedData = _trainedModel.Transform(trainingDataView);
            _keyToValueModel = keyToValuePipeline.Fit(transformedData);
            _mlContext.Model.Save(_trainedModel, trainingDataView.Schema, _modelPath);
            _mlContext.Model.Save(_keyToValueModel, transformedData.Schema, _keyToValueModelPath);
            Console.WriteLine($"=============== Finished Training the model Ending time: {DateTime.Now.ToString()} ===============");
            Evaluate(_trainingDataView.Schema, transformedData, splittedData.TestSet);
            SinglePredictionFromMemory();
            PredictIssue_FirstLoadModelFromDisk();
            return transformedData;
        }

        public static void Evaluate(DataViewSchema trainingDataViewSchema, IDataView transformedData, IDataView testDataView2 = null)
        {
            Console.WriteLine($"=============== Evaluating to get model's accuracy metrics - Starting time: {DateTime.Now.ToString()} ===============");
            IDataView testDataView = testDataView2;
            var testMetrics = _mlContext.MulticlassClassification.Evaluate(_trainedModel.Transform(testDataView));
            Console.WriteLine($"=============== Evaluating to get model's accuracy metrics - Ending time: {DateTime.Now.ToString()} ===============");
            Console.WriteLine($"*************************************************************************************************************");
            Console.WriteLine($"* Metrics for Multi-class Classification model - Test Data ");
            Console.WriteLine($"*------------------------------------------------------------------------------------------------------------");
            Console.WriteLine($"* MicroAccuracy: {testMetrics.MicroAccuracy:0.###}");
            Console.WriteLine($"* MacroAccuracy: {testMetrics.MacroAccuracy:0.###}");
            Console.WriteLine($"* LogLoss: {testMetrics.LogLoss:#.###}");
            Console.WriteLine($"* LogLossReduction: {testMetrics.LogLossReduction:#.###}");
            Console.WriteLine($"*************************************************************************************************************");
        }

        static void SinglePredictionFromMemory()
        {
            // (OPTIONAL) Try/test a single prediction with the "just-trained model" (Before saving the model)
            Console.WriteLine($"=============== Single Prediction just-trained-model ===============");
            _predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssueTransformed, IssuePrediction>(_keyToValueModel);
            foreach (var testIssue in testDatas)
            {
                var prediction = _predEngine.Predict(testIssue);
                if (prediction.Area.ToString() != testIssue.Area.ToString())
                    Console.ForegroundColor = ConsoleColor.Red;
                else
                    Console.ForegroundColor = ConsoleColor.Blue;
                Console.WriteLine($"=============== predicted result: {prediction.Area} - should be: {testIssue.Area} - {testIssue.Title} ===============");
            }
            Console.ResetColor();
        }

        public static void PredictIssue_FirstLoadModelFromDisk()
        {
            Console.WriteLine("=============== Single Prediction model-loaded-from-disk ===============");
            //ITransformer loadedModel = _mlContext.Model.Load(_modelPath, out var modelInputSchema);
            ITransformer loadedModel = _mlContext.Model.Load(_keyToValueModelPath, out var modelInputSchema);
            foreach (var testIssue in testDatas)
            {
                _predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssueTransformed, IssuePrediction>(loadedModel);
                var prediction = _predEngine.Predict(testIssue);
                if (prediction.Area.ToString() != testIssue.Area.ToString())
                    Console.ForegroundColor = ConsoleColor.Red;
                else
                    Console.ForegroundColor = ConsoleColor.Blue;
                Console.WriteLine($"=============== predicted result: {prediction.Area} - should be: {testIssue.Area} - {testIssue.Title} ===============");
                Console.ResetColor();
            }
        }

        static void SecondLap(MLContext _mlContext)
        {
            Console.WriteLine("\nSecondLap - retrain new data\n");
            Console.WriteLine($"=============== Loading Dataset data2.csv (new data) ===============");
            var newData = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_mainDataPath2, hasHeader: true);
            splittedData = _mlContext.Data.TrainTestSplit(newData, testFraction: 0.2);
            _trainedModel = _mlContext.Model.Load(_modelPath, out var _);
            _keyToValueModel = _mlContext.Model.Load(_keyToValueModelPath, out var _);
            var originalModelParameters = (_trainedModel as TransformerChain<ITransformer>).LastTransformer
                as MulticlassPredictionTransformer<MaximumEntropyModelParameters>;
            var transformedData = _trainedModel.Transform(splittedData.TrainSet);
            _keyToValueModel = _mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features")
                .Fit(transformedData, originalModelParameters.Model);
            _mlContext.Model.Save(_keyToValueModel, transformedData.Schema, _keyToValueModelPath);
            Evaluate(splittedData.TrainSet.Schema, transformedData, splittedData.TestSet);
            SinglePredictionFromMemory();
            PredictIssue_FirstLoadModelFromDisk();
        }
    }

    public class GitHubIssue
    {
        [LoadColumn(0)]
        public string ID { get; set; }
        [LoadColumn(1)]
        public string Area { get; set; }
        [LoadColumn(2)]
        public string Title { get; set; }
        [LoadColumn(3)]
        public string Description { get; set; }
    }

    public class GitHubIssueTransformed : GitHubIssue
    {
        [VectorType(38380)]
        public float[] Features;
    }

    public class IssuePrediction
    {
        //[ColumnName("PredictedLabel")]
        public string Area;
        //[VectorType(66667)]
        //float[] Features;
    }
}
```
Hi @gagy3798,

Currently, retraining classification models with differing labels is not supported in ML.NET. This could be a feature request for future implementation, but as it stands now, labels that were unseen before retraining are input as 0, which results in the "No valid training instances found, all instances have missing features." exception.

For the time being, one way to get around this is to add extra rows to data1.csv that introduce the label values 14-17. The extra rows you add can be empty except for the labels. This way, during retraining, the model will have already seen the labels between 11-17, and you will not get this error. However, I do not know how mathematically correct this would be.
Hi @mstfbl,
Thank you for the support. OK, I included all labels in the first training data batch. Another important problem is that after retraining I get no exception, but I no longer get any predictions and the model totally loses accuracy.
Hi @gagy3798,
Yes, this inaccuracy is due to the fact that the labels 14-17 are added with no real data, so the retraining does not have any real data to retrain from. So for the time being, your best bet would be to match the labels in data1.csv with the labels in data2.csv; this way the labels can be accurately mapped and retraining will provide much better results.
Hi @mstfbl,
I mixed data between the files data1.csv and data2.csv as you suggested, and after retraining I now get some MicroAccuracy (a little worse than after the initial training). But after retraining I still cannot predict; prediction always returns no label.

After changing data in data1.csv I always get an exception, e.g. System.InvalidOperationException: 'Incompatible features column type: 'Vector&lt;Single, 45576&gt;' vs 'Vector&lt;Single, 36473&gt;''. After I change VectorType(XXXXX) to the new number it's OK. Isn't there a better way to get the array size than waiting for the exception?
```csharp
public class GitHubIssueTransformed : GitHubIssue
{
    [VectorType(40649)]
    public float[] Features;
}
```
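One possibility is to read the size from the output schema instead of hard-coding it. A sketch, assuming `transformedData` is the output of the fitted featurization pipeline:

```csharp
// Sketch: query the "Features" column type from the schema to learn the
// vector size, instead of waiting for the exception to report it.
using Microsoft.ML.Data;

var featuresColumn = transformedData.Schema["Features"];
if (featuresColumn.Type is VectorDataViewType vectorType)
    Console.WriteLine($"Features vector size: {vectorType.Size}");
```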
I might recommend using the `keyData` parameter in your `MapValueToKey()`. Assuming you know all the possible class names beforehand, this allows the `MapValueToKey` to store a mapping for all class names.

Once the `MapValueToKey` creates the correct `Key` range, I expect the trainer will also create weights for these classes even if no examples are present. Without examples (data rows) of the future classes, the trainer should still get good accuracy on the classes present in the first round of training, while having the ability to fill in the model weights for them later.
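As a rough sketch of that suggestion (the `LabelValue` class and the label list are assumptions; I believe `keyData` expects a single-column IDataView holding the term values):

```csharp
// Sketch: pre-declare every possible label through the keyData parameter,
// so the ValueToKey mapping is fixed up front instead of learned from data.
var allLabels = new[] { "11", "12", "13", "14", "15", "16", "17" }
    .Select(v => new LabelValue { Value = v });
IDataView keyData = mlContext.Data.LoadFromEnumerable(allLabels);

var valueToKey = mlContext.Transforms.Conversion.MapValueToKey(
    outputColumnName: "Label",
    inputColumnName: "Area",
    keyData: keyData);

// Helper type for the keyData rows (an assumption for this sketch):
class LabelValue { public string Value { get; set; } }
```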
Thank you. It's cool.

But the problem is that I'm still not able to get any prediction after retraining.
System.InvalidOperationException: 'Incompatible features column type: 'Vector<Single, 45576>' vs 'Vector<Single, 36473>''
This is due to refitting the text transform (`FeaturizeText`). The output vector from the second run doesn't match the original (either in size or composition).

> After I change VectorType(XXXXX) to the new number it's OK. Isn't there a better way to get the array size than waiting for the exception?

Resizing the vector to the original size won't help, as the slots will be moved around. For instance, the word "cat" in the first run landed in slot 10328, but in the second run it ends up in slot 9310. In the dictionary-based (non-hashing) option of the text transform, slot numbers are assigned in the order the words are seen in the training dataset. The changed slots mean the model's existing weights would no longer be helpful.
There are two options:

1. Use `xf=LoadTransform{in=ExistingModel.zip}` to load the previously trained transforms. Perhaps in the Estimators API, you can just use the existing model to featurize the dataset, then in your refitting pipeline include only the final trainer (and not the featurization pipeline steps).
2. Use the hashing-based option of the text transform, whose slot assignment does not depend on the training data:

```csharp
.Append(_mlContext.Transforms.Text.FeaturizeText(textColumn, new TextFeaturizingEstimator.Options()
{
    CharFeatureExtractor = new WordHashBagEstimator.Options() { NgramLength = 3 },
    WordFeatureExtractor = new WordHashBagEstimator.Options() { NgramLength = 2 }
}))
```
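The first of those options might look roughly like this in the Estimators API. This is a sketch, not a verified fix: `mlContext`, `modelPath`, and `newTrainSet` are assumed names, and it reuses the parameter cast discussed earlier in the thread:

```csharp
// Sketch: featurize the new data with the previously fitted model so slot
// assignments stay identical, then refit only the final trainer.
ITransformer oldModel = mlContext.Model.Load(modelPath, out _);

// The old model's transforms produce "Label" and "Features" with the
// original slot layout, keeping the existing weights meaningful.
IDataView featurizedNewData = oldModel.Transform(newTrainSet);

var oldPredictor = (oldModel as TransformerChain<ITransformer>)?.LastTransformer
    as MulticlassPredictionTransformer<MaximumEntropyModelParameters>;

// Warm-start the trainer from the previous parameters.
var retrained = mlContext.MulticlassClassification.Trainers
    .LbfgsMaximumEntropy("Label", "Features")
    .Fit(featurizedNewData, oldPredictor.Model);
```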
The only other trainable transform I see in your pipeline is the `MapValueToKey`, which my previous comment discussed how to make stateless.
Thank you @justinormont for the helpful info! To add to Justin's first option, you can use `LoadFromDataLoader` in C# to load from a file or stream, depending on your use case. I've linked the appropriate `LoadFromDataLoader` public methods from the API for your use.
@gagy3798 Do these address your concerns on this issue?
I still can't solve the problem that after retraining I can't get a prediction.
Hi team, are there any new updates on the same issue above with unseen new data?

'No valid training instances found, all instances have missing features'

Kindly let us know, thanks.
System information
Issue
I'm trying to do multiclass LbfgsMaximumEntropy re-training. When trying to Fit new data, I get System.InvalidOperationException: 'No valid training instances found, all instances have missing features.' on this line:

```csharp
ITransformer _keyToValueModel1 = _mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features")
    .Fit(transformedData, originalModelParameters.Model);
```

data1.zip

I would appreciate either help or a multiclass LbfgsMaximumEntropy re-training code sample.
Source code / logs