dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Unseen labels during retraining map to value "0", resulting in System.InvalidOperationException: 'No valid training instances found, all instances have missing features.' #5214

Open gagy3798 opened 4 years ago

gagy3798 commented 4 years ago

System information

Issue

I'm trying to retrain a multiclass LbfgsMaximumEntropy model. When I try to Fit the new data, I get System.InvalidOperationException: 'No valid training instances found, all instances have missing features.' on this line:

ITransformer _keyToValueModel1 = _mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features").Fit(transformedData, originalModelParameters.Model);

I would appreciate either help or a multiclass LbfgsMaximumEntropy retraining code sample.

Source code / logs

[data1.zip](https://github.com/dotnet/machinelearning/files/4744619/data1.zip)
[data2.zip](https://github.com/dotnet/machinelearning/files/4744620/data2.zip)

public class GitHubIssueClassification
  {
    static List<GitHubIssueTransformed> testDatas = new List<GitHubIssueTransformed>()
    {
      new GitHubIssueTransformed() {Area="11", Title="WHIRLPOOL AWE 50610", Description="" },
      new GitHubIssueTransformed() {Area="14", Title="FAGOR 4CC-130 E X", Description="" },
      new GitHubIssueTransformed() {Area="19", Title="AEG T8DFE68SC", Description="" },
      new GitHubIssueTransformed() {Area="999", Title="TEST 999", Description="" }
    };

    private static string _appPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
    private static string _mainDataPath1 => Path.Combine(_appPath, "..", "..", "..", "Data", "data1.csv");
    private static string _mainDataPath2 => Path.Combine(_appPath, "..", "..", "..", "Data", "data2.csv");
    private static string _mainDataPath3 => Path.Combine(_appPath, "..", "..", "..", "Data", "data3.csv");
    private static string _modelPath => Path.Combine(_appPath, "..", "..", "..", "Models", "trainedModel.zip");
    private static string _keyToValueModelPath => Path.Combine(_appPath, "..", "..", "..", "Models", "keyToValueModel.zip");

    static DataOperationsCatalog.TrainTestData splittedData;

    private static MLContext _mlContext;
    private static PredictionEngine<GitHubIssueTransformed, IssuePrediction> _predEngine;
    private static ITransformer _trainedModel { get; set; }
    private static ITransformer _keyToValueModel { get; set; }
    static IDataView _trainingDataView;
    public static void Run()
    {
      _mlContext = new MLContext(seed: 0);

      var allData = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_mainDataPath1, hasHeader: true);
      splittedData = _mlContext.Data.TrainTestSplit(allData, testFraction: 0.09);
      _trainingDataView = splittedData.TrainSet;
      Console.WriteLine($"=============== Finished Loading Dataset  ===============");

      var pipeline = ProcessData();
      var transformedData = BuildAndTrainModel(_trainingDataView, pipeline);
      Evaluate(_trainingDataView.Schema, transformedData, splittedData.TestSet);
      PredictIssue_FirstLoadModelFromDisk();

      SecondLap(_mlContext);
    }

    public static IEstimator<ITransformer> ProcessData()
    {
      Console.WriteLine($"=============== Processing Data ===============");
      var pipeline = _mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
                      .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
                      .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))
                      .Append(_mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))
                      .AppendCacheCheckpoint(_mlContext);

      Console.WriteLine($"=============== Finished Processing Data ===============");

      return pipeline;
    }

    public static IDataView BuildAndTrainModel(IDataView trainingDataView, IEstimator<ITransformer> pipeline)
    {
      //var trainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy/*.SdcaMaximumEntropy*/("Label", "Features"))
      //    .Append(_mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));
      var trainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy/*.SdcaMaximumEntropy*/("Label", "Features"));
      //var keyToValuePipeline = _mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel");
      var keyToValuePipeline = trainingPipeline.Append(_mlContext.Transforms.Conversion.MapKeyToValue("Area", "PredictedLabel"));

      Console.WriteLine($"=============== Training the model  ===============");

      _trainedModel = trainingPipeline.Fit(trainingDataView);
      var transformedData = _trainedModel.Transform(trainingDataView);
      _keyToValueModel = keyToValuePipeline.Fit(transformedData);

      _mlContext.Model.Save(_trainedModel, trainingDataView.Schema, _modelPath);
      _mlContext.Model.Save(_keyToValueModel, transformedData.Schema, _keyToValueModelPath);

      Console.WriteLine($"=============== Finished Training the model Ending time: {DateTime.Now.ToString()} ===============");

      // (OPTIONAL) Try/test a single prediction with the "just-trained model" (Before saving the model)
      Console.WriteLine($"=============== Single Prediction just-trained-model ===============");

      _predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssueTransformed, IssuePrediction>(_keyToValueModel);
      foreach (var testIssue in testDatas)
      {
        var prediction = _predEngine.Predict(testIssue);
        if (prediction.Area.ToString() != testIssue.Area.ToString())
          Console.ForegroundColor = ConsoleColor.Red;
        else
          Console.ForegroundColor = ConsoleColor.Blue;
        Console.WriteLine($"=============== Single Prediction just-trained-model - Result: {prediction.Area}/{testIssue.Area} {testIssue.Title} ===============");
      }
      Console.ResetColor();      

      return transformedData;
    }

    public static void Evaluate(DataViewSchema trainingDataViewSchema, IDataView transformedData, IDataView testDataView2 = null)
    {
      // STEP 5:  Evaluate the model in order to get the model's accuracy metrics
      Console.WriteLine($"=============== Evaluating to get model's accuracy metrics - Starting time: {DateTime.Now.ToString()} ===============");

      IDataView testDataView = testDataView2;

      var testMetrics = _mlContext.MulticlassClassification.Evaluate(_trainedModel.Transform(testDataView));

      Console.WriteLine($"=============== Evaluating to get model's accuracy metrics - Ending time: {DateTime.Now.ToString()} ===============");
      Console.WriteLine($"*************************************************************************************************************");
      Console.WriteLine($"*       Metrics for Multi-class Classification model - Test Data     ");
      Console.WriteLine($"*------------------------------------------------------------------------------------------------------------");
      Console.WriteLine($"*       MicroAccuracy:    {testMetrics.MicroAccuracy:0.###}");
      Console.WriteLine($"*       MacroAccuracy:    {testMetrics.MacroAccuracy:0.###}");
      Console.WriteLine($"*       LogLoss:          {testMetrics.LogLoss:#.###}");
      Console.WriteLine($"*       LogLossReduction: {testMetrics.LogLossReduction:#.###}");
      Console.WriteLine($"*************************************************************************************************************");

      SaveModelAsFile(_mlContext, trainingDataViewSchema, transformedData, _trainedModel, _keyToValueModel);
    }

    public static void PredictIssue_FirstLoadModelFromDisk()
    {
      //ITransformer loadedModel = _mlContext.Model.Load(_modelPath, out var modelInputSchema);
      ITransformer loadedModel = _mlContext.Model.Load(_keyToValueModelPath, out var modelInputSchema);

      foreach (var testIssue in testDatas)
      {
        _predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssueTransformed, IssuePrediction>(loadedModel);
        var prediction = _predEngine.Predict(testIssue);
        if (prediction.Area.ToString() != testIssue.Area.ToString())
          Console.ForegroundColor = ConsoleColor.Red;
        else
          Console.ForegroundColor = ConsoleColor.Blue;
        Console.WriteLine($"=============== Single Prediction - Result: {prediction.Area}/{testIssue.Area} {testIssue.Title} ===============");
        Console.ResetColor();
      }
    }

    private static void SaveModelAsFile(MLContext mlContext, DataViewSchema trainingDataViewSchema,
      IDataView transformedData, ITransformer _trainedModel, ITransformer _keyToValueModel)
    {
      //mlContext.Model.Save(_trainedModel, trainingDataViewSchema, _modelPath);
      //mlContext.Model.Save(_keyToValueModel, transformedData.Schema, _keyToValueModelPath);
      Console.WriteLine("The model is saved to {0}", _modelPath);
    }

    static void SecondLap(MLContext _mlContext)
    {
      var allData = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_mainDataPath2, hasHeader: true);
      splittedData = _mlContext.Data.TrainTestSplit(allData, testFraction: 0.09);
      _trainingDataView = splittedData.TrainSet;

      ITransformer dataPrepPipeline = _mlContext.Model.Load(_modelPath, out var dataPrepPipelineSchema);
      var originalModelParameters = (dataPrepPipeline as TransformerChain<ITransformer>).LastTransformer as MulticlassPredictionTransformer<MaximumEntropyModelParameters>;      

      int rowsCount = splittedData.TrainSet.Preview().RowView.Count();
      //var transformedData = dataPrepPipeline.Transform(splittedData.TrainSet);
      //var transformedData = _keyToValueModel.Transform(splittedData.TrainSet);
      var transformedData = _trainedModel.Transform(splittedData.TrainSet);

      ITransformer _keyToValueModel1 = _mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features")
        .Fit(transformedData, originalModelParameters.Model);
    }
  }
public class GitHubIssue
  {
    [LoadColumn(0)]
    public string ID { get; set; }
    [LoadColumn(1)]
    public string Area { get; set; }
    [LoadColumn(2)]
    public string Title { get; set; }
    [LoadColumn(3)]
    public string Description { get; set; }
  }

  public class GitHubIssueTransformed: GitHubIssue
  {
    //[ColumnName("PredictedLabel")]
    //public string XX;
  }

  public class IssuePrediction
  {
    //[ColumnName("PredictedLabel")]
    public string Area;
  }
mstfbl commented 4 years ago

Hi @gagy3798,

I have been able to reproduce your results. I believe the way you are acquiring the model parameters in the following line is incorrect, and you are not correctly casting to MulticlassPredictionTransformer<MaximumEntropyModelParameters>:

var originalModelParameters = (dataPrepPipeline as TransformerChain<ITransformer>).LastTransformer as MulticlassPredictionTransformer<MaximumEntropyModelParameters>;

When I ran your code in debug mode and inspected originalModelParameters.Model, I saw that the MaximumEntropyModelParameters instance is not loaded correctly (screenshot: inspection_issue_5214). Here, Model has its values under Non-Public members, whereas they should be accessible normally. This is why you're receiving the "No valid training instances found" exception.

gagy3798 commented 4 years ago

Hi @mstfbl, thank you for investigating. I'm using ML.NET 1.5.0, and I see these values as public members, not Non-Public members.

[screenshot: debugger inspection showing the values as public members]

mstfbl commented 4 years ago

Hi @gagy3798,

My teammate Antonio @antoniovs1029 and I debugged your code and the error in detail. Thank you, Antonio, for your help. :D

We realized that you are training your _trainedModel model (without the MapKeyToValue("Area", "PredictedLabel")) to make predictions on the labels of your first dataset (which range between the values 11-13), and then attempting to retrain your original model with LbfgsMaximumEntropy on the labels of your second dataset (which range between 14-17). This is a problem, because your pre-processing pipeline has a ValueToKey mapping transformer on the Label column. When trained on the first dataset, it learns to map the Label value "11" to "1", "12" to "2", and "13" to "3". However, when you use this same trained transformer on the second dataset, the Label values 14-17 all map to 0, as it has never encountered these values before. These 0 values are interpreted as missing labels, hence the exact error you're seeing. I've confirmed that these Label values of 14-17 indeed map to 0 in my reproduction.

Put simply, you're asking MulticlassClassification.Trainers.LbfgsMaximumEntropy to map to values it has never seen before, because the same trained ValueToKey mapping transformer is reused.
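The unseen-label behavior described above can be reproduced in isolation. A minimal sketch (the Row/Demo names and in-memory data are illustrative, not from the attached project):

```csharp
using System;
using Microsoft.ML;

// Row/Demo are illustrative names standing in for the thread's GitHubIssue types.
class Row { public string Area; }

class Demo
{
    public static uint MapUnseen()
    {
        var ml = new MLContext(seed: 0);

        // Fit the ValueToKey mapping on labels 11-13 only (like data1.csv).
        var train = ml.Data.LoadFromEnumerable(new[]
        {
            new Row { Area = "11" }, new Row { Area = "12" }, new Row { Area = "13" }
        });
        var map = ml.Transforms.Conversion.MapValueToKey("Label", "Area").Fit(train);

        // Transform a row whose label the mapping has never seen (like data2.csv).
        var unseen = ml.Data.LoadFromEnumerable(new[] { new Row { Area = "14" } });
        var view = map.Transform(unseen);

        // Read the raw key value with a cursor: seen labels get keys 1..3,
        // while the unseen "14" gets the missing key, 0.
        var col = view.Schema["Label"];
        using (var cursor = view.GetRowCursor(new[] { col }))
        {
            var get = cursor.GetGetter<uint>(col);
            cursor.MoveNext();
            uint key = 0;
            get(ref key);
            return key;
        }
    }

    static void Main() => Console.WriteLine(Demo.MapUnseen()); // prints 0
}
```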

There are two ways you can fix this:

  1. Change the Label data in both of your datasets so that your label values have the same range.
  2. Remove the MapKeyToValue from your pre-processing pipeline, and use a new pipeline which consists of your pre-processing pipeline plus a new MapKeyToValue transformer before you train and also before you retrain your models.
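If option 2 means fitting a fresh label mapping for each training batch, a minimal sketch might look like the following (this interpretation is an assumption on my part; the Issue/Demo names and mini-batches are illustrative, not the attached CSVs):

```csharp
using System;
using Microsoft.ML;

// Illustrative stand-in for the thread's GitHubIssue class.
class Issue { public string Area; public string Title; }

class Demo
{
    public static bool TrainBothBatches()
    {
        var ml = new MLContext(seed: 0);

        var batch1 = ml.Data.LoadFromEnumerable(new[]
        {
            new Issue { Area = "11", Title = "WHIRLPOOL AWE 50610" },
            new Issue { Area = "11", Title = "WHIRLPOOL AWE 6316" },
            new Issue { Area = "12", Title = "FAGOR 4CC-130 E X" },
            new Issue { Area = "12", Title = "FAGOR 3CF-230 N" },
        });
        var batch2 = ml.Data.LoadFromEnumerable(new[]
        {
            new Issue { Area = "14", Title = "FAGOR 4CC-140" },
            new Issue { Area = "14", Title = "FAGOR 4CC-150" },
            new Issue { Area = "19", Title = "AEG T8DFE68SC" },
            new Issue { Area = "19", Title = "AEG T7DBE831" },
        });

        // A *fresh* MapValueToKey is prepended per batch, so each batch's
        // labels get valid keys. Caveat: the key<->class correspondence then
        // differs between the two fitted models.
        IEstimator<ITransformer> PipelineFor() =>
            ml.Transforms.Conversion.MapValueToKey("Label", "Area")
              .Append(ml.Transforms.Text.FeaturizeText("Features", "Title"))
              .Append(ml.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features"));

        var model1 = PipelineFor().Fit(batch1);
        var model2 = PipelineFor().Fit(batch2); // no "missing features" exception
        return model1 != null && model2 != null;
    }

    static void Main() => Console.WriteLine(Demo.TrainBothBatches());
}
```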

However, this issue is a clear sign that ML.NET does not warn the user that values it does not know how to map are mapped to 0 by default, which should not be the case.

gagy3798 commented 4 years ago

Hi @mstfbl

OK, fix 1 works, but it means I can't have a new category (label) when retraining the model. I changed data1.csv so that it contains all categories. After retraining I get no exceptions, but the new model has 0 accuracy, and everything I try to predict comes back as "correct", even categories that are absolutely not in the training data (category 999). Something is wrong.

Fix 2: maybe I don't understand it. I already have one pipeline without MapKeyToValue and one with it, but I'm not able to get it working.

I've updated the code:

[data1.zip](https://github.com/dotnet/machinelearning/files/4757484/data1.zip)

using GitHubIssueClassification;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApp1
{
  public class GitHubIssueClassification
  {
    static List<GitHubIssueTransformed> testDatas = new List<GitHubIssueTransformed>()
    {
      new GitHubIssueTransformed() {Area="11", Title="WHIRLPOOL AWE 50610", Description="" },
      new GitHubIssueTransformed() {Area="13", Title="GORENJE K5151WH", Description="" },
      new GitHubIssueTransformed() {Area="13", Title="sporák", Description="" },
      new GitHubIssueTransformed() {Area="14", Title="FAGOR 4CC-140", Description="" },
      new GitHubIssueTransformed() {Area="19", Title="AEG T8DFE68SC", Description="" },
      new GitHubIssueTransformed() {Area="999", Title="TEST 999", Description="" }
    };

    private static string _appPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
    private static string _mainDataPath1 => Path.Combine(_appPath, "..", "..", "..", "Data", "data1.csv");
    private static string _mainDataPath2 => Path.Combine(_appPath, "..", "..", "..", "Data", "data2.csv");
    private static string _modelPath => Path.Combine(_appPath, "..", "..", "..", "Models", "trainedModel.zip");
    private static string _keyToValueModelPath => Path.Combine(_appPath, "..", "..", "..", "Models", "keyToValueModel.zip");

    static DataOperationsCatalog.TrainTestData splittedData;

    private static MLContext _mlContext;
    private static PredictionEngine<GitHubIssueTransformed, IssuePrediction> _predEngine;
    private static ITransformer _trainedModel { get; set; }
    private static ITransformer _keyToValueModel { get; set; }
    static IDataView _trainingDataView;
    public static void Run()
    {
      _mlContext = new MLContext(seed: 0);

      var allData = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_mainDataPath1, hasHeader: true);
      splittedData = _mlContext.Data.TrainTestSplit(allData, testFraction: 0.2);
      _trainingDataView = splittedData.TrainSet;
      Console.WriteLine($"=============== Loading Dataset data1.csv (initial data) ===============");

      var transformedData = BuildAndTrainModel(_trainingDataView);

      SecondLap(_mlContext);
    }

    public static IDataView BuildAndTrainModel(IDataView trainingDataView)
    {
      var pipeline = _mlContext.Transforms.Conversion.MapValueToKey(inputColumnName: "Area", outputColumnName: "Label")
        .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Title", outputColumnName: "TitleFeaturized"))
        .Append(_mlContext.Transforms.Text.FeaturizeText(inputColumnName: "Description", outputColumnName: "DescriptionFeaturized"))
        .Append(_mlContext.Transforms.Concatenate("Features", "TitleFeaturized", "DescriptionFeaturized"))
        .AppendCacheCheckpoint(_mlContext);

      var trainingPipeline = pipeline.Append(_mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features"));

      var keyToValuePipeline = trainingPipeline.Append(_mlContext.Transforms.Conversion.MapKeyToValue("Area", "PredictedLabel"));

      Console.WriteLine($"=============== Training the model  ===============");

      _trainedModel = trainingPipeline.Fit(trainingDataView);
      var transformedData = _trainedModel.Transform(trainingDataView);
      _keyToValueModel = keyToValuePipeline.Fit(transformedData);

      _mlContext.Model.Save(_trainedModel, trainingDataView.Schema, _modelPath);
      _mlContext.Model.Save(_keyToValueModel, transformedData.Schema, _keyToValueModelPath);

      Console.WriteLine($"=============== Finished Training the model Ending time: {DateTime.Now.ToString()} ===============");

      Evaluate(_trainingDataView.Schema, transformedData, splittedData.TestSet);
      SinglePredictionFromMemory();
      PredictIssue_FirstLoadModelFromDisk();

      return transformedData;
    }

    public static void Evaluate(DataViewSchema trainingDataViewSchema, IDataView transformedData, IDataView testDataView2 = null)
    {
      Console.WriteLine($"=============== Evaluating to get model's accuracy metrics - Starting time: {DateTime.Now.ToString()} ===============");

      IDataView testDataView = testDataView2;

      var testMetrics = _mlContext.MulticlassClassification.Evaluate(_trainedModel.Transform(testDataView));

      Console.WriteLine($"=============== Evaluating to get model's accuracy metrics - Ending time: {DateTime.Now.ToString()} ===============");
      Console.WriteLine($"*************************************************************************************************************");
      Console.WriteLine($"*       Metrics for Multi-class Classification model - Test Data     ");
      Console.WriteLine($"*------------------------------------------------------------------------------------------------------------");
      Console.WriteLine($"*       MicroAccuracy:    {testMetrics.MicroAccuracy:0.###}");
      Console.WriteLine($"*       MacroAccuracy:    {testMetrics.MacroAccuracy:0.###}");
      Console.WriteLine($"*       LogLoss:          {testMetrics.LogLoss:#.###}");
      Console.WriteLine($"*       LogLossReduction: {testMetrics.LogLossReduction:#.###}");
      Console.WriteLine($"*************************************************************************************************************");
    }

    static void SinglePredictionFromMemory()
    {
      // (OPTIONAL) Try/test a single prediction with the "just-trained model" (Before saving the model)
      Console.WriteLine($"=============== Single Prediction just-trained-model ===============");

      _predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssueTransformed, IssuePrediction>(_keyToValueModel);
      foreach (var testIssue in testDatas)
      {
        var prediction = _predEngine.Predict(testIssue);
        if (prediction.Area.ToString() != testIssue.Area.ToString())
          Console.ForegroundColor = ConsoleColor.Red;
        else
          Console.ForegroundColor = ConsoleColor.Blue;
        Console.WriteLine($"=============== predicted result: {prediction.Area} - should be: {testIssue.Area} - {testIssue.Title} ===============");
      }
      Console.ResetColor();
    }

    public static void PredictIssue_FirstLoadModelFromDisk()
    {
      Console.WriteLine("=============== Single Prediction model-loaded-from-disk ===============");

      //ITransformer loadedModel = _mlContext.Model.Load(_modelPath, out var modelInputSchema);
      ITransformer loadedModel = _mlContext.Model.Load(_keyToValueModelPath, out var modelInputSchema);

      foreach (var testIssue in testDatas)
      {
        _predEngine = _mlContext.Model.CreatePredictionEngine<GitHubIssueTransformed, IssuePrediction>(loadedModel);
        var prediction = _predEngine.Predict(testIssue);
        if (prediction.Area.ToString() != testIssue.Area.ToString())
          Console.ForegroundColor = ConsoleColor.Red;
        else
          Console.ForegroundColor = ConsoleColor.Blue;
        Console.WriteLine($"=============== predicted result: {prediction.Area} - should be: {testIssue.Area} - {testIssue.Title} ===============");
        Console.ResetColor();
      }
    }

    static void SecondLap(MLContext _mlContext)
    {
      Console.WriteLine("\nSecondLap - retrain new data\n");
      Console.WriteLine($"=============== Loading Dataset data2.csv (new data) ===============");
      var newData = _mlContext.Data.LoadFromTextFile<GitHubIssue>(_mainDataPath2, hasHeader: true);
      splittedData = _mlContext.Data.TrainTestSplit(newData, testFraction: 0.2);

      _trainedModel = _mlContext.Model.Load(_modelPath, out var _);
      _keyToValueModel = _mlContext.Model.Load(_keyToValueModelPath, out var _);
      var originalModelParameters = (_trainedModel as TransformerChain<ITransformer>).LastTransformer
        as MulticlassPredictionTransformer<MaximumEntropyModelParameters>;

      var transformedData = _trainedModel.Transform(splittedData.TrainSet);

      _keyToValueModel = _mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy("Label", "Features")
        .Fit(transformedData, originalModelParameters.Model);

      _mlContext.Model.Save(_keyToValueModel, transformedData.Schema, _keyToValueModelPath);

      Evaluate(splittedData.TrainSet.Schema, transformedData, splittedData.TestSet);
      SinglePredictionFromMemory();
      PredictIssue_FirstLoadModelFromDisk();
    }
  }

  public class GitHubIssue
  {
    [LoadColumn(0)]
    public string ID { get; set; }
    [LoadColumn(1)]
    public string Area { get; set; }
    [LoadColumn(2)]
    public string Title { get; set; }
    [LoadColumn(3)]
    public string Description { get; set; }
  }

  public class GitHubIssueTransformed : GitHubIssue
  {
    [VectorType(38380)]
    public float[] Features;
  }

  public class IssuePrediction
  {
    //[ColumnName("PredictedLabel")]
    public string Area;

    //[VectorType(66667)]
    //float[] Features;
  }
}
mstfbl commented 4 years ago

Hi @gagy3798 ,

Currently, retraining of classification models with differing labels is not supported in ML.NET. This could be a feature request for future implementation, but as it stands now, labels that are unseen before retraining are input as 0, which results in the "No valid training instances found, all instances have missing features." exception. For the time being, one way to get around this is to add extra rows to data1.csv that introduce the label values 14-17. The extra rows can be empty except for the labels. That way, during retraining, the model would have already seen labels between 11-17 and you would not get this error. However, I do not know how mathematically sound this would be.

gagy3798 commented 4 years ago

Hi @mstfbl,

thank you for the support. OK, I put all labels in the first training data batch. Another important problem is that after retraining I get no exception, but I no longer get any predictions and totally lose model accuracy.

[screenshot: results after SecondLap retraining]

mstfbl commented 4 years ago

Hi @gagy3798,

Yes, this inaccuracy is due to the fact that the labels 14-17 were added with no real data, so the retraining does not have any real data to learn from. For the time being, your best bet would be to match the labels in data1.csv with the labels in data2.csv; that way the labels can be accurately mapped and retraining will give much better results.

gagy3798 commented 4 years ago

Hi @mstfbl,

  1. I mixed data between data1.csv and data2.csv as you suggested, and after retraining I now get some MicroAccuracy (a little worse than after the initial training). But after retraining I still cannot predict; prediction always returns no label.

  2. After changing the data in data1.csv I always get an exception, e.g. System.InvalidOperationException: 'Incompatible features column type: 'Vector<Single, 45576>' vs 'Vector<Single, 36473>''. After I change VectorType(XXXXX) to the new number it's OK. Isn't there a better way to get the array size than waiting for the exception?

    public class GitHubIssueTransformed : GitHubIssue
    {
      [VectorType(40649)]
      public float[] Features;
    }
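For what it's worth, the Features width can also be read from a fitted model's output schema rather than hard-coded into the attribute. A minimal self-contained sketch (the Row/Demo names are illustrative; with a model loaded from disk you would call GetOutputSchema on the loaded transformer the same way):

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Illustrative stand-in for the thread's input type.
class Row { public string Title; }

class Demo
{
    public static int FeaturesWidth()
    {
        var ml = new MLContext(seed: 0);
        var data = ml.Data.LoadFromEnumerable(new[]
        {
            new Row { Title = "WHIRLPOOL AWE 50610" },
            new Row { Title = "AEG T8DFE68SC" },
        });

        // Stand-in for the fitted featurization pipeline from the thread.
        var model = ml.Transforms.Text.FeaturizeText("Features", "Title").Fit(data);

        // Read the vector width from the output schema instead of hard-coding
        // it into a [VectorType(...)] attribute.
        var outSchema = model.GetOutputSchema(data.Schema);
        return ((VectorDataViewType)outSchema["Features"].Type).Size;
    }

    static void Main() => Console.WriteLine(Demo.FeaturesWidth());
}
```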

data1.zip data2.zip

justinormont commented 4 years ago

I might recommend using the keyData parameter in your MapValueToKey(). Assuming you know all the possible class names beforehand, this allows the MapValueToKey to store a mapping for all class names.

https://github.com/dotnet/machinelearning/blob/140cb70b7d9e38a65500549f963f6bbe171ec0ab/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Conversion/MapValueToKeyMultiColumn.cs#L61-L86

Once the MapValueToKey creates the correct Key range, I expect the trainer will also create weights for these classes even if no examples are present. Without examples (data rows) of the future classes, the trainer should still get good accuracy on the classes present in the first round of training, while having the ability to fill in the model weights for them later.
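A minimal self-contained sketch of the keyData approach (the Row/Demo names are illustrative; the label values mirror the thread's datasets):

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Illustrative stand-in for the thread's label-bearing type.
class Row { public string Area; }

class Demo
{
    public static uint MapUnseenWithKeyData()
    {
        var ml = new MLContext(seed: 0);

        // The complete set of class names, known beforehand.
        var allAreas = ml.Data.LoadFromEnumerable(
            new[] { "11", "12", "13", "14", "15", "16", "17", "18", "19" }
                .Select(a => new Row { Area = a }));

        // A first training batch containing only a subset of the labels.
        var train = ml.Data.LoadFromEnumerable(
            new[] { new Row { Area = "11" }, new Row { Area = "13" } });

        // keyData pins the key range to the full label set up front, so labels
        // that never occur in `train` still get valid (non-zero) keys later.
        var map = ml.Transforms.Conversion
            .MapValueToKey("Label", "Area", keyData: allAreas)
            .Fit(train);

        var later = ml.Data.LoadFromEnumerable(new[] { new Row { Area = "14" } });
        var view = map.Transform(later);
        var col = view.Schema["Label"];
        using (var cursor = view.GetRowCursor(new[] { col }))
        {
            var get = cursor.GetGetter<uint>(col);
            cursor.MoveNext();
            uint key = 0;
            get(ref key);
            return key; // non-zero, unlike the default behavior for unseen labels
        }
    }

    static void Main() => Console.WriteLine(Demo.MapUnseenWithKeyData());
}
```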

gagy3798 commented 4 years ago

Thank you, that's cool.

But the problem is that I'm still not able to get any predictions after retraining.

justinormont commented 4 years ago

System.InvalidOperationException: 'Incompatible features column type: 'Vector<Single, 45576>' vs 'Vector<Single, 36473>''

This is due to refitting the text transform (FeaturizeText). The output vector from the second run doesn't match the original (either in size or composition).

After I change VectorType(XXXXX) to new number its ok. Isn't there better way how to get array size than waiting for exception?

Resizing the vector to the original size won't help, as the slots will have moved around. For instance, the word "cat" landed in slot 10328 in the first run but ends up in slot 9310 in the second. In the dictionary-based (non-hashing) option of the text transform, slot numbers are assigned in the order the words are seen in the training dataset. Because the slots change, the model's existing weights would not be helpful.

There are two options:

  1. Reuse the existing text transform for the second refitting pass. I'm unsure how to do this in the C# Estimators API being used here. In the MAML API language I would use xf=LoadTransform{in=ExistingModel.zip} to load the previously trained transforms. Perhaps in the Estimators API you can just use the existing model to featurize the dataset, then include only the final trainer (and not the featurization steps) in your refitting pipeline.
  2. Use stateless transforms. A hashing text transform ensures the slots always align between runs: the hashing is stable, so "cat" always lands in the same slot.
    .Append(_mlContext.Transforms.Text.FeaturizeText(textColumn, new TextFeaturizingEstimator.Options()
      {
        CharFeatureExtractor = new WordHashBagEstimator.Options() { NgramLength = 3 },
        WordFeatureExtractor = new WordHashBagEstimator.Options() { NgramLength = 2 }
      }))

    The only other trainable transform I see in your pipeline is the MapValueToKey, which I discussed in my previous comment on how to make it stateless.

mstfbl commented 4 years ago

Thank you @justinormont for the helpful info! To add to Justin's first option, you can use LoadFromDataLoader in C# to load from a file or stream, depending on your use case. I've linked the appropriate LoadFromDataLoader public methods from the API for your use.

@gagy3798 Do these address your concerns on this issue?

gagy3798 commented 4 years ago

I still can't solve the problem that after retraining I can't get a prediction.

ahmwai commented 1 year ago

Hi team, are there any new updates on the issue above with unseen new data?

'No valid training instances found, all instances have missing features'

Kindly let us know.

Thanks