dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Error During Retraining with New Labels #7187

Open willysoft opened 4 months ago

willysoft commented 4 months ago

Issue Description

I encountered an issue while retraining a model with ML.NET. Retraining works when the new data contains only labels that already exist in the original training data, but the following call fails when the new data introduces labels that were not present in the original training data:

// Retrain model
var retrainedModel = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(
    new LbfgsMaximumEntropyMulticlassTrainer.Options() { 
        L1Regularization = 0.1195667F, 
        L2Regularization = 0.03125F, 
        LabelColumnName = @"col1", 
        FeatureColumnName = @"Features" 
    }).Fit(transformedNewData, originalModelParameters);

Error Message

System.InvalidOperationException: 'No valid training instances found, all instances have missing features.'

Steps to Reproduce

  1. Train an initial model using a dataset with a specific set of labels.
  2. Attempt to retrain the model using a new dataset that includes labels not present in the original dataset.

Expected Behavior

The model should be able to retrain successfully even when new labels are introduced in the retraining dataset.

Actual Behavior

The retraining process fails with an InvalidOperationException, stating that there are no valid training instances because all instances have missing features.
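One way to narrow this down is to check whether the new labels survive the saved data preparation pipeline. The sketch below is only an illustration under an assumption that this report does not confirm: that the pipeline converts col1 to a key column with MapValueToKey fitted on the original labels. Under that assumption, a label the mapping has never seen becomes a missing key (raw value 0), and rows with missing labels may be skipped by the trainer. The Row class, the label values "A", "B", "C", and the MissingLabelCheck helper are hypothetical names used only for this example.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input type with only the label column from the report.
public class Row
{
    public string col1 { get; set; } = "";
}

public static class MissingLabelCheck
{
    // Counts rows whose label key is missing after applying a
    // value-to-key mapping that was fitted on the original labels only.
    public static int CountMissingLabelKeys(string[] originalLabels, string[] newLabels)
    {
        var mlContext = new MLContext();

        var original = mlContext.Data.LoadFromEnumerable(
            originalLabels.Select(l => new Row { col1 = l }));

        // Assumed data preparation step: label text -> key, fitted on the original data.
        var labelMapper = mlContext.Transforms.Conversion
            .MapValueToKey("col1")
            .Fit(original);

        var newData = mlContext.Data.LoadFromEnumerable(
            newLabels.Select(l => new Row { col1 = l }));
        var transformed = labelMapper.Transform(newData);

        // Key columns expose their raw value as uint; 0 denotes a missing key.
        return transformed.GetColumn<uint>("col1").Count(k => k == 0);
    }

    public static void Main()
    {
        // "C" does not appear in the original labels.
        int missing = CountMissingLabelKeys(
            new[] { "A", "B" },
            new[] { "C", "C" });
        Console.WriteLine($"{missing} row(s) have a missing label key");
    }
}
```

If the same check on transformedNewData reports that every row has a missing key, the mapping in the saved pipeline, rather than the feature columns, would be a plausible source of the exception.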

Environment

Code Sample

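using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;

public static void ReTrain(string outputModelPath, IEnumerable<ModelInput> newDatas)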
{
    var mlContext = new MLContext();

    // Define DataViewSchema of data prep pipeline and trained model
    DataViewSchema dataPrepPipelineSchema, modelSchema;

    // Load data preparation pipeline and trained model
    var dataPrepPipeline = mlContext.Model.Load("data_preparation_pipeline.zip", out dataPrepPipelineSchema);
    var trainedModel = mlContext.Model.Load("ogd_model.zip", out modelSchema);

    // Extract trained model parameters
    var transformers = (IEnumerable<ITransformer>)trainedModel;
    var originalModelParameters = transformers
        .OfType<MulticlassPredictionTransformer<MaximumEntropyModelParameters>>()
        .FirstOrDefault()?.Model;

    // Load New Data
    var newDataView = mlContext.Data.LoadFromEnumerable(newDatas);

    // Preprocess Data
    var transformedNewData = dataPrepPipeline.Transform(newDataView);

    // Retrain model
    var retrainedModel = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(
        new LbfgsMaximumEntropyMulticlassTrainer.Options() { 
            L1Regularization = 0.1195667F, 
            L2Regularization = 0.03125F, 
            LabelColumnName = @"col1", 
            FeatureColumnName = @"Features" 
        }).Fit(transformedNewData, originalModelParameters);
}