dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International

Named Entity Recognition: Generated code to retrain is not working #2877

Open Polak149 opened 7 months ago

Polak149 commented 7 months ago

The Train function generated by Model Builder does not work for the "Named Entity Recognition" scenario and throws an exception:

System.ArgumentOutOfRangeException: 'Cannot map column (name: Label, type: Key<UInt32, 0-0>) in data to the user-defined type, Microsoft.ML.Data.VBuffer`1[System.UInt32]. Arg_ParamName_Name'

Using Model Builder, I was able to generate a Named Entity Recognition ML.NET model. Model Builder generated a *.training.cs file containing a "Train" function:

/// <summary>
/// Train a new model with the provided dataset.
/// </summary>
/// <param name="outputModelPath">File path for saving the model. Should be similar to "C:\YourPath\ModelName.mlnet"</param>
/// <param name="inputDataFilePath">Path to the data file for training.</param>
/// <param name="separatorChar">Separator character for delimited training file.</param>
/// <param name="hasHeader">Boolean if training file has a header.</param>
public static void Train(string outputModelPath, string inputDataFilePath = RetrainFilePath, char separatorChar = RetrainSeparatorChar, bool hasHeader = RetrainHasHeader, bool allowQuoting = RetrainAllowQuoting)
{
    var mlContext = new MLContext();
    var data = LoadIDataViewFromFile(mlContext, inputDataFilePath, separatorChar, hasHeader, true);
    var model = RetrainModel(mlContext, data);
    SaveModel(mlContext, model, data, outputModelPath);
}

Calling this function throws an exception in RetrainModel:

/// <summary>
/// Retrain model using the pipeline generated as part of the training process.
/// </summary>
/// <param name="mlContext"></param>
/// <param name="trainData"></param>
/// <returns></returns>
public static ITransformer RetrainModel(MLContext mlContext, IDataView trainData)
{
    var pipeline = BuildPipeline(mlContext);
    var model = pipeline.Fit(trainData); // <- the exception is thrown here

    return model;
}

System.ArgumentOutOfRangeException: 'Cannot map column (name: Label, type: Key<UInt32, 0-0>) in data to the user-defined type, Microsoft.ML.Data.VBuffer`1[System.UInt32]. Arg_ParamName_Name'

Here is an example dataset I made for this post, but every dataset I have tried fails the same way: test data example.txt

Polak149 commented 6 months ago

The problem originates in the 'LoadIDataViewFromFile' function in *.training.cs, which loads dataset.txt without the tags. I was able to work around it by writing my own training function:

private class Label(string key)
{
    public readonly string Key = key;
}

public static void TrainNER(string outputModelPath, string inputLabelsFilePath, string inputDataFilePath)
{
    IEnumerable<Label> GetLabels(string inputLabelsFilePath)
    {
        var lines = File.ReadLines(inputLabelsFilePath);
        return lines.Select(x => new Label(x));
    }
    IEnumerable<ModelInput> GetLine(string fileName)
    {
        using StreamReader sr = File.OpenText(fileName);
        string? line;
        while ((line = sr.ReadLine()) != null)
        {
            var split = line.Split('\t');
            yield return new ModelInput()
            {
                Sentence = split[0],
                Label = split[1..]
            };
        }
    }
    var mlContext = new MLContext();

    var labels = mlContext.Data.LoadFromEnumerable(GetLabels(inputLabelsFilePath));
    var dataView = mlContext.Data.LoadFromEnumerable(GetLine(inputDataFilePath));

    var chain = new EstimatorChain<ITransformer>();
    var estimator = chain.Append(mlContext.Transforms.Conversion.MapValueToKey("Label", keyData: labels))
        .Append(mlContext.MulticlassClassification.Trainers.NamedEntityRecognition(outputColumnName: "predicted_label", batchSize: 32, maxEpochs: 10))
        .Append(mlContext.Transforms.Conversion.MapKeyToValue("predicted_label"));
    using var transformer = estimator.Fit(dataView);

    // Function automatically generated in *.training.cs
    SaveModel(mlContext, transformer, dataView, outputModelPath);
}
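
The workaround above reads each line of the data file as a tab-separated record: the first field is the sentence and every remaining field is a label. A minimal sketch of a check that could be run on a data file before training follows. It assumes the trainer expects exactly one label per whitespace-separated word, which this thread does not confirm. `NerLineCheck` and its methods are hypothetical names and are not part of the generated *.training.cs file.

```csharp
using System;

// Hypothetical helper, not part of the generated *.training.cs file.
public static class NerLineCheck
{
    // Splits one tab-separated line into the sentence (first field)
    // and its labels (all remaining fields), mirroring GetLine above.
    public static (string Sentence, string[] Labels) Parse(string line)
    {
        var fields = line.Split('\t');
        return (fields[0], fields[1..]);
    }

    // Assumption: one label is required per whitespace-separated word.
    // Returns false when the line has no labels or the counts differ.
    public static bool IsAligned(string line)
    {
        var (sentence, labels) = Parse(line);
        var words = sentence.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        return labels.Length > 0 && words.Length == labels.Length;
    }
}
```

For example, filtering `File.ReadLines(inputDataFilePath)` with `line => !NerLineCheck.IsAligned(line)` would list the lines worth inspecting before calling `TrainNER`.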

LittleLittleCloud commented 6 months ago

@zewditu Can you take a look?