dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Failed in using MultiClassClassification trainers other than StochasticDualCoordinateAscent with error "System.ArgumentOutOfRangeException: 'Schema mismatch for label column '': expected Key<U4>, got R4" #2656

Closed darren-zdc closed 5 years ago

darren-zdc commented 5 years ago

Issue

I'm trying to use other MulticlassClassification trainers, but none of them work. The only one that succeeds is StochasticDualCoordinateAscent. If I change to LogisticRegression or NaiveBayes, I always get the error "System.ArgumentOutOfRangeException: 'Schema mismatch for label column '': expected Key<U4>, got R4".

MultiData.cs

public class MultiData
{
    [LoadColumn(0)]
    public string DataValue { get; set; }

    [LoadColumn(1)]
    public float Label { get; set; }
}

MultiDataPrediction.cs

public class MultiDataPrediction
{
    public float[] Score { get; set; }
}

BuildTrainEvaluateAndSaveModel() function

            // STEP 1: Common data loading configuration
            IDataView trainingDataView = mlContext.Data.ReadFromTextFile<MultiData>(TrainMultiDataPath1, hasHeader: false);
            IDataView testDataView = mlContext.Data.ReadFromTextFile<MultiData>(TestMultiDataPath, hasHeader: false);

            // STEP 2: Common data process configuration with pipeline data transformations          
            var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText(outputColumnName: DefaultColumnNames.Features, inputColumnName: nameof(MultiData.DataValue))
                .Append(mlContext.Transforms.Text.NormalizeText("NormalizedData", nameof(MultiData.DataValue)))
                .Append(mlContext.Transforms.Text.TokenizeCharacters("DataChars", "NormalizedData"))
                .Append(new NgramExtractingEstimator(mlContext, "BagOfTrichar", "DataChars",
                            ngramLength: 3, weighting: NgramExtractingEstimator.WeightingCriteria.TfIdf));

            // (OPTIONAL) Peek data (such as 2 records) in training DataView after applying the ProcessPipeline's transformations into "Features" 
            //ConsoleHelper.PeekDataViewInConsole<MultiData>(mlContext, trainingDataView, dataProcessPipeline, 2);
            //ConsoleHelper.PeekVectorColumnDataInConsole(mlContext, DefaultColumnNames.Features, trainingDataView, dataProcessPipeline, 1);

            // STEP 3: Set the training algorithm, then create and config the modelBuilder          
            var trainer = mlContext.MulticlassClassification.Trainers.NaiveBayes(labelColumn: nameof(MultiData.Label), featureColumn: DefaultColumnNames.Features);
            var trainingPipeline = dataProcessPipeline.Append(trainer);

            // STEP 4: Train the model fitting to the DataSet
            Console.WriteLine("=============== Training the model ===============");
            ITransformer trainedModel = trainingPipeline.Fit(trainingDataView);

Remark: Changing the type of MultiData.Label to UInt32 does not work either. It fails with the error "System.ArgumentOutOfRangeException: 'Schema mismatch for label column '': expected Key<U4>, got U4".

Ivanidzo4ka commented 5 years ago

Related to https://github.com/dotnet/machinelearning/issues/2628

darren-zdc commented 5 years ago

Thanks for your reply! I solved it by adding .Append(mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: DefaultColumnNames.Label, inputColumnName: nameof(MultiData.Label)));

Maybe this line should be added to all the multiclass classification samples. They all use SDCA, and SDCA appears to do the key mapping automatically, so the step is easy to miss. That would be very helpful for new learners.
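
For readers wondering what a value-to-key mapping does conceptually, here is a minimal sketch in plain C# with no ML.NET dependency. It assumes each distinct label gets a 1-based key in order of first appearance, with 0 left free to stand for a missing value. The helper name MapValuesToKeys is hypothetical, and this is an illustration of the idea, not ML.NET's actual implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an ML.NET API): assigns each distinct raw label
// a 1-based key in order of first appearance. Key 0 is never assigned here,
// so it stays available to represent a missing value.
static uint[] MapValuesToKeys(IReadOnlyList<float> labels)
{
    var keyOf = new Dictionary<float, uint>();
    var keys = new uint[labels.Count];
    for (int i = 0; i < labels.Count; i++)
    {
        if (!keyOf.TryGetValue(labels[i], out uint key))
        {
            // First time this label is seen: give it the next free key.
            key = (uint)keyOf.Count + 1u;
            keyOf[labels[i]] = key;
        }
        keys[i] = key;
    }
    return keys;
}

// Uses C# 9+ top-level statements for brevity.
float[] rawLabels = { 3f, 7f, 3f, 9f };
uint[] keys = MapValuesToKeys(rawLabels);
Console.WriteLine(string.Join(", ", keys)); // prints "1, 2, 1, 3"
```

With the labels mapped this way, a trainer that expects a key-typed label column sees three classes with keys 1 to 3 rather than raw float values, which is the kind of input the "expected Key<U4>" error is asking for.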