dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Field with type string cannot be transformed for one hot encoder #6889

Open VadimPeczynski opened 7 months ago

VadimPeczynski commented 7 months ago

System Information (please complete the following information):

Describe the bug: I'm not able to process the data I'm providing when the model uses a one-hot encoder. The string column cannot be processed. [image]

To Reproduce (steps to reproduce the behavior):

using Microsoft.Data.Analysis;
using Microsoft.ML;

var mlContext = new MLContext();

// Define DataViewSchema for the data preparation pipeline and the trained model
DataViewSchema dataPrepPipelineSchema, modelSchema;

// Load the saved data preparation pipeline and the trained model
ITransformer dataPrepPipeline = mlContext.Model.Load("data_preparation_pipeline.zip", out dataPrepPipelineSchema);
ITransformer predictionPipeline = mlContext.Model.Load("model.zip", out modelSchema);

// Load new data
var newData = DataFrame.LoadCsv("data/input.csv");

// Preprocess data
IDataView transformedNewData = dataPrepPipeline.Transform(newData);

// Run the trained model on the preprocessed data
IDataView predictions = predictionPipeline.Transform(transformedNewData);

Expected behavior: The model can load data of type string.

Attachments: data_preparation_pipeline.zip, model.zip, input.csv

luisquintanilla commented 7 months ago

Hi @VadimPeczynski,

Is this the right column or value? The error says you're trying to load a float value where a string is expected. Do you have the actual pipeline code available, so we can see how you're building the data prep pipeline?
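One way to narrow this down is to compare the input schema stored with the saved pipeline against the schema of the data being passed to Transform. The following is only a minimal sketch: SchemaCheck and FindMismatches are hypothetical helper names introduced here for illustration, not part of ML.NET, and the sketch assumes that a type mismatch on an input column is what the error is reporting.

```csharp
using System.Collections.Generic;
using Microsoft.ML;

// Hypothetical helper (not an ML.NET API): lists columns whose type in the
// incoming data differs from what the saved pipeline expects.
public static class SchemaCheck
{
    public static List<string> FindMismatches(DataViewSchema expected, DataViewSchema actual)
    {
        var problems = new List<string>();
        foreach (DataViewSchema.Column column in expected)
        {
            // Hidden columns are shadowed by a later column of the same name; skip them.
            if (column.IsHidden)
                continue;

            DataViewSchema.Column? match = actual.GetColumnOrNull(column.Name);
            if (match == null)
                problems.Add($"{column.Name}: missing (expected {column.Type})");
            else if (!match.Value.Type.Equals(column.Type))
                problems.Add($"{column.Name}: expected {column.Type}, got {match.Value.Type}");
        }
        return problems;
    }
}
```

With the variables from the repro code, this could be called as `SchemaCheck.FindMismatches(dataPrepPipelineSchema, ((IDataView)newData).Schema)`. Any line it returns for ocean_proximity or total_bedrooms would point at the column whose inferred CSV type disagrees with the type the pipeline was fitted on.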

ghost commented 7 months ago

This issue has been marked needs-author-action and may be missing some important information.

VadimPeczynski commented 6 months ago

Hi @luisquintanilla,

The code for the transformation pipeline looks like this:

var pipelineEstimator =
    mlContext.Transforms.ReplaceMissingValues(new[] {
                new InputOutputColumnPair("total_bedrooms")
            },
            MissingValueReplacingEstimator.ReplacementMode.Mode)
        .Append(mlContext.Transforms.Categorical.OneHotEncoding(
            new[]
            {
                new InputOutputColumnPair("ocean_proximity")
            }, OneHotEncodingEstimator.OutputKind.Indicator));

The data that was attached is one of the items from my training set, so the format should be compatible with the pipeline.

I'm saving the pipeline using this command:

// Save Data Prep transformer
mlContext.Model.Save(pipelineEstimator.Fit(testData), testData.Schema, "data_preparation_pipeline.zip");

VadimPeczynski commented 5 months ago

Hi @luisquintanilla,

Can you reproduce the issue? Do you need more information? Is there a fix for it?