dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

AutoML - Concatenated columns should have the same type. Column 'IntColumn2' has type of Single, but expected column type is Byte. #6481

Open rzechu opened 1 year ago

rzechu commented 1 year ago

System Information (please complete the following information):

Describe the bug

I am trying to use Microsoft.ML.AutoML.MulticlassClassificationExperiment with data preloaded from SQL. Training fails with the following exception:

System.InvalidOperationException: 'Training failed with the exception: System.InvalidOperationException: Concatenated columns should have the same type. Column 'Doc_IntColumn2' has type of Single, but expected column type is Byte.
   at Microsoft.ML.Transforms.ColumnConcatenatingEstimator.CheckInputsAndMakeColumn(SchemaShape inputSchema, String name, String[] sources)
   at Microsoft.ML.Transforms.ColumnConcatenatingEstimator.GetOutputSchema(SchemaShape inputSchema)
   at Microsoft.ML.Data.EstimatorChain`1.GetOutputSchema(SchemaShape inputSchema)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.AutoML.SuggestedPipeline.ToEstimator(IDataView trainset, IDataView validationSet)
   at Microsoft.ML.AutoML.RunnerUtil.TrainAndScorePipeline[TMetrics](MLContext context, SuggestedPipeline pipeline, IDataView trainData, IDataView validData, String groupId, String labelColumn, IMetricsAgent`1 metricsAgent, ITransformer preprocessorTransform, FileInfo modelFileInfo, DataViewSchema modelInputSchema, IChannel logger)'

Stacktrace

   at Microsoft.ML.AutoML.Experiment`2.Execute()
   at Microsoft.ML.AutoML.ExperimentBase`2.Execute(ColumnInformation columnInfo, DatasetColumnInfo[] columns, IEstimator`1 preFeaturizer, IProgress`1 progressHandler, IRunner`1 runner)
   at Microsoft.ML.AutoML.ExperimentBase`2.ExecuteCrossValSummary(IDataView[] trainDatasets, ColumnInformation columnInfo, IDataView[] validationDatasets, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
   at Microsoft.ML.AutoML.ExperimentBase`2.Execute(IDataView trainData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
   at Intense.AI.AutoML.Trainers.MulticlassClassification.RunExperiment(IDataView trainingDataView, RunExperimentDto runExperimentDto) in xxxx.cs:line 63

Failing input

SELECT
    [Name],
    ISNULL(ShortStringColumn3, '') AS [ShortStringColumn3],
    CAST(IntColumn2 AS real) AS [IntColumn2],
    CAST(IntColumn3 AS real) AS [IntColumn3],
    CAST(ISNULL(IntColumn4, 0) AS real) AS [IntColumn4],
    CAST(IntColumn5 AS real) AS [LabelColumn]
FROM Documents

Name            ShortStringColumn3      IntColumn2     IntColumn3     IntColumn4     LabelColumn
--------------- ----------------------- -------------- -------------- -------------- ------------
Administrator                           73             0              0              73
User                                    8              0              1583           8
Administrator                           3              0              1583           73

On the other hand, AutoML trains without problems when the following dataset (Single + string columns) is used as input.

Working input

SELECT
    CAST(Customer AS real) AS [Customer],
    CAST([Name] AS varchar(50)) AS [Name],
    [IntColumn21] AS [LabelColumn]
FROM Documents

Customer      Name       LabelColumn
------------- ---------- -----------
2306          xyz        1
1666          dataaab    1
2323          dataaaa    1
2158          aaaaaac    1
2158          aaaaaab    0
2082          yyyyyy     0
2082          yyyyyy     0

To Reproduce

Steps to reproduce the behavior:

var loader = MLContext.Data.CreateDatabaseLoader(columns.ToArray());
var dbSource = new DatabaseSource(SqlClientFactory.Instance, connectionString, sqlQuery);
var iDataView = loader.Load(dbSource);
experiment.Execute(trainData: iDataView, labelColumnName: "LabelColumn", progressHandler: progressHandler);

Expected behavior

I understand there is a problem with concatenating string and int columns in the input data (the same happens if the int columns come first and the string columns later). Why can't AutoML concatenate all fields automatically?
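One possible cause, not confirmed anywhere in this thread, is that the column definitions passed to CreateDatabaseLoader declare a different numeric type (for example DbType.Byte) than the `real` values the query returns. The data view schema follows the loader's column definitions, not the SQL casts. A minimal sketch for checking this, assuming the System.Data.SqlClient package; the connection string and the column indexes are hypothetical and must match the actual SELECT order:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// Placeholder values (assumptions); replace with the real connection string.
string connectionString = "Server=localhost;Database=Example;Integrated Security=true";
string sqlQuery =
    "SELECT [Name], ISNULL(ShortStringColumn3, '') AS [ShortStringColumn3], " +
    "CAST(IntColumn2 AS real) AS [IntColumn2], CAST(IntColumn3 AS real) AS [IntColumn3], " +
    "CAST(ISNULL(IntColumn4, 0) AS real) AS [IntColumn4], " +
    "CAST(IntColumn5 AS real) AS [LabelColumn] FROM Documents";

// Declare every numeric column as Single so it matches the CAST(... AS real) in the query.
// Indexes follow the order of the columns in the SELECT list.
var columns = new[]
{
    new DatabaseLoader.Column("Name", DbType.String, 0),
    new DatabaseLoader.Column("ShortStringColumn3", DbType.String, 1),
    new DatabaseLoader.Column("IntColumn2", DbType.Single, 2),
    new DatabaseLoader.Column("IntColumn3", DbType.Single, 3),
    new DatabaseLoader.Column("IntColumn4", DbType.Single, 4),
    new DatabaseLoader.Column("LabelColumn", DbType.Single, 5),
};

DatabaseLoader loader = mlContext.Data.CreateDatabaseLoader(columns);
var dbSource = new DatabaseSource(SqlClientFactory.Instance, connectionString, sqlQuery);
IDataView dataView = loader.Load(dbSource);

// Print the schema that AutoML will see; mixed numeric types here would explain the error.
foreach (DataViewSchema.Column column in dataView.Schema)
    Console.WriteLine($"{column.Name}: {column.Type}");
```

If the printed schema shows a Byte column next to Single columns, aligning the declared DbType values is the first thing to try before changing the query itself.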

michaelgsharp commented 1 year ago

@luisquintanilla do you know if we have any work planned around the SQL stuff in the near future? I know it's been something that has been brought up several times.

luisquintanilla commented 1 year ago

Something I think sticks out here is that the old AutoML APIs are being used. I would recommend using the new APIs. Here is a guide on that.

Re: autoconcat features, the Featurizer can help you do that in the new API. The Featurizer works best when paired with the InferColumns method. Today that doesn't natively work with SQL but here's a sample that shows how you can get it to. I've also created an issue to enable SQL for InferColumns.

#6515
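The Featurizer approach mentioned above might look roughly like the following with a SQL source. This is a minimal sketch rather than the sample linked in the comment: it assumes the Microsoft.ML.AutoML 2.0 preview API, describes the columns by hand through ColumnInformation instead of InferColumns, and uses a hypothetical connection string and column layout based on the working query from the issue.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// Placeholder values (assumptions); replace with the real connection string and query.
string connectionString = "Server=localhost;Database=Example;Integrated Security=true";
string sqlQuery =
    "SELECT CAST(Customer AS real) AS [Customer], CAST([Name] AS varchar(50)) AS [Name], " +
    "CAST([IntColumn21] AS real) AS [LabelColumn] FROM Documents";

DatabaseLoader loader = mlContext.Data.CreateDatabaseLoader(
    new DatabaseLoader.Column("Customer", DbType.Single, 0),
    new DatabaseLoader.Column("Name", DbType.String, 1),
    new DatabaseLoader.Column("LabelColumn", DbType.Single, 2));
IDataView data = loader.Load(
    new DatabaseSource(SqlClientFactory.Instance, connectionString, sqlQuery));

// Column purposes are declared manually because InferColumns does not read SQL sources.
var columnInfo = new ColumnInformation { LabelColumnName = "LabelColumn" };
columnInfo.NumericColumnNames.Add("Customer");
columnInfo.TextColumnNames.Add("Name");

// Featurizer builds the "Features" column; the label is mapped to a key type
// because multiclass trainers expect key-typed labels.
SweepablePipeline pipeline = mlContext.Auto()
    .Featurizer(data, columnInformation: columnInfo)
    .Append(mlContext.Transforms.Conversion.MapValueToKey("LabelColumn"))
    .Append(mlContext.Auto().MultiClassification(labelColumnName: "LabelColumn"));

var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

AutoMLExperiment experiment = mlContext.Auto().CreateExperiment();
experiment
    .SetPipeline(pipeline)
    .SetMulticlassClassificationMetric(MulticlassClassificationMetric.MicroAccuracy, labelColumn: "LabelColumn")
    .SetTrainingTimeInSeconds(60)
    .SetDataset(split);

TrialResult result = await experiment.RunAsync();
Console.WriteLine($"Micro accuracy: {result.Metric:F3}");
```

The 60-second training budget and the MicroAccuracy metric are arbitrary choices for illustration; both would need tuning for a real dataset.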

rzechu commented 1 year ago

OK, thank you for the response. I saw the samples and docs, but about 95% of the examples use a text loader from CSV or predefined classes. I have to select database columns and their data types dynamically at run time. That's why I chose to build the SQL dynamically and detect the data types. It works well in over 90% of scenarios, with no need for a featurizer, concatenation, Fit, etc. But there are some minor cases where the API returns errors. I have tried the 2.0 preview API: a different error, but still an error (regarding vector columns). I will investigate it further.
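The dynamic query building described in this comment could be sketched as follows. The helper names BuildQuery and Quote and the (Name, IsNumeric) tuple shape are hypothetical and are not taken from the poster's code; the sketch only mirrors the CAST and ISNULL pattern used in the queries from the issue.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Columns chosen at run time; IsNumeric would come from the detected database type.
var selected = new List<(string Name, bool IsNumeric)>
{
    ("Name", false),
    ("IntColumn2", true),
};

string query = BuildQuery("Documents", selected, labelColumn: "IntColumn5");
Console.WriteLine(query);

// Wraps an identifier in brackets and escapes any closing bracket inside it.
static string Quote(string identifier) => "[" + identifier.Replace("]", "]]") + "]";

// Casts numeric columns to real (Single in ML.NET) and replaces NULLs,
// so every numeric feature reaches the loader with the same type.
static string BuildQuery(
    string table,
    IEnumerable<(string Name, bool IsNumeric)> columns,
    string labelColumn)
{
    var parts = columns
        .Select(c => c.IsNumeric
            ? $"CAST(ISNULL({Quote(c.Name)}, 0) AS real) AS {Quote(c.Name)}"
            : $"ISNULL({Quote(c.Name)}, '') AS {Quote(c.Name)}")
        .Append($"CAST({Quote(labelColumn)} AS real) AS [LabelColumn]");

    return $"SELECT {string.Join(", ", parts)} FROM {Quote(table)}";
}
```

Replacing NULL with 0 changes the meaning of missing values, so it is a modeling decision rather than a neutral default. Bracket quoting alone is also no substitute for validating the selected names against the database schema.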

luisquintanilla commented 1 year ago

I have tried 2.0 preview API. Different error but still error (regarding vector columns)

As you run into issues, please file them here so we can investigate. Thanks.

rzechu commented 1 year ago

A similar issue (SQL input data), but with columns of other types:

#6573