dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.04k stars 1.88k forks source link

One or more errors occurred. (Schema mismatch for score column 'Score': expected vector of two or more items of type Single, got Vector<Single, 1> Arg_ParamName_Name) #6573

Open rzechu opened 1 year ago

rzechu commented 1 year ago

System Information (please complete the following information):

Describe the bug One or more errors occurred. (Schema mismatch for score column 'Score': expected vector of two or more items of type Single, got Vector<Single, 1> Arg_ParamName_Name) on AutoML Multiclasstraining

SELECT  
CAST(DecimalColumn1 as real) as [DecimalColumn1], 
CAST(IntColumn2 as real) as [IntColumn2], 
CAST(IntColumn3 as real) as [IntColumn3], 
CAST(IntColumn4 as real) as [IntColumn4], 
cast([LongStringColumn1] as varchar(2000)) as [LongStringColumn1],  /*nvarchar(max) */
[Name], /*nvarchar(128) */
[ShortStringColumn3] as [ShortStringColumn3], /*nvarchar(128) */
cast([LongStringColumn3] as varchar(2000)) as [LabelColumn]  /*nvarchar(max) */
FROM Documents 

image

To Reproduce Steps to reproduce the behavior:

  1. MulticlassClassification
  2. Loading data from MSSQL Database
    var loader = MLContext.Data.CreateDatabaseLoader(columns.ToArray());
    var dbSource = new DatabaseSource(SqlClientFactory.Instance, connectionString, sqlQuery);
    var iDataView = loader.Load(dbSource);
    columns
    [0]: "Name=DecimalColumn1      Type=Single"
    [1]: "Name=IntColumn2               Type=Single"
    [2]: "Name=IntColumn3               Type=Single"
    [3]: "Name=IntColumn4               Type=Single"
    [4]: "Name=LongStringColumn1  Type=String"
    [5]: "Name=Name                        Type=String"
    [6]: "Name=ShortStringColumn3 Type=String"
    [7]: "Name=LabelColumn             Type=String"
iDataView
    [0]: "Index=0 Name=DecimalColumn1     Type=Single IsHidden=False"
    [1]: "Index=1 Name=IntColumn2               Type=Single IsHidden=False"
    [2]: "Index=2 Name=IntColumn3               Type=Single IsHidden=False"
    [3]: "Index=3 Name=IntColumn4               Type=Single IsHidden=False"
    [4]: "Index=4 Name=LongStringColumn1  Type=String IsHidden=False"
    [5]: "Index=5 Name=Name                         Type=String IsHidden=False"
    [6]: "Index=6 Name=ShortStringColumn3 Type=String IsHidden=False"
    [7]: "Index=7 Name=LabelColumn             Type=String IsHidden=False"

iDataView.Preview() image

  1. image

    var experiment = MLContext.Auto().CreateMulticlassClassificationExperiment(multiClassExperimentSettings);
    var experimentResult = experiment.Execute(trainData: iDataView, labelColumnName: runExperimentDto.LabelColumnName, progressHandler: progressHandler);

Error:

Schema mismatch for score column 'Score': expected vector of two or more items of type Single, got Vector<Single, 1> (Parameter 'schema')
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Microsoft.ML.AutoML.AutoMLExperiment.Run()
   at Microsoft.ML.AutoML.MulticlassClassificationExperiment.Execute(IDataView trainData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
   at AI.AutoML.Trainers.MulticlassClassification.RunExperiment(IDataView trainingDataView, RunExperimentDto runExperimentDto) in C:\.....\_myapp.cs:line 63

InnerException

Schema mismatch for score column 'Score': expected vector of two or more items of type Single, got Vector<Single, 1> (Parameter 'schema')

   at Microsoft.ML.Data.MulticlassClassificationEvaluator.CheckScoreAndLabelTypes(RoleMappedSchema schema)
   at Microsoft.ML.Data.EvaluatorBase`1.CheckColumnTypes(RoleMappedSchema schema)
   at Microsoft.ML.Data.EvaluatorBase`1.Microsoft.ML.Data.IEvaluator.Evaluate(RoleMappedData data)
   at Microsoft.ML.Data.MulticlassClassificationEvaluator.Evaluate(IDataView data, String label, String score, String predictedLabel)
   at Microsoft.ML.MulticlassClassificationCatalog.<>c__DisplayClass6_0.<CrossValidate>b__0(CrossValidationResult x)
   at System.Linq.Enumerable.SelectArrayIterator`2.ToArray()
   at Microsoft.ML.AutoML.MulticlassClassificationRunner.Run(TrialSettings settings)
   at System.Threading.Tasks.Task`1.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__272_0(Object obj)
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location ---
   at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)
--- End of stack trace from previous location ---
   at Microsoft.ML.AutoML.AutoMLExperiment.<RunAsync>d__26.MoveNext()

image

Expected behavior This case should work (I am using same scenario for many other columns and only few cases gives this error - propably related to nvarchar(max) columns or some kind of mix columns I am many times using 2 varchars + 2 ints as source with int/varchar output and work without any doubts Or should be understandable workaround

LittleLittleCloud commented 1 year ago

How many distinct values do you have for LabelName column? Based on exception message it seems that your label column only have one distinct value?

rzechu commented 1 year ago

How many distinct values do you have for LabelName column? Based on exception message it seems that your label column only have one distinct value?

3 values 2 distinct (a, b, a) https://user-images.githubusercontent.com/45091660/219575314-5246a738-d1c2-4e42-9c2a-8f90225fdfbd.png

LittleLittleCloud commented 1 year ago

@rzechu Is it possible for you to export part of your SQL database as csv and give it to us? We want to try reproducing the issue you have on our side

And is your dataset imbalance in labels? Could it be possible that one label has too less instance and it only appears in test dataset rather than train dataset?

rzechu commented 1 year ago

@rzechu Is it possible for you to export part of your SQL database as csv and give it to us? We want to try reproducing the issue you have on our side


CREATE TABLE [dbo].[AI](
[DecimalColumn1] [decimal](20, 4) NULL,
[IntColumn2] [int] NULL,
[IntColumn3] [int] NULL,
[IntColumn4] [int] NULL,
[LongStringColumn1] [nvarchar](max) NULL,
[Name] [nvarchar](255) NULL,
[ShortStringColumn3] [nvarchar](128) NULL,
[LabelColumn] [nvarchar](max) NULL
) 
GO
INSERT [dbo].[Ai] ([DecimalColumn1], [IntColumn2], [IntColumn3], [IntColumn4], [LongStringColumn1], [Name], [ShortStringColumn3], [LabelColumn]) VALUES (NULL, 73, 0, NULL, N'Paryż', N'Delegacja zagraniczna | Administrator Systemu', NULL, N'Konferencja')
GO
INSERT [dbo].[Ai] ([DecimalColumn1], [IntColumn2], [IntColumn3], [IntColumn4], [LongStringColumn1], [Name], [ShortStringColumn3], [LabelColumn]) VALUES (NULL, 8, 0, 1583, N'Warszawa', N'Delegacja krajowa | Alicja Konieczny', NULL, N'Szkolenie')
GO
INSERT [dbo].[Ai] ([DecimalColumn1], [IntColumn2], [IntColumn3], [IntColumn4], [LongStringColumn1], [Name], [ShortStringColumn3], [LabelColumn]) VALUES (NULL, 73, 0, 1583, N'Warszawa', N'Delegacja krajowa | Administrator System', NULL, N'Konferencja')
GO

SELECT CAST(DecimalColumn1 as real) as [DecimalColumn1], CAST(IntColumn2 as real) as [IntColumn2], CAST(IntColumn3 as real) as [IntColumn3], CAST(IntColumn4 as real) as [IntColumn4], [LongStringColumn1], [Name] as [Name], [ShortStringColumn3] as [ShortStringColumn3], [LabelColumn] as [LabelColumn] FROM IAi


Its just 3 row example - no more rows
Not diving data into datasets I am using AutoML and its done behind the scene as I know

var connectionString = ConnectionString; var columns = new SchemaTools().MapDataTableToDataBaseLoaderColumns(sqlQuery, connectionString); var loader = MLContext.Data.CreateDatabaseLoader(columns.ToArray()); var dbSource = new DatabaseSource(SqlClientFactory.Instance, connectionString, sqlQuery); var iDataView = loader.Load(dbSource); var experiment = MLContext.Auto().CreateMulticlassClassificationExperiment(multiClassExperimentSettings); var experimentResult = experiment.Execute(trainData: iDataView, labelColumnName: runExperimentDto.LabelColumnName, progressHandler: progressHandler);


    using Microsoft.ML.Data;
    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.Linq;

    public class SchemaTools
    {
        private readonly static Dictionary<Type, DbType> _TypeToSqlDataType = new Dictionary<Type, DbType>()
        {
            { typeof(String), DbType.String },
            { typeof(Int32), DbType.Int32 },
            { typeof(Decimal),DbType.Single },
            { typeof(Double), DbType.Single },
            { typeof(Single), DbType.Single },
            { typeof(DateTime), DbType.DateTime },
            { typeof(Byte[]), DbType.Binary },
        };

        private static DataTable GetDataBaseSchemaColumns(string sqlQuery, string connectionString)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = connection.CreateCommand())
            {
                command.CommandText = sqlQuery;
                connection.Open();

                return command.ExecuteReader(CommandBehavior.SchemaOnly)
                    .GetSchemaTable();
            }
        }

        public IList<DatabaseLoader.Column> MapDataTableToDataBaseLoaderColumns(string sqlQuery, string connectionString)
        {
            var dataSchema = GetDataBaseSchemaColumns(sqlQuery, connectionString);

            return dataSchema.Rows.Cast<DataRow>()
                    .Select(row => MapDataTypeToSqlDataType(row))
                    .Where(w => w.Type != DbType.Binary)
                    .ToList();
        }

        private DatabaseLoader.Column MapDataTypeToSqlDataType(DataRow dataRow)
        {
            DbType dbType;
            var dataType = dataRow.Field<Type>("DataType");
            var columnName = dataRow.Field<string>("ColumnName");
            if (_TypeToSqlDataType.TryGetValue(dataType, out dbType))
            {
                return new DatabaseLoader.Column()
                {
                    Type = dbType,
                    Name = columnName
                };
            }

            throw new NotImplementedException($"{dataType} can't be mapped to SqlDbType");
        }
LittleLittleCloud commented 1 year ago

If there's only three rows of dataset, it's possible that after divided the training set only contains one label value and validation set contains another. (training set contains row 0 and row 2, validation set contains row 1)

To validate that assumption, can you try setting validation dataset using the same IDataView as training dataset which avoids the train-validation split by AutoML API.

rzechu commented 1 year ago

If there's only three rows of dataset, it's possible that after divided the training set only contains one label value and validation set contains another. (training set contains row 0 and row 2, validation set contains row 1)

To validate that assumption, can you try setting validation dataset using the same IDataView as training dataset which avoids the train-validation split by AutoML API.

You are right! I used this SQL query

SELECT 
CAST(DecimalColumn1 as real) as [DecimalColumn1], 
CAST(IntColumn2 as real) as [IntColumn2], 
CAST(IntColumn3 as real) as [IntColumn3], 
CAST(IntColumn4 as real) as [IntColumn4], 
[LongStringColumn1], 
[Name] as [Name], 
[ShortStringColumn3] as [ShortStringColumn3], 
[LabelColumn] as [LabelColumn] 
FROM IAi

UNION ALL
SELECT
CAST(NULL as real) as [DecimalColumn1], 
CAST(8 as real) as [IntColumn2], 
CAST(0 as real) as [IntColumn3], 
CAST(NULL as real) as [IntColumn4], 
'Kraków' as [LongStringColumn1], 
'Delegacja krajowa | Alicja Konieczny' as [Name], 
NULL as [ShortStringColumn3], 
'Szkolenie' as [LabelColumn]

UNION ALL
SELECT
CAST(NULL as real) as [DecimalColumn1], 
CAST(8 as real) as [IntColumn2], 
CAST(0 as real) as [IntColumn3], 
CAST(NULL as real) as [IntColumn4], 
'Kraków' as [LongStringColumn1], 
'Delegacja krajowa | Alicja Konieczny' as [Name], 
NULL as [ShortStringColumn3], 
'Szkolenie' as [LabelColumn]

UNION ALL
SELECT
CAST(NULL as real) as [DecimalColumn1], 
CAST(8 as real) as [IntColumn2], 
CAST(0 as real) as [IntColumn3], 
CAST(NULL as real) as [IntColumn4], 
'Kraków' as [LongStringColumn1], 
'Delegacja krajowa | Alicja Konieczny' as [Name], 
NULL as [ShortStringColumn3], 
'Szkolenie' as [LabelColumn]

UNION ALL
SELECT
CAST(NULL as real) as [DecimalColumn1], 
CAST(73 as real) as [IntColumn2], 
CAST(0 as real) as [IntColumn3], 
CAST(NULL as real) as [IntColumn4], 
'Berlin' as [LongStringColumn1], 
'Delegacja zagraniczna | Administrator Systemu' as [Name], 
NULL as [ShortStringColumn3], 
'Konferencja' as [LabelColumn]

image

And training was done without any doubts. Maybe there's option to add "human readable" error message in upcoming version?