Open rzechu opened 1 year ago
How many distinct values do you have for LabelName
column? Based on exception message it seems that your label column only have one distinct value?
How many distinct values do you have for
LabelName
column? Based on exception message it seems that your label column only have one distinct value?
3 values 2 distinct (a, b, a) https://user-images.githubusercontent.com/45091660/219575314-5246a738-d1c2-4e42-9c2a-8f90225fdfbd.png
@rzechu Is it possible for you to export part of your SQL database as csv and give it to us? We want to try reproducing the issue you have on our side
And is your dataset imbalance in labels? Could it be possible that one label has too less instance and it only appears in test dataset rather than train dataset?
@rzechu Is it possible for you to export part of your SQL database as csv and give it to us? We want to try reproducing the issue you have on our side
CREATE TABLE [dbo].[AI]( [DecimalColumn1] [decimal](20, 4) NULL, [IntColumn2] [int] NULL, [IntColumn3] [int] NULL, [IntColumn4] [int] NULL, [LongStringColumn1] [nvarchar](max) NULL, [Name] [nvarchar](255) NULL, [ShortStringColumn3] [nvarchar](128) NULL, [LabelColumn] [nvarchar](max) NULL ) GO INSERT [dbo].[Ai] ([DecimalColumn1], [IntColumn2], [IntColumn3], [IntColumn4], [LongStringColumn1], [Name], [ShortStringColumn3], [LabelColumn]) VALUES (NULL, 73, 0, NULL, N'Paryż', N'Delegacja zagraniczna | Administrator Systemu', NULL, N'Konferencja') GO INSERT [dbo].[Ai] ([DecimalColumn1], [IntColumn2], [IntColumn3], [IntColumn4], [LongStringColumn1], [Name], [ShortStringColumn3], [LabelColumn]) VALUES (NULL, 8, 0, 1583, N'Warszawa', N'Delegacja krajowa | Alicja Konieczny', NULL, N'Szkolenie') GO INSERT [dbo].[Ai] ([DecimalColumn1], [IntColumn2], [IntColumn3], [IntColumn4], [LongStringColumn1], [Name], [ShortStringColumn3], [LabelColumn]) VALUES (NULL, 73, 0, 1583, N'Warszawa', N'Delegacja krajowa | Administrator System', NULL, N'Konferencja') GO
SELECT CAST(DecimalColumn1 as real) as [DecimalColumn1], CAST(IntColumn2 as real) as [IntColumn2], CAST(IntColumn3 as real) as [IntColumn3], CAST(IntColumn4 as real) as [IntColumn4], [LongStringColumn1], [Name] as [Name], [ShortStringColumn3] as [ShortStringColumn3], [LabelColumn] as [LabelColumn] FROM IAi
Its just 3 row example - no more rows
Not diving data into datasets I am using AutoML and its done behind the scene as I know
var connectionString = ConnectionString; var columns = new SchemaTools().MapDataTableToDataBaseLoaderColumns(sqlQuery, connectionString); var loader = MLContext.Data.CreateDatabaseLoader(columns.ToArray()); var dbSource = new DatabaseSource(SqlClientFactory.Instance, connectionString, sqlQuery); var iDataView = loader.Load(dbSource); var experiment = MLContext.Auto().CreateMulticlassClassificationExperiment(multiClassExperimentSettings); var experimentResult = experiment.Execute(trainData: iDataView, labelColumnName: runExperimentDto.LabelColumnName, progressHandler: progressHandler);
using Microsoft.ML.Data;
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
public class SchemaTools
{
private readonly static Dictionary<Type, DbType> _TypeToSqlDataType = new Dictionary<Type, DbType>()
{
{ typeof(String), DbType.String },
{ typeof(Int32), DbType.Int32 },
{ typeof(Decimal),DbType.Single },
{ typeof(Double), DbType.Single },
{ typeof(Single), DbType.Single },
{ typeof(DateTime), DbType.DateTime },
{ typeof(Byte[]), DbType.Binary },
};
private static DataTable GetDataBaseSchemaColumns(string sqlQuery, string connectionString)
{
using (var connection = new SqlConnection(connectionString))
using (var command = connection.CreateCommand())
{
command.CommandText = sqlQuery;
connection.Open();
return command.ExecuteReader(CommandBehavior.SchemaOnly)
.GetSchemaTable();
}
}
public IList<DatabaseLoader.Column> MapDataTableToDataBaseLoaderColumns(string sqlQuery, string connectionString)
{
var dataSchema = GetDataBaseSchemaColumns(sqlQuery, connectionString);
return dataSchema.Rows.Cast<DataRow>()
.Select(row => MapDataTypeToSqlDataType(row))
.Where(w => w.Type != DbType.Binary)
.ToList();
}
private DatabaseLoader.Column MapDataTypeToSqlDataType(DataRow dataRow)
{
DbType dbType;
var dataType = dataRow.Field<Type>("DataType");
var columnName = dataRow.Field<string>("ColumnName");
if (_TypeToSqlDataType.TryGetValue(dataType, out dbType))
{
return new DatabaseLoader.Column()
{
Type = dbType,
Name = columnName
};
}
throw new NotImplementedException($"{dataType} can't be mapped to SqlDbType");
}
If there's only three rows of dataset, it's possible that after divided the training set only contains one label value and validation set contains another. (training set contains row 0 and row 2, validation set contains row 1)
To validate that assumption, can you try setting validation dataset using the same IDataView as training dataset which avoids the train-validation split by AutoML API.
If there's only three rows of dataset, it's possible that after divided the training set only contains one label value and validation set contains another. (training set contains row 0 and row 2, validation set contains row 1)
To validate that assumption, can you try setting validation dataset using the same IDataView as training dataset which avoids the train-validation split by AutoML API.
You are right! I used this SQL query
SELECT
CAST(DecimalColumn1 as real) as [DecimalColumn1],
CAST(IntColumn2 as real) as [IntColumn2],
CAST(IntColumn3 as real) as [IntColumn3],
CAST(IntColumn4 as real) as [IntColumn4],
[LongStringColumn1],
[Name] as [Name],
[ShortStringColumn3] as [ShortStringColumn3],
[LabelColumn] as [LabelColumn]
FROM IAi
UNION ALL
SELECT
CAST(NULL as real) as [DecimalColumn1],
CAST(8 as real) as [IntColumn2],
CAST(0 as real) as [IntColumn3],
CAST(NULL as real) as [IntColumn4],
'Kraków' as [LongStringColumn1],
'Delegacja krajowa | Alicja Konieczny' as [Name],
NULL as [ShortStringColumn3],
'Szkolenie' as [LabelColumn]
UNION ALL
SELECT
CAST(NULL as real) as [DecimalColumn1],
CAST(8 as real) as [IntColumn2],
CAST(0 as real) as [IntColumn3],
CAST(NULL as real) as [IntColumn4],
'Kraków' as [LongStringColumn1],
'Delegacja krajowa | Alicja Konieczny' as [Name],
NULL as [ShortStringColumn3],
'Szkolenie' as [LabelColumn]
UNION ALL
SELECT
CAST(NULL as real) as [DecimalColumn1],
CAST(8 as real) as [IntColumn2],
CAST(0 as real) as [IntColumn3],
CAST(NULL as real) as [IntColumn4],
'Kraków' as [LongStringColumn1],
'Delegacja krajowa | Alicja Konieczny' as [Name],
NULL as [ShortStringColumn3],
'Szkolenie' as [LabelColumn]
UNION ALL
SELECT
CAST(NULL as real) as [DecimalColumn1],
CAST(73 as real) as [IntColumn2],
CAST(0 as real) as [IntColumn3],
CAST(NULL as real) as [IntColumn4],
'Berlin' as [LongStringColumn1],
'Delegacja zagraniczna | Administrator Systemu' as [Name],
NULL as [ShortStringColumn3],
'Konferencja' as [LabelColumn]
And training was done without any doubts. Maybe there's option to add "human readable" error message in upcoming version?
System Information (please complete the following information):
Describe the bug One or more errors occurred. (Schema mismatch for score column 'Score': expected vector of two or more items of type Single, got Vector<Single, 1> Arg_ParamName_Name) on AutoML Multiclasstraining
To Reproduce Steps to reproduce the behavior:
iDataView.Preview()
Error:
InnerException
Expected behavior This case should work (I am using same scenario for many other columns and only few cases gives this error - propably related to nvarchar(max) columns or some kind of mix columns I am many times using 2 varchars + 2 ints as source with int/varchar output and work without any doubts Or should be understandable workaround