Closed wil70 closed 2 years ago
@wil70
The log in VS Model Builder indicates the failure is from restoring NuGet packages, not from AutoML, and it should be ignorable.
And the dataset looks imbalanced: the label `a+0` appears much less often than the other two labels. In fact, `a+0` only appears 2 times, which might cause the label to be missing from the training set (after the train/test split, `a+0` may end up entirely outside the training data). After I manually balanced the dataset by adding more `a+0` rows, the training result looks better.
Shouldn't the Confusion Matrix look like this?
- class 1: 11, 0, 0,
- class 2: 0, 11, 0,
- class 3: 0, 0, 2,
It depends on what your test dataset looks like. I assume your test dataset only contains 3 pieces of data?
Thanks @LittleLittleCloud ,
a) Yes, I should have put the same number of inputs for each label, or close enough. Good explanation of training vs. testing datasets, TY!
b) For the confusion matrix, let's pretend the dataset is the following (first row is the header, with eleven `a-1`, eleven `a+1`, and two `a+0`):
```
c10,c11
-1,a-1
1,a+1
-1,a-1
1,a+1
0,a+0
1,a+1
1,a+1
-1,a-1
1,a+1
1,a+1
-1,a-1
-1,a-1
-1,a-1
1,a+1
-1,a-1
-1,a-1
-1,a-1
0,a+0
1,a+1
1,a+1
1,a+1
1,a+1
-1,a-1
-1,a-1
```
Shouldn't the confusion matrix look like this if there is a perfect solution?
class 1 'a+1': **11**, 0, 0,
class 2 'a-1': 0, **11**, 0,
class 3 'a+0': 0, 0, **2**,
I guess I'm not understanding the 3 in
- class 1: 0, 0, 0,
- class 2: 0, 3, 0,
- class 3: 0, 0, 0,
thanks
Wil
@wil70
According to that per-class precision you shared, it looks like the dataset used to calculate that matrix has 0 class-1, 3 class-2, and 0 class-3 rows, which is 3 pieces of data in total. So did you split your dataset into train and test, using train to train a model and calculating the per-class precision matrix on the test dataset?
Thanks @LittleLittleCloud
Yeah, so this is incorrect, isn't it? As we know, there are eleven `a+1`, eleven `a-1`, and two `a+0`. Unless, as you said, there is a default split (training vs. testing).
I'm new to ML.NET. I'm trying to evaluate whether ML.NET will work out for me with huge datasets later. I wrote this tiny bit of code. Please set the `file` path to the dataset we mentioned above (with the header `c10,c11`).
I tried all the available multiclass classification trainers for `trainerID`, but I always got the confusion matrix with the 3. I do not know how to split the data from the C# API yet; maybe there is a default setup?
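(For what it's worth, ML.NET does expose an explicit split through `MLContext.Data.TrainTestSplit`. A minimal sketch, assuming the `ModelInput2` class from this thread; the file path and the 20% test fraction are placeholder choices, not from the thread:)

```csharp
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);

// Load the same CSV as above (path is a placeholder).
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput2>(
    "data.csv", separatorChar: ',', hasHeader: true, trimWhitespace: true);

// Hold out 20% of the rows as a test set; the rest can be used for training.
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
IDataView trainSet = split.TrainSet;
IDataView testSet = split.TestSet;
```

Note that with only 24 rows and 2 `a+0` examples, any random split can still drop the rare label entirely from one side of the split.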
```csharp
var mlContext = new MLContext();
IDataView trainingData = mlContext.Data.LoadFromTextFile<ModelInput2>(
    file,
    separatorChar: ',', hasHeader: true, trimWhitespace: true);

var cts = new CancellationTokenSource();
var experimentSettings = new MulticlassExperimentSettings();
experimentSettings.MaxExperimentTimeInSeconds = 9 * 3600;
experimentSettings.CancellationToken = cts.Token;
experimentSettings.CacheBeforeTrainer = CacheBeforeTrainer.Auto;
experimentSettings.Trainers.Clear();
experimentSettings.Trainers.Add(trainerID);
experimentSettings.CacheDirectoryName = null;

Console.WriteLine("Processing " + trainerID.ToString());
MulticlassClassificationExperiment experiment = mlContext.Auto().CreateMulticlassClassificationExperiment(experimentSettings);
ExperimentResult<MulticlassClassificationMetrics> experimentResult = experiment.Execute(trainingData, "Action");

if (experimentResult != null && experimentResult.BestRun != null)
{
    MulticlassClassificationMetrics metrics = experimentResult.BestRun.ValidationMetrics;
    Console.WriteLine($"BestRun TrainerName: {experimentResult.BestRun.TrainerName}");
    Console.WriteLine($" - MicroAccuracy: {metrics.MicroAccuracy}");
    Console.WriteLine($" - MacroAccuracy: {metrics.MacroAccuracy}");
    Console.WriteLine($" - LogLoss: {metrics.LogLoss}");
    Console.WriteLine($" - LogLossReduction: {metrics.LogLossReduction}");

    Console.WriteLine($"\nClass log loss:");
    int i = 1;
    foreach (double d in metrics.PerClassLogLoss)
    {
        Console.WriteLine($" - class {i}: {d}");
        i++;
    }

    i = 1;
    Console.WriteLine($"\nConfusionMatrix.PerClassPrecision:");
    foreach (double d in metrics.ConfusionMatrix.PerClassPrecision)
    {
        Console.WriteLine($"class {i}: {d}");
        i++;
    }

    i = 1;
    Console.WriteLine($"\nConfusionMatrix.Counts:");
    foreach (IReadOnlyList<double> row in metrics.ConfusionMatrix.Counts)
    {
        Console.Write($" - class {i}: ");
        foreach (double count in row)
        {
            Console.Write($"{count}, ");
        }
        i++;
        Console.WriteLine();
    }
}
```
And here is the class that models the input data:
```csharp
public class ModelInput2
{
    //[LoadColumn(0), NoColumn]
    //public float _459 { get; set; }
    //[LoadColumn(2, 9), NoColumn]
    //public float _data { get; set; }
    [LoadColumn(0)] // c10
    public float _460 { get; set; }
    [LoadColumn(1)] //, ColumnName("c11")]
    public string Action { get; set; }
}
```
Thanks for your help
Wil
The matrix from `MulticlassExperiment` is evaluated on the validation dataset. In your case, since you are running cross-validation (the default setting for small datasets), 10% of the entire training dataset is held out as the validation dataset, which is 24 * 0.1 ≈ 3 pieces of data.
Super, TY, that explains why. Is there a way to get the confusion matrix of the training dataset vs. the validation dataset?
Well, you can always re-evaluate your model with another dataset:

```csharp
IDataView trainData = ...;   // any dataset with the same schema
ITransformer model = ...;    // the trained model
var eval = model.Transform(trainData);
var metric = mlContext.MulticlassClassification.Evaluate(eval, labelColumnName: "c11");
// metric.ConfusionMatrix
```
Cool, I'm going to close this issue since it seems that the question has been resolved. Feel free to ping me if you have any other questions.
yes we can close it - TY!
Hi @LittleLittleCloud
For my understanding: my goal is to ignore this new column `c09` during training and consumption. I thought using the `NoColumn` attribute would do the trick, but this is hard to verify?
```csharp
public class ModelInput
{
    [LoadColumn(0), ColumnName(@"c09"), NoColumn]
    public float C09 { get; set; }
    [LoadColumn(1), ColumnName(@"c10")]
    public float C10 { get; set; }
    [LoadColumn(2), ColumnName(@"c11")]
    public string C11 { get; set; }
}
```
I thought doing an AutoML training with the wizard and specifically marking the column as Hidden would show me whether `NoColumn` is the right way. I tried not having `LoadColumn(...)`, but that triggers an error as soon as I use `IDataView testData = mlContext.Data.LoadFromTextFile`.
But the MyCode.consumption.cs file generated by the AutoML wizard contains this:
```csharp
/// <summary>
/// model input class for MLModel1.
/// </summary>
#region model input class
public class ModelInput
{
    [ColumnName(@"c09")] // Note: NoColumn is not in the automatically generated code?
    public float C09 { get; set; }
    [ColumnName(@"c10")]
    public float C10 { get; set; }
    [ColumnName(@"c11")]
    public string C11 { get; set; }
}
#endregion
```
```csharp
/// <summary>
/// model output class for MLModel1.
/// </summary>
#region model output class
public class ModelOutput // Note: can ModelOutput inherit from ModelInput instead of repeating the fields? Asking because in some cases you might have thousands of fields...
{
    [ColumnName(@"c09")] // Note: NoColumn is not in the automatically generated code?
    public float C09 { get; set; }
    [ColumnName(@"c10")]
    public float C10 { get; set; }
    [ColumnName(@"c11")]
    public uint C11 { get; set; }
    [ColumnName(@"Features")]
    public float[] Features { get; set; }
    [ColumnName(@"PredictedLabel")]
    public string PredictedLabel { get; set; }
    [ColumnName(@"Score")]
    public float[] Score { get; set; }
}
```
The automatically generated code doesn't seem to ignore `c09` either?
```csharp
private static string MLNetModelPath = Path.GetFullPath("MLModel1.zip");

public static readonly Lazy<PredictionEngine<ModelInput, ModelOutput>> PredictEngine =
    new Lazy<PredictionEngine<ModelInput, ModelOutput>>(() => CreatePredictEngine(), true);

/// <summary>
/// Use this method to predict on <see cref="ModelInput"/>.
/// </summary>
/// <param name="input">model input.</param>
/// <returns><seealso cref="ModelOutput"/></returns>
public static ModelOutput Predict(ModelInput input)
{
    var predEngine = PredictEngine.Value;
    return predEngine.Predict(input);
}

private static PredictionEngine<ModelInput, ModelOutput> CreatePredictEngine()
{
    var mlContext = new MLContext();
    ITransformer mlModel = mlContext.Model.Load(MLNetModelPath, out var _);
    return mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);
}
```
I guess I can always generate code to add an mlContext transform, but a `NoColumn` or Hidden attribute seems like a very good shortcut when you have thousands of columns.
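(As a possible alternative to attributes: ML.NET also ships a `DropColumns` transform that removes a column inside the pipeline, so `c09` could be loaded but dropped before featurization. A sketch, assuming the `ModelInput` class from this thread and a placeholder file path:)

```csharp
using Microsoft.ML;

var mlContext = new MLContext();

// Load all three columns, including the unwanted c09.
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(
    "data.csv", separatorChar: ',', hasHeader: true, trimWhitespace: true);

// Drop c09 early, so downstream transforms and trainers never see it.
IDataView withoutC09 = mlContext.Transforms.DropColumns("c09")
    .Fit(data)
    .Transform(data);
```

This keeps the input class unchanged while still excluding the column from training.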
I keep getting the exception `System.ArgumentOutOfRangeException: 'Score column 'Score' not found Parameter name: schema'` and I'm not able to figure out how to make it work:
```csharp
Stopwatch stopw = new Stopwatch();
stopw.Start();
try
{
    var mlContext = new MLContext();
    IDataView testData = mlContext.Data.LoadFromTextFile<MLModel1.ModelInput>(
        "S:\\CATS\\files\\data_analysis\\output\\AggregatedFile\\small_2.csv",
        separatorChar: ',', hasHeader: true, trimWhitespace: true);
    DataView trainData = new DataView();
    ITransformer mlModel = mlContext.Model.Load(Path.GetFullPath(@"G:\Users\Wilhelm\dev\MachineLearning\ML1\MLModel1.zip"), out var _);
    //return mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);
    //var eval = mlModel.Transform(testData);
    MulticlassClassificationMetrics metric = mlContext.MulticlassClassification.Evaluate(testData, "c11");
    Console.WriteLine(metric.ConfusionMatrix.GetFormattedConfusionTable()); //PrintConfusionMatrix("LightGbm", metric);
}
catch (Exception e)
{
    Console.Out.WriteLine(e.Message);
    if (e.InnerException != null) Console.Out.WriteLine(e.InnerException.Message);
    if (e.StackTrace != null) Console.Out.WriteLine(e.StackTrace);
}
finally
{
    stopw.Stop();
    Console.Out.WriteLine("\nDuration: " + stopw.Elapsed);
}
```
Exception:

```
System.ArgumentOutOfRangeException
  HResult=0x80131502
  Message=Score column 'Score' not found
  Parameter name: schema
  Source=Microsoft.ML.Core
  StackTrace:
   at Microsoft.ML.Data.RoleMappedSchema.MapFromNames(DataViewSchema schema, IEnumerable`1 roles, Boolean opt)
   at Microsoft.ML.Data.RoleMappedData..ctor(IDataView data, Boolean opt, KeyValuePair`2[] roles)
   at Microsoft.ML.Data.MulticlassClassificationEvaluator.Evaluate(IDataView data, String label, String score, String predictedLabel)
   at ML1.Program.Main(String[] args) in G:\Users\Wilhelm\dev\MachineLearning\ML1\Program.cs:line 35

This exception was originally thrown at this call stack:
  [External Code]
  ML1.Program.Main(string[]) in Program.cs
```
Thanks a lot for your help
Wil
How to ignore a column: just don't mark that column with `LoadColumn`, and that should be it. If you don't want column `c09`, simply don't put the `LoadColumn` attribute on it.
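(A sketch of what that reply describes, using the column names from this thread; treat the class shape as illustrative, not the generated code:)

```csharp
public class ModelInput
{
    // c09 is simply not declared (or declared without LoadColumn),
    // so the text loader never reads it from the file.
    [LoadColumn(1), ColumnName(@"c10")]
    public float C10 { get; set; }
    [LoadColumn(2), ColumnName(@"c11")]
    public string C11 { get; set; }
}
```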
> I tried not having "LoadColumn(...)" but that trigger an error soon as I use IDataView testData = mlContext.Data.LoadFromTextFile (....)

What error do you have?

> System.ArgumentOutOfRangeException HResult=0x80131502 Message=Score column 'Score' not found
The error basically says it can't find `Score` in `testData`, which is true, right? You need to pass the evaluation result `eval` to the `Evaluate` API.
Thanks @LittleLittleCloud
1) Thanks! 2) The problem is that you need ModelInput (defined here: https://github.com/dotnet/machinelearning/issues/6309#issuecomment-1237400143) for reading input data from the CSV file (note: Features, PredictedLabel, and Score are not in that class), but then you need those 3 fields to evaluate, so I have ModelOutput (which inherits from ModelInput but adds those 3 columns) to evaluate... I could create a new input file with the 3 extra fields set to empty or default values, but imagine doing that for a 330 GB or 2 TB file...
Basically, how do I feed ModelOutput to `mlContext.MulticlassClassification.Evaluate(testData, "c11")`, knowing that testData has been created with ModelInput?
Note: I believe I tried adding those 3 fields to ModelInput without marking them with `LoadColumn`, but if I recall well, it failed.
Thank for your help
Wil
hi @wil70
I might not have expressed myself clearly in topic 2. According to your response, you are using the following code to evaluate the model:
```csharp
var mlContext = new MLContext();
IDataView testData = mlContext.Data.LoadFromTextFile<MLModel1.ModelInput>(
    "S:\\CATS\\files\\data_analysis\\output\\AggregatedFile\\small_2.csv",
    separatorChar: ',', hasHeader: true, trimWhitespace: true);
DataView trainData = new DataView();
ITransformer mlModel = mlContext.Model.Load(Path.GetFullPath(@"G:\Users\Wilhelm\dev\MachineLearning\ML1\MLModel1.zip"), out var _);
//return mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(mlModel);
//var eval = mlModel.Transform(testData);
MulticlassClassificationMetrics metric = mlContext.MulticlassClassification.Evaluate(testData, "c11");
Console.WriteLine(metric.ConfusionMatrix.GetFormattedConfusionTable()); //PrintConfusionMatrix("LightGbm", metric);
```
which throws `ArgumentOutOfRangeException`. The cause is that in these two lines

```csharp
//var eval = mlModel.Transform(testData);
MulticlassClassificationMetrics metric = mlContext.MulticlassClassification.Evaluate(testData, "c11");
```

you are using `testData` instead of `eval` to evaluate your result. You need to first get `eval` by using your model to transform `testData`, and then evaluate metrics on `eval` with the `Evaluate` API. So the right code should be
```csharp
var eval = mlModel.Transform(testData);
MulticlassClassificationMetrics metric = mlContext.MulticlassClassification.Evaluate(eval, "c11");
```

The columns `Score` and `PredictedLabel` will be added by `mlModel` during transformation.
Super - TY!
I see, so only the IDataView from `Transform` has the flexibility to add new columns dynamically, whereas the one from `LoadFromTextFile` doesn't. Super, TY @LittleLittleCloud
I'm trying to start from a saved model so I can analyze results and add more training time to that model if needed. I dug (and will dig more over the weekend) into `Fit` with something like this:
```csharp
var eval = mlModel.Transform(traindata);
MulticlassPredictionTransformer<OneVersusAllModelParameters> transformer =
    mlContext.MulticlassClassification.Trainers.LightGbm(LABEL).Fit(eval);
```
It runs and I'm guessing it is retraining (with the `Fit` above), but I'm not able to reuse the newly retrained model saved from `transformer` above. I'm doing this:

```csharp
mlContext.Model.Save(transformer, trainingData.Schema, "c:\\model_LightGbmMulti.zip");
```

I think it doesn't save the right retrained model. The schema of the saved retrained model must be correct, as it should be the same as the schema of the initial model we started from; in other words, the pre-model's schema should be the same as the post-model's schema, though the model itself will be different.
But when I try to load the newly retrained model, I get an exception:

```csharp
var mlContext = new MLContext();
IDataView testData = mlContext.Data.LoadFromTextFile<ModelInput>(file, separatorChar: ',', hasHeader: true, trimWhitespace: true);
ITransformer mlModel = mlContext.Model.Load(MLNetModelPath, out var _);
var eval = mlModel.Transform(testData);
MulticlassClassificationMetrics metric = mlContext.MulticlassClassification.Evaluate(eval, LABEL);
Console.WriteLine(metric.ConfusionMatrix.GetFormattedConfusionTable());
```
It gives me this exception, and I think it is not a schema issue (the schemas should be the same) but rather the retrained model that was saved:

```
Features column 'Feature' not found (Parameter 'schema')
   at Microsoft.ML.Data.RoleMappedSchema.MapFromNames(DataViewSchema schema, IEnumerable`1 roles, Boolean opt)
   at Microsoft.ML.Data.RoleMappedSchema..ctor(DataViewSchema schema, IEnumerable`1 roles, Boolean opt)
   at Microsoft.ML.Data.PredictedLabelScorerBase.BindingsImpl.ApplyToSchema(DataViewSchema input, ISchemaBindableMapper bindable, IHostEnvironment env)
   at Microsoft.ML.Data.PredictedLabelScorerBase..ctor(IHostEnvironment env, PredictedLabelScorerBase transform, IDataView newSource, String registrationName)
   at Microsoft.ML.Data.MulticlassClassificationScorer..ctor(IHostEnvironment env, MulticlassClassificationScorer transform, IDataView newSource)
   at Microsoft.ML.Data.MulticlassClassificationScorer.ApplyToDataCore(IHostEnvironment env, IDataView newSource)
   at Microsoft.ML.Data.RowToRowScorerBase.ApplyToData(IHostEnvironment env, IDataView newSource)
   at Microsoft.ML.Data.PredictionTransformerBase`1.Transform(IDataView input)
   at ML1.Program.TestModel(String file) in G:\Users\Wilhelm\dev\MachineLearning\ML2\Program.cs:line 188
```
I can always do the following, and this works, but it is way too cumbersome when you have thousands of columns. There must be a simpler way than the following to save the retrained model?
```csharp
var pipeline = mlContext.Transforms.ReplaceMissingValues(@"c10", @"c10")
    .Append(mlContext.Transforms.Concatenate(@"Features", new[] { @"c10" }))
    .Append(mlContext.Transforms.Conversion.MapValueToKey(outputColumnName: @"c11", inputColumnName: @"c11"))
    .Append(mlContext.Transforms.NormalizeMinMax(@"Features", @"Features"))
    .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy(new SdcaMaximumEntropyMulticlassTrainer.Options() { L1Regularization = 1F, L2Regularization = 0.1F, LabelColumnName = @"c11", FeatureColumnName = @"Features" }))
    .Append(mlContext.Transforms.Conversion.MapKeyToValue(outputColumnName: @"PredictedLabel", inputColumnName: @"PredictedLabel"));

var model = pipeline.Fit(trainData);
mlContext.Model.Save(model, trainingData.Schema, "c:\\model_LightGbmMulti.zip");
```
Note: I read different articles, like https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/retrain-model-ml-net, but I'm not able to figure it out.
Is it possible to save the model every x minutes/hours/iterations so I can evaluate it? Some kind of callback?
I need to dig into `((Microsoft.ML.Data.IInternalCatalog)mlContext.MulticlassClassification.Trainers).Environment.ProgressTracker`.
Later, ideally, I would like to chart the training and test dataset results over time/iterations.
Please let me know if you know a book or some good articles to guide me.
Thank you!
Wil cc: @LittleLittleCloud
Hello, most of my code uses double. I'm making sure it loads as Single, since ML.NET supports Single and not double, from what I understand.
How can I know which ML.NET algorithms can handle positive or negative zero, PositiveInfinity, NegativeInfinity, and not-a-number (NaN)?
Those values have semantic significance, and it might be interesting to keep them for the ML.NET algorithms that can handle them; for the algorithms that cannot, I will transform the data somehow.
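(For trainers that can't take non-finite values, one option is ML.NET's `ReplaceMissingValues` transform, which treats NaN in a float column as a missing value. A sketch, assuming the `ModelInput2` class from earlier in the thread; the file path and the `Mean` replacement mode are my choices, not from the thread:)

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms;

var mlContext = new MLContext();
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput2>(
    "data.csv", separatorChar: ',', hasHeader: true, trimWhitespace: true);

// Replace NaN in c10 with the column mean before training.
// Note: this handles NaN only; +/-Infinity would still need a
// separate pre-processing or custom mapping step.
IDataView cleaned = mlContext.Transforms.ReplaceMissingValues(
        outputColumnName: "c10",
        inputColumnName: "c10",
        replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean)
    .Fit(data)
    .Transform(data);
```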
Thanks
I have a simple dataset with 2 fields: c10 and c11. c10 is a float, c11 is a string. The first row is the header.

```
c10,c11
-1,a-1
1,a+1
-1,a-1
1,a+1
0,a+0
1,a+1
1,a+1
-1,a-1
1,a+1
1,a+1
-1,a-1
-1,a-1
-1,a-1
1,a+1
-1,a-1
-1,a-1
-1,a-1
0,a+0
1,a+1
1,a+1
1,a+1
1,a+1
-1,a-1
-1,a-1
```
As you can see, this is very easy to solve visually.
If I run AutoML with the VS Model Builder UI, it crashes at the end with this.
Here is the log
I extended the time to train
Crash:
Log:
I would have expected it to find good results.
So I tried with 1 algorithm via C# and I got this:
Shouldn't the Confusion Matrix look like this?
I'm running the other trainers and will post once done... right now it seems stuck on AveragedPerceptronOva; it's been 20 min already...
Update after some time (I kill the program when it takes too long, so it can move on to the next trainer):
===================================== More or less related issues: