How to optimize a machine learning model in ML.NET?

nishaesubhash commented 5 years ago

I have created a machine learning model using ML.NET. The learning algorithm is a multiclass classification algorithm; I have also tried regression algorithms. But the output of the model is not accurate. Are there any optimization techniques in ML.NET to make the model more accurate?

kozulic commented 5 years ago

What kind of ML task do you want to solve: classification or regression? What does your data set look like? There are many different "optimization techniques" for making a model more accurate, so you will have to provide more info.

PeterPann23 commented 5 years ago

You can't use the models "out of the box" in production, and documentation is still a real problem, as most of it is still purely reflection-based.

Have a look at the AutoML wizard: it creates a model and tries a few different parameters. At the end it generates a report that shows the model that was the "winner" based on the parameters and networks it tested.

My model needs to predict a trend, so I use this:

mlnet auto-train --task multiclass-classification --dataset "H:\ML.NET CLI\StatsTrend.csv" --test-dataset "H:\ML.NET CLI\StatsTrend_Validate.csv" --max-exploration-time 21000 --output-path "H:\ML.NET CLI\StatsTrend" --label-column-name Trend --has-header true -c on

I then have a look at the code generated from the CSV files I fed in, as well as the configurations tested. Then, in the debug_log.txt that gets created in the output folder (in a subfolder, in my case H:\ML.NET CLI\StatsTrade\SampleBinaryClassification\logs), you will see something like this:


------------------------------------------------------------------------------------------------------------------
|                                                     Summary                                                    |
------------------------------------------------------------------------------------------------------------------
|ML Task: binary-classification                                                                                  |
|Dataset: StatsTrade.csv                                                                                         |
|Label : Trend                                                                                                   |
|Total experiment time : 21003.15 Secs                                                                           |
|Total number of models explored: 146                                                                            |
------------------------------------------------------------------------------------------------------------------
|                                              Top 5 models explored                                             |
------------------------------------------------------------------------------------------------------------------
|     Trainer                              Accuracy      AUC    AUPRC  F1-score  Duration #Iteration             |
|1    FastTreeBinary                         0.9397   0.7372   0.2318    0.0952      43.9        134             |
|2    LightGbmBinary                         0.9397   0.7411   0.2351    0.0902     300.9         43             |
|3    FastTreeBinary                         0.9397   0.7371   0.2303    0.0920     239.3         83             |
|4    FastTreeBinary                         0.9397   0.7324   0.2334    0.0862      42.3        113             |
|5    LightGbmBinary                         0.9397   0.7405   0.2322    0.0905      31.2        124             |
------------------------------------------------------------------------------------------------------------------
Generated trained model for consumption: H:\ML.NET CLI\StatsTrade\SampleBinaryClassification\SampleBinaryClassification.Model\MLModel.zip
Generated C# code for model consumption: H:\ML.NET CLI\StatsTrade\SampleBinaryClassification\SampleBinaryClassification.ConsoleApp
Check out log file for more information: H:\ML.NET CLI\StatsTrade\SampleBinaryClassification\logs\debug_log.txt

I then look up a given F1 score in the log file and find:

tr=FastTreeBinary{NumberOfLeaves:126, MinimumExampleCountPerLeaf:50, NumberOfTrees:100, LearningRate:0.08834058, Shrinkage:0.6504856} cache=+

Basically that's a FastTreeBinary using the parameter constructor with

NumberOfLeaves:126,
MinimumExampleCountPerLeaf:50,
NumberOfTrees:100,
LearningRate:0.08834058,
Shrinkage:0.6504856

Then play with this: change one parameter, train again, store the parameter values in a CSV together with the metrics, and plot them in Excel.

I capture my parameters using a simple looping app that changes the "usual suspects":

var options = new FastTreeBinaryTrainer.Options()
{
   NumberOfLeaves              = 126,
   MinimumExampleCountPerLeaf  = 50,
   NumberOfTrees               = 100,
   LearningRate                = 0.08834058f,
   Shrinkage                   = 0.6504856f,
   LabelColumnName             = nameof(BinaryVectorModelInput.Label),
   FeatureColumnName           = nameof(BinaryVectorModelInput.Features)
};

base.StoreOptions(options);
….

// Base class: records every public property and field of the options object
// as a string so the values can be stored next to the metrics.
internal void StoreOptions(object options)
{
    foreach (var prop in options.GetType().GetProperties())
    {
        Parameters["prop:" + prop.Name] = prop.GetValue(options)?.ToString();
    }

    foreach (var field in options.GetType().GetFields())
    {
        Parameters["field:" + field.Name] = field.GetValue(options)?.ToString();
    }
}

After this you alter the parameters and loop, and loop again. If I am not happy with the result, I change the input columns and the story starts again.
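As a rough illustration of the "change one parameter, retrain, log to CSV" loop described above, here is a minimal sketch. It is not the actual app: the class name OneAtATimeSweep and the trainAndScore delegate are hypothetical, and the delegate stands in for whatever code builds the trainer options from the dictionary, trains, and returns a single score.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class OneAtATimeSweep
{
    // Varies one parameter at a time around a baseline and returns CSV lines
    // (header first) with the parameter values used and the resulting score.
    public static List<string> Run(
        IReadOnlyDictionary<string, double> baseline,
        IReadOnlyDictionary<string, double[]> candidates,
        Func<IReadOnlyDictionary<string, double>, double> trainAndScore) // hypothetical hook
    {
        // Fixed column order so every row lines up with the header.
        var names = baseline.Keys.OrderBy(k => k, StringComparer.Ordinal).ToList();
        var lines = new List<string> { "changed," + string.Join(",", names) + ",score" };

        foreach (var candidate in candidates)
        {
            foreach (var value in candidate.Value)
            {
                // Copy the baseline and override exactly one parameter.
                var trial = baseline.ToDictionary(p => p.Key, p => p.Value);
                trial[candidate.Key] = value;

                double score = trainAndScore(trial);

                var cells = names.Select(n => trial[n].ToString(CultureInfo.InvariantCulture));
                lines.Add(candidate.Key + "," + string.Join(",", cells) + ","
                          + score.ToString(CultureInfo.InvariantCulture));
            }
        }
        return lines;
    }
}
```

The returned lines can be written out with File.WriteAllLines and opened in Excel for plotting. Keeping the "changed" column first makes it easy to filter the rows per parameter.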

I store my data like this:

public class StatsSetup
{

    public class Metric
    {
        [JsonProperty("changed")]
        private readonly Dictionary<string, (double change, string was, string now)> changed;

        public Metric()
        {
            changed = new Dictionary<string, (double change, string was, string now)>();
        }
        public string Name { get; set; }
        public Dictionary<string, double> Metrics { get; set; }
        public Dictionary<string, string> Parameters { get; set; }
        public string TraingFile { get; set; }
        public double Score { get; set; }

        [JsonIgnore]
        public Dictionary<string, (double change, string was, string now)> Changed => changed;
    }

    [JsonProperty("data")]
    private Dictionary<string,Metric> data;

    public StatsSetup()
    {
        data=new Dictionary<string, Metric>(StringComparer.Ordinal);
    }

    public void Add(ITrainerScore trained)
    {
        var name = new FileInfo(trained.ModelPath).Name.Split('.')[0];
        if(!data.TryGetValue(name, out var item))
        {
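            // Also record the score and model path on first insert so that GetScore works for new entries.
            item = new Metric() { Name = name, Metrics = trained.Metrics, Parameters = trained.Parameters, Score = trained.Score, TraingFile = trained.ModelPath };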
            data[name] = item;
        }
        else
        {           
            if(item.Score < trained.Score)
            {
                foreach(var kvp in item.Parameters)
                {
                    if(trained.Parameters.TryGetValue(kvp.Key, out string actual) && kvp.Value != null && !kvp.Value.Equals(actual, System.StringComparison.Ordinal))
                        item.Changed[kvp.Key]=(item.Score-trained.Score, was:kvp.Value,now: actual);
                }
            }
            item.Parameters     = trained.Parameters;
            item.Metrics        = trained.Metrics;
            item.Score          = trained.Score;
            item.TraingFile     = trained.ModelPath;
            data[name] = item;
        }

    }

    public double GetScore(ITrainerScore trained)
    {
        var name = new FileInfo(trained.ModelPath).Name.Split('.')[0];
        return data.TryGetValue(name, out var item) ?item.Score : double.NegativeInfinity;

    }

    public IDictionary<string, string> GetParameters(ITrainerScore trained)
    {
        var name = new FileInfo(trained.ModelPath).Name.Split('.')[0];
        if(data.TryGetValue(name, out var par))
        {
            if(trained.Parameters is null || trained.Parameters.Count == 0)
                trained.PopulateParameters(par.Parameters);
            return par.Parameters;
        }
        return new Dictionary<string,string>();

    }

    public IDictionary<string, double> Getmetrics(ITrainerScore trained)
    {
        var name = new FileInfo(trained.ModelPath).Name.Split('.')[0];

        if(data.TryGetValue(name, out var met))
        {
            if(trained.Metrics is null || trained.Metrics.Count == 0)
                trained.PopulateMetrics(met.Metrics);
            return met.Metrics;
        }
        return new Dictionary<string,double>();

    }

    public int ItemsInScore(double score) => data.Count(c => c.Value.Score >= score);

    public Dictionary<string, (double change, string was, string now)> GetChanges(ITrainerScore trained)
    {
        var name = new FileInfo(trained.ModelPath).Name.Split('.')[0];

        if(data.TryGetValue(name, out var item))
        {
            return item.Changed;
        }
        else
        {
            return new Dictionary<string, (double change, string was, string now)>();
        }
    }

}

I use ITrainerScore as a proprietary interface to make sure I get a dictionary for the metrics and parameters (they are not compatible between the networks…).
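The interface itself is not shown in this thread. Judging only from how StatsSetup calls it, it might look roughly like the sketch below. Every member here is inferred from that usage and is an assumption, not the real definition.

```csharp
using System.Collections.Generic;

// Hypothetical reconstruction: members inferred from the calls made in StatsSetup.
public interface ITrainerScore
{
    // Path of the saved model; StatsSetup uses the file name as its key.
    string ModelPath { get; }

    // Single number used to compare two training runs.
    double Score { get; }

    // Trainer-independent views of the metrics and the option values.
    Dictionary<string, double> Metrics { get; }
    Dictionary<string, string> Parameters { get; }

    // Restore previously stored values into an instance that has none yet.
    void PopulateMetrics(Dictionary<string, double> metrics);
    void PopulateParameters(Dictionary<string, string> parameters);
}
```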

Making a dictionary of metrics is not hard:

protected void PrintMetrics(BinaryClassificationMetrics metrics)
{
    // Accuracy is used as the single score for comparing runs.
    Score = metrics.Accuracy;

    this.Metrics[nameof(BinaryClassificationMetrics.Accuracy)]                      = metrics.Accuracy;
    this.Metrics[nameof(BinaryClassificationMetrics.AreaUnderPrecisionRecallCurve)] = metrics.AreaUnderPrecisionRecallCurve;
    this.Metrics[nameof(BinaryClassificationMetrics.AreaUnderRocCurve)]             = metrics.AreaUnderRocCurve;
    this.Metrics[nameof(BinaryClassificationMetrics.F1Score)]                       = metrics.F1Score;
    this.Metrics[nameof(BinaryClassificationMetrics.NegativePrecision)]             = metrics.NegativePrecision;
    this.Metrics[nameof(BinaryClassificationMetrics.NegativeRecall)]                = metrics.NegativeRecall;
    this.Metrics[nameof(BinaryClassificationMetrics.PositivePrecision)]             = metrics.PositivePrecision;
    this.Metrics[nameof(BinaryClassificationMetrics.PositiveRecall)]                = metrics.PositiveRecall;

    Console.WriteLine($"Accuracy: {metrics.Accuracy:F2}");
    Console.WriteLine($"AUC: {metrics.AreaUnderRocCurve:F2}");
    Console.WriteLine($"F1 Score: {metrics.F1Score:F2}");
    Console.WriteLine($"Negative Precision: {metrics.NegativePrecision:F2}");
    Console.WriteLine($"Negative Recall: {metrics.NegativeRecall:F2}");
    Console.WriteLine($"Positive Precision: {metrics.PositivePrecision:F2}");
    Console.WriteLine($"Positive Recall: {metrics.PositiveRecall:F2}");
    Console.WriteLine(metrics.ConfusionMatrix.GetFormattedConfusionTable());
}

codemzs commented 5 years ago

Thanks @PeterPann23. Using AutoML is one way to optimize models; other ways come simply through experience.

PeterPann23 commented 5 years ago

Experience = trial and error, as there is not enough documentation on the trainer options to get any working model. The framework also lacks visualisation. I think it's more like "guesstimating" now, as one has no idea what it is going to do. You, as a developer on the team, may have a different opinion, but we consumers should think twice before recommending this framework, as the TCO can't be estimated.

voodookoop commented 3 years ago

I'd couple @PeterPann23's ML model optimization structure with some kind of genetic algorithm framework for those cases where there are several input parameters for the training operation itself (for binaries, for example, NumberOfLeaves and MinimumExampleCountPerLeaf, as mentioned). Of course it will take ages to optimize even a single model, depending on available machine resources, but it might be worth a shot.
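To make the idea concrete, here is a minimal sketch of a genetic search over integer training parameters. It is only an illustration under assumptions: the GeneticSearch class, its defaults, and the fitness delegate are hypothetical, and in practice the delegate would train a model with the given values (for example NumberOfLeaves and MinimumExampleCountPerLeaf) and return its score.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GeneticSearch
{
    // A candidate holds one value per parameter; ranges[i] gives the
    // inclusive (min, max) bounds for parameter i. Higher fitness is better.
    public static int[] Optimize(
        (int min, int max)[] ranges,
        Func<int[], double> fitness,   // hypothetical hook: train and return a score
        int populationSize = 20,
        int generations = 30,
        double mutationRate = 0.2,
        int seed = 42)
    {
        var rng = new Random(seed);
        int[] RandomCandidate() =>
            ranges.Select(r => rng.Next(r.min, r.max + 1)).ToArray();

        var population = Enumerable.Range(0, populationSize)
                                   .Select(_ => RandomCandidate())
                                   .ToList();

        for (int g = 0; g < generations; g++)
        {
            // Score every candidate once and sort best first.
            var ranked = population
                .Select(c => (candidate: c, score: fitness(c)))
                .OrderByDescending(x => x.score)
                .Select(x => x.candidate)
                .ToList();

            // The better half survives unchanged and acts as the parent pool.
            var parents = ranked.Take(Math.Max(2, populationSize / 2)).ToList();
            var next = new List<int[]>(parents);

            while (next.Count < populationSize)
            {
                var a = parents[rng.Next(parents.Count)];
                var b = parents[rng.Next(parents.Count)];
                var child = new int[ranges.Length];
                for (int i = 0; i < child.Length; i++)
                {
                    // Uniform crossover, then an occasional random-reset mutation.
                    child[i] = rng.Next(2) == 0 ? a[i] : b[i];
                    if (rng.NextDouble() < mutationRate)
                        child[i] = rng.Next(ranges[i].min, ranges[i].max + 1);
                }
                next.Add(child);
            }
            population = next;
        }

        // Re-scores the final population; a real run would cache scores,
        // since each fitness call means a full training run.
        return population.OrderByDescending(fitness).First();
    }
}
```

Each fitness call is a full training run, so population size times generations gives the total training budget. That is the reason this can take ages on a single machine.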

dotnet / machinelearning

How to optimize a machine learning model in ML.NET? #4136