dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Proposal: AutoML Sweepable API #5992

Closed LittleLittleCloud closed 2 years ago

LittleLittleCloud commented 2 years ago

AutoML.Net Sweepable API proposal

Overview

The Sweepable API allows ML.NET users to create their own search space and pipeline for hyper-parameter optimization (HPO). It comes with three major parts: search space, sweepable estimator/pipeline, and tuner. All of the API lives under the Sweepable() extension (for now).

search space

A search space defines the range of hyper-parameters for a tuner to search over. The Sweepable API provides two ways to create a search space.

via attributes

public class Option
{
    // integer range [2, 32768], searched on a log scale with initial value 2
    [Range(2, 32768, init: 2, logBase: true)]
    public int WindowSize {get; set;}

    // one of [2, 3, 4]
    [Choice(2, 3, 4)]
    public int SeriesLength {get; set;}

    // one of [true, false]
    [Choice]
    public bool UseSoftmax {get; set;}

    // nested search space
    [Option]
    public Option AnotherOption {get;set;}
}

var ss = new SearchSpace<Option>();

// each search space maps to a 1-d feature space where every feature is in [0, 1).
// the search space handles the mapping between the HPO space and the feature space,
// so the tuner only needs to search over the feature space, whose dimension and range are both known.
var parameter = ss.SampleFromFeatureSpace(new[] { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 });

// auto-binding
parameter.WindowSize.Should().Be(2);
parameter.SeriesLength.Should().Be(2);
parameter.UseSoftmax.Should().BeTrue();
parameter.AnotherOption.WindowSize.Should().Be(2);

// the search space can also map a parameter back to the feature space
ss.MappingToFeatureSpace(parameter).Should().BeEquivalentTo(new[] { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 });

or, correspondingly, from scratch

var ss = new SearchSpace();
ss.Add("WindowSize", new UniformIntOption(2, 32768, true, 2));
ss.Add("SeriesLength", new ChoiceOption(2,3,4));
ss.Add("UseSoftmax", new ChoiceOption(true, false));
ss.Add("AnotherOption", ss.Clone());

var parameter = ss.SampleFromFeatureSpace(new []{0,0,0,0,0,0});

// auto-binding doesn't exist for the scratch api
parameter["WindowSize"].AsType<int>().Should().Be(2);
parameter["SeriesLength"].AsType<int>().Should().Be(2);
parameter["UseSoftmax"].AsType<bool>().Should().BeTrue();
parameter["AnotherOption"]["WindowSize"].AsType<int>().Should().Be(2);

// the search space can also map a parameter back to the feature space
ss.MappingToFeatureSpace(parameter).Should().BeEquivalentTo(new[] { 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 });

Currently, in order to make auto-binding work, there is a limitation on the parameter types that can be added to a search space: each parameter must be either a JSON primitive type or a nested search space.
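As a rough illustration of that limitation, consider the hypothetical option class below. The exact set of supported primitives (e.g. bool, int, double, string) is an assumption based on the description above, and this class itself is not part of the proposal.

// hypothetical sketch of the auto-binding limitation described above
public class SupportedOption
{
    // primitive property with a range attribute: OK
    [Range(1, 100, init: 1)]
    public int Iterations {get; set;}

    // primitive property with a choice attribute: OK
    [Choice("l1", "l2")]
    public string RegularizationKind {get; set;}

    // nested search space: OK
    [Option]
    public SupportedOption Nested {get; set;}

    // a property of an arbitrary non-primitive type without [Option]
    // (e.g. Dictionary<string, float>) would not be auto-bound and
    // should be kept out of the option class
}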

sweepable estimator

A sweepable estimator lets users combine a search space with an estimator, in a way similar to how ML.NET estimators/pipelines are created. You use CreateSweepableEstimator, which accepts a lambda function and a search space, to create a sweepable estimator. It also provides an .Append extension method so you can append a sweepable estimator in the same way you append any other ML.NET estimator. The example below shows how to create a pipeline for the Titanic dataset with two sweepable estimators, one for the text featurizer and the other for the fast tree trainer, via the Sweepable() extension.

var context = new MLContext();
var fastTreeSS = new SearchSpace<FastTreeOption>();
var textFeaturizeSS = new SearchSpace<FeaturizeTextOption>();

var pipeline = context.Transforms.Categorical.OneHotEncoding(new[] { new InputOutputColumnPair(@"Sex", @"Sex"), new InputOutputColumnPair(@"Embarked", @"Embarked") })
                .Append(context.Transforms.Concatenate(@"TextFeature", @"Name", "Ticket", "Cabin"))
                .Append(context.Sweepable().CreateSweepableEstimator(
                    (mlContext, option) =>
                    {
                        var textOption = new TextFeaturizingEstimator.Options
                        {
                            CaseMode = option.CaseMode,
                            KeepDiacritics = option.KeepDiacritics,
                            KeepNumbers = option.KeepNumbers,
                            KeepPunctuations = option.KeepPunctuations,
                            CharFeatureExtractor = new WordBagEstimator.Options()
                            {
                                NgramLength = option.WordBagEstimatorOption.NgramLength,
                                UseAllLengths = option.WordBagEstimatorOption.UseAllLengths,
                                Weighting = option.WordBagEstimatorOption.WeightingCriteria,
                            },
                        };

                        return mlContext.Transforms.Text.FeaturizeText("TextFeature", textOption);
                    },
                    textFeaturizeSS))
                .Append(context.Transforms.Concatenate(@"Features", new[] { @"Sex", @"Embarked", @"Pclass", @"Age", @"SibSp", @"Parch", @"Fare", "TextFeature" }))
                .Append(context.Transforms.Conversion.ConvertType("Survived", "Survived", Data.DataKind.Boolean))
                .Append(context.Sweepable().CreateSweepableEstimator(
                    (mlContext, option) => mlContext.BinaryClassification.Trainers.FastForest(labelColumnName: "Survived", featureColumnName: "Features", numberOfLeaves: option.NumberOfLeaves, numberOfTrees: option.NumberOfTrees),
                    fastTreeSS))
                .Append(context.BinaryClassification.Calibrators.Naive(labelColumnName: @"Survived", scoreColumnName: @"Score"));
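
The FastTreeOption and FeaturizeTextOption classes used above are pre-defined option types that are not shown in this snippet. As a hedged sketch only, such an option class could look like the following; the property names NumberOfLeaves and NumberOfTrees come from the lambda above, but the ranges and initial values here are placeholders, not the actual pre-built search space.

// hypothetical sketch of an option class backing SearchSpace<FastTreeOption>
public class FastTreeOption
{
    // number of leaves per tree, searched on a log scale (placeholder range)
    [Range(4, 32768, init: 4, logBase: true)]
    public int NumberOfLeaves {get; set;}

    // number of trees in the ensemble, searched on a log scale (placeholder range)
    [Range(4, 32768, init: 4, logBase: true)]
    public int NumberOfTrees {get; set;}
}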

After the sweepable pipeline is created, one can call BuildTrainingPipeline to convert it into an ML.NET pipeline.
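For example, one concrete pipeline can be built from a single point of the pipeline's combined search space. This is a sketch under the assumption that the sweepable pipeline exposes its combined search space via pipeline.SearchSpace (as used in the tuner example below) and its dimension via FeatureSpaceDim.

// build one concrete ML.NET pipeline from a single sampled parameter
var searchSpace = pipeline.SearchSpace;

// sample the origin of the feature space (all features at 0.0);
// FeatureSpaceDim is assumed to expose the dimension of the feature space
var oneParameter = searchSpace.SampleFromFeatureSpace(
    Enumerable.Repeat(0.0, searchSpace.FeatureSpaceDim).ToArray());

// convert the sweepable pipeline into a regular ML.NET training pipeline
var concretePipeline = pipeline.BuildTrainingPipeline(context, oneParameter);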

tuner

A tuner takes in a search space and runs the HPO algorithm. A few default tuning algorithms are provided by the Sweepable API (grid search/random search), and smarter HPO algorithms are coming soon.

The way to use a tuner is quite similar to the way to use an enumerator. The code below shows how a tuner works together with a search space and a sweepable estimator/pipeline.

var ss = pipeline.SearchSpace;
var tuner = new GridSearchTuner(ss);
var df = DataFrame.LoadCsv(@"titanic.csv");
var trainTestSplit = context.Data.TrainTestSplit(df, 0.1);
var bestAccuracy = 0.0;
var i = 0;
foreach (var param in tuner.Propose())
{
    Console.WriteLine($"trial {i++}");

    // convert sweepable pipeline to ml.net pipeline
    var trainingPipeline = pipeline.BuildTrainingPipeline(context, param);
    var model = trainingPipeline.Fit(trainTestSplit.TrainSet);
    var eval = model.Transform(trainTestSplit.TestSet);
    var accuracy = context.BinaryClassification.Evaluate(eval, "Survived").Accuracy;
    if (accuracy > bestAccuracy)
    {
        Console.WriteLine("Found best accuracy");
        Console.WriteLine("Current best parameter");
        Console.WriteLine(JsonConvert.SerializeObject(param));
        bestAccuracy = accuracy;

        Console.WriteLine($"Trial {i}: Current Best Accuracy {bestAccuracy}, Current Accuracy {accuracy}");
    }
}
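
Because the tuner only consumes a search space and yields parameters, swapping in a different tuning algorithm should only require changing the construction line. The snippet below assumes a RandomSearchTuner class that follows the same constructor pattern as GridSearchTuner; the name is taken from the "grid search/random search" tuners mentioned above, not from a finalized API.

// hypothetical: swap grid search for random search without touching the loop body
var randomTuner = new RandomSearchTuner(pipeline.SearchSpace);

foreach (var param in randomTuner.Propose())
{
    // identical body as above: build, fit, evaluate, keep the best parameter
    var trainingPipeline = pipeline.BuildTrainingPipeline(context, param);
    // ...
}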

You can visit here to try out the complete training code.

The difference between the Sweepable API and the existing API in AutoML.Net

The existing API in AutoML.Net performs HPO over a pre-defined search space and set of learners with the SMAC tuning algorithm, while the Sweepable API allows users to customize those settings: they can define their own search space, create a pipeline similarly to how it is created in ML.NET, and pick the tuner that suits their experiment best.
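For comparison, the existing experiment-style API looks roughly like the sketch below; the exact settings, overloads, and column-type requirements may differ.

// sketch of the existing AutoML.Net experiment API, for contrast: the search
// space, trainers, and tuning algorithm (SMAC) are chosen internally; the user
// only supplies the data, the task type, and a time budget
var autoContext = new MLContext();
var data = DataFrame.LoadCsv(@"titanic.csv");

var experiment = autoContext.Auto()
    .CreateBinaryClassificationExperiment(maxExperimentTimeInSeconds: 60);

// note: the label column may need to be converted to Boolean, as in the ConvertType step above
var result = experiment.Execute(data, labelColumnName: "Survived");
Console.WriteLine($"Best accuracy: {result.BestRun.ValidationMetrics.Accuracy}");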

Q & A

What's the difference between the Sweepable API and AutoML.Net experiments?

Will AutoML.Net benefit from the Sweepable API?

Why do we need the Sweepable API, who will be the beneficiaries, and what's the most common use case?

What's the timeline for integrating the Sweepable API into ML.NET?

Current feedback

JakeRadMSFT commented 2 years ago

Converted to PR -

https://github.com/dotnet/machinelearning/pull/5993