dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

SweepableEstimator SearchSpace not being fully explored #7085

Open fwaris opened 3 months ago

fwaris commented 3 months ago


Describe the bug: The SearchSpace of a SweepableEstimator is not being fully explored. I have a SweepableEstimator whose search space covers 'k', the number of clusters for KMeans. The range is a uniform int with Min=3, Max=20 and Default=10. I log the selected k each time the SweepableEstimator is called. The logs show that k hovers around the default value (i.e. 8, 9, 10, 11); the full range is not explored.

The script that showcases this problem is here:

https://github.com/fwaris/MLNetGEOpt/blob/master/MLNetGEOpt/scripts/custering.fsx

Expected behavior: The search space should be explored more fully.

Screenshots, Code, Sample Projects: the project is at https://github.com/fwaris/MLNetGEOpt

Additional context

Background

The referenced project, 'MLNetGEOpt', is an automated ML layer built on top of ML.NET's AutoML.

AutoML finds optimal parameters given a SweepablePipeline.

MLNetGEOpt proposes new SweepablePipelines for AutoML to optimize.

It uses a method called "Grammatical Evolution" (GE). The pipelines are constructed according to a given 'grammar'. Each pipeline is a valid 'sentence' constructed from the grammar.

The grammar ensures that the pipelines are reasonable. This greatly reduces the search space compared to randomly constructed pipelines, such as those produced by a Genetic Algorithm.

Note: I solved for the optimal number of clusters by building a grammar that allows one of many SweepableEstimators, each tied to a particular k.

Here is an example of the grammar (prefix 'se' stands for SweepableEstimator; 'Alt'=select 1 from available options; 'Opt'=optional term):

let g = 
    [
        Estimator seBase
        Opt(Estimator (E.Def.seFtrSelCount 3))        
        Alt [
            Alt ([(1,10); (11,20); (21,30); (31,100)] |> List.map(E.Def.seNorm>>Estimator))
            Estimator E.Def.seNormLpNorm
            Estimator E.Def.seNormLogMeanVar
            Estimator E.Def.seNormMeanVar
            Alt([0.1f .. 0.5f .. 4.0f] |> List.pairwise |> List.map(fun (a,b) -> a, b - 0.001f)  |> List.map(E.Def.seGlobalContrast>>Estimator))
            Estimator E.Def.seNormMinMax
            Estimator E.Def.seNormRobustScaling
        ]
        Alt [for i in 3 .. 20 -> Estimator (seCluster i)]  // this works: one estimator per fixed k
        //Estimator seClusterWithSS                        // this does not work: single estimator with k in its search space
    ]
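
The float range expression in the grammar is dense, so here is a minimal sketch that evaluates it on its own, without the final mapping to `Estimator`. It assumes (this is not stated in the issue) that each resulting pair is passed to `E.Def.seGlobalContrast` as a lower and upper bound for one of its parameters; the name `bands` is hypothetical.

```fsharp
// Same expression as in the grammar above, minus the Estimator mapping.
// `bands` is a hypothetical name used only for this illustration.
let bands =
    [0.1f .. 0.5f .. 4.0f]                      // 0.1, 0.6, 1.1, ... in steps of 0.5
    |> List.pairwise                            // consecutive (lower, upper) pairs
    |> List.map (fun (a, b) -> a, b - 0.001f)   // shrink upper bound so bands do not overlap

// Print each band and the total number of bands.
bands |> List.iter (fun (lower, upper) -> printfn "%.3f .. %.3f" lower upper)
printfn "%d bands" (List.length bands)
```

The range holds eight values, so `List.pairwise` should yield seven adjacent, non-overlapping bands. Each band then becomes its own alternative in the inner `Alt`, the same one-estimator-per-range idea used for `seCluster`.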

For reference, a specific grammar can be constructed from this simple 'meta-grammar':

type Term = 
    | Opt of Term 
    | Pipeline of (unit -> SweepablePipeline)
    | Estimator of (unit -> SweepableEstimator)
    | Alt of Term list
    | Union of Term list
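
To make the `Alt` and `Opt` semantics concrete, the sketch below enumerates every 'sentence' a small grammar can produce. It is a simplified stand-in, not the MLNetGEOpt implementation: estimators are plain strings rather than `unit -> SweepableEstimator` factories, `Pipeline` and `Union` are left out because the issue does not describe them, and `expand` and `sentences` are hypothetical helper names. Grammatical Evolution selects sentences through a genome rather than enumerating them; the enumeration only shows how large the structural space is.

```fsharp
// Simplified stand-in for the issue's Term type (assumption: strings
// replace the SweepableEstimator factories; Pipeline and Union omitted).
type Term =
    | Opt of Term
    | Estimator of string
    | Alt of Term list

// All ways a single term can expand, each as a sequence of estimator names.
let rec expand (term: Term) : string list list =
    match term with
    | Estimator name -> [ [ name ] ]
    | Opt t -> [] :: expand t               // either skip the term or include it
    | Alt ts -> ts |> List.collect expand   // pick exactly one alternative

// All sentences of a grammar: the cartesian product of the per-term expansions.
let sentences (grammar: Term list) : string list list =
    List.foldBack
        (fun term acc ->
            [ for choice in expand term do
                  for rest in acc do
                      yield choice @ rest ])
        grammar
        [ [] ]

let grammar =
    [ Estimator "base"
      Opt (Estimator "featureSelection")
      Alt [ for k in 3 .. 5 -> Estimator (sprintf "kmeans(k=%d)" k) ] ]

let all = sentences grammar
printfn "%d sentences" (List.length all)   // 1 * 2 * 3 = 6
all |> List.iter (String.concat " -> " >> printfn "%s")
```

By the same counting, the grammar shown earlier would have 1 × 2 × 16 × 18 = 576 structural variants, assuming the float range expands to seven bands. AutoML then still tunes the remaining parameters inside each variant.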

fwaris commented 3 months ago

I just realized that the default tuner is the Eci Cost Frugal tuner, which may be searching more narrowly.

I will try another tuner to see whether the search space is explored more fully.
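
For anyone following along, switching tuners might look like the sketch below. It assumes the `Microsoft.ML.AutoML` package and its `AutoMLExperiment` extension methods (`SetRandomSearchTuner`, `SetGridSearchTuner`); the pipeline, dataset, and metric setup from the script are omitted, so this only shows where the tuner is chosen. Whether either tuner actually covers the k range more evenly is untested here.

```fsharp
// Requires the Microsoft.ML.AutoML NuGet package (F# script syntax).
#r "nuget: Microsoft.ML.AutoML"

open Microsoft.ML
open Microsoft.ML.AutoML

let ctx = MLContext()

// An experiment with no pipeline or dataset yet; only the tuner is configured here.
let experiment = ctx.Auto().CreateExperiment()

// Replace the default tuner with random search, which samples the
// search space without steering toward the default values.
experiment.SetRandomSearchTuner() |> ignore

// Alternative: grid search over the search space.
// experiment.SetGridSearchTuner() |> ignore
```

Random search ignores trial feedback, so it is a reasonable diagnostic: if k still clusters around 10 with it, the problem is more likely in how the search space is defined than in the tuner.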