filter by clustering result

superichmann commented 10 months ago

KMeans creates a new numeric column, and FilterRowsByColumn filters by a numeric column. Could you enable filtering by the column that results from clustering?

LittleLittleCloud commented 10 months ago

I'm trying to follow your question. Are you using FilterRowsByColumn to keep only the rows you're interested in? For example, the rows that have been classified or clustered into a specific category?

superichmann commented 10 months ago

Yes. FilterRowsByColumn does not accept the output from KMeans, and trying to convert it with MapKeyToValue results in the same error as in #6587.

LittleLittleCloud commented 10 months ago

@superichmann

Can you share the code you have? The error basically indicates that ML.NET can't find the key-to-value mapping in its model file, and how the key-to-value mapping is serialized depends on your pipeline.
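One way to check whether a key-to-value mapping exists at all is to inspect the clustered schema directly. The sketch below is only an illustration, not code from this thread: the `Point` class and the four sample rows are made up, and it simply prints the type of `PredictedLabel` and the names of any annotations attached to it. If no `KeyValues` annotation is listed, that would be consistent with MapKeyToValue having nothing to map back to.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input type, used only for this illustration.
public class Point
{
    [VectorType(2)] public float[] Features;
}

public static class InspectClusterColumn
{
    public static void Main()
    {
        var ml = new MLContext(seed: 0);

        // Made-up sample data: two well-separated groups.
        var data = ml.Data.LoadFromEnumerable(new[]
        {
            new Point { Features = new float[] { 1, 1 } },
            new Point { Features = new float[] { 1, 2 } },
            new Point { Features = new float[] { 9, 9 } },
            new Point { Features = new float[] { 9, 10 } },
        });

        var model = ml.Clustering.Trainers.KMeans("Features", numberOfClusters: 2).Fit(data);
        var clustered = model.Transform(data);

        // Look at the column the trainer added.
        var column = clustered.Schema["PredictedLabel"];
        Console.WriteLine($"Type: {column.Type}");

        // List the annotations on the column; a key column that can be
        // mapped back to values carries an annotation named "KeyValues".
        foreach (var annotation in column.Annotations.Schema)
        {
            Console.WriteLine($"Annotation: {annotation.Name}");
        }
    }
}
```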

superichmann commented 9 months ago

Sure, here is my code:

using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;
namespace IrisFlowerClustering
{
    class Program
    {
        static readonly string _dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "iris.data");
        static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "IrisClusteringModel.zip");
        static void Main(string[] args)
        {
            var mlContext = new MLContext(seed: 0);
            IDataView dataView = mlContext.Data.LoadFromTextFile<IrisData>(_dataPath, hasHeader: false, separatorChar: ',');
            string featuresColumnName = "Features";
            var pipeline = mlContext.Transforms
                .Concatenate(featuresColumnName, "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
                .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3))
                .Append(mlContext.Transforms.Conversion.MapKeyToValue("BadFunctionNotActuallyWorkingWellPleaseFixItForMe", "PredictedLabel"));
            var model = pipeline.Fit(dataView);
            using (var fileStream = new FileStream(_modelPath, FileMode.Create, FileAccess.Write, FileShare.Write))
            {
                mlContext.Model.Save(model, dataView.Schema, fileStream);
            }
            var predictor = mlContext.Model.CreatePredictionEngine<IrisData, ClusterPrediction>(model);
            var prediction = predictor.Predict(TestIrisData.Setosa);
            Console.WriteLine($"Cluster: {prediction.PredictedClusterId}");
            Console.WriteLine($"Distances: {string.Join(" ", prediction.Distances)}");
        }
    }
}

LittleLittleCloud commented 9 months ago

You need to add a MapKeyToValue estimator before the KMeans trainer.

superichmann commented 9 months ago

I tried several things, but all of them throw an exception.

@LittleLittleCloud Can you please give me a working code example that uses clustering and then filters the resulting IDataView by a specific cluster value?

LittleLittleCloud commented 9 months ago

using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class DataPoint
{
    [VectorType(2)] public float[] Features;
    public string Label;
}

public class ClusteringPrediction
{
    [ColumnName("PredictedLabel")]
    public uint SelectedClusterId;
    [ColumnName("Score")]
    public float[] Distance;
}

var data = new[]
{
    new DataPoint { Features = new float[] { 1, 1 }, Label = "a" },
    new DataPoint { Features = new float[] { 1, 2 }, Label = "a" },
    new DataPoint { Features = new float[] { 2, 1 }, Label = "a" },
    new DataPoint { Features = new float[] { 5, 5 }, Label = "b" },
    new DataPoint { Features = new float[] { 5, 6 }, Label = "b" },
    new DataPoint { Features = new float[] { 6, 5 }, Label = "b" },
    new DataPoint { Features = new float[] { 9, 9 }, Label = "c" },
    new DataPoint { Features = new float[] { 9, 10 }, Label = "c" },
    new DataPoint { Features = new float[] { 10, 9 }, Label = "c" },
};

var context = new MLContext();

var dataView = context.Data.LoadFromEnumerable(data);

var kmeans = context.Transforms.Conversion.MapValueToKey("Label")
    .Append(context.Transforms.NormalizeMinMax("Features"))
    .Append(context.Transforms.Concatenate("Features", "Features"))
    .Append(context.Transforms.NormalizeLpNorm("Features"))
    .Append(context.Clustering.Trainers.KMeans("Features", numberOfClusters: 3))
    .Append(context.Transforms.Conversion.MapKeyToValue("Label"));

var kmeansModel = kmeans.Fit(dataView );

var clusteredData = kmeansModel.Transform(dataView );

foreach (var selectedClusterId in new []{1,2,3})
{
    var clusterRows = context.Data.CreateEnumerable<ClusteringPrediction>(clusteredData, reuseRowObject: false)
    .Where(x => x.SelectedClusterId == selectedClusterId);
    Console.WriteLine($"Rows of the cluster: with cluster id: {selectedClusterId}");
    foreach (var row in clusterRows)
    {
        Console.WriteLine($"ClusterId: {row.SelectedClusterId}, Distance: {string.Join(", ", row.Distance)}");
    }

}

context.Model.Save(kmeansModel, transformedData.Schema, "kmeans.mlnet");

output:

Rows of the cluster: with cluster id: 1
ClusterId: 1, Distance: 2.503395E-06, 0.10263336, 0.10263336
ClusterId: 1, Distance: 2.503395E-06, 0.10263336, 0.10263336
ClusterId: 1, Distance: 0.008203328, 0.053165615, 0.16768521
ClusterId: 1, Distance: 0.008203328, 0.16768521, 0.053165615
ClusterId: 1, Distance: 2.503395E-06, 0.10263336, 0.10263336
ClusterId: 1, Distance: 0.0027624965, 0.07201475, 0.13849694
ClusterId: 1, Distance: 0.0027624965, 0.13849694, 0.07201475
Rows of the cluster: with cluster id: 2
ClusterId: 2, Distance: 0.10247493, 0, 0.39999998
Rows of the cluster: with cluster id: 3
ClusterId: 3, Distance: 0.10247493, 0.39999998, 0

superichmann commented 9 months ago

Hi, thanks so much for the detailed response. Your code produces the following error: Error CS0103: The name 'transformedData' does not exist in the current context.

I need to do clustering on an IDataView and then create separate IDataViews - one for each cluster, and then run ml on each of the new IDataViews.

My data contains > 1000 columns of many kinds and I cannot create classes for them.

@LittleLittleCloud Is there a way to create separate IDataViews with the original data, one for each cluster?

superichmann commented 9 months ago

@LittleLittleCloud Anything?? Also saying that it is not possible and not supported is ok just say something

superichmann commented 9 months ago

Also, when sending the clustered IDataView to an AutoML experiment and setting the column information with ci.CategoricalColumnNames.Add("PredictedLabel");, the following error occurs:

Error: System.AggregateException: One or more errors occurred. (The item type must be valid for a key (Parameter 'itemType'))
---> System.ArgumentOutOfRangeException: The item type must be valid for a key (Parameter 'itemType')
at Microsoft.ML.SchemaShape.Column..ctor(String name, VectorKind vecKind, DataViewType itemType, Boolean isKey, SchemaShape annotations)
at Microsoft.ML.Transforms.ValueToKeyMappingEstimator.GetOutputSchema(SchemaShape inputSchema)
at Microsoft.ML.Data.EstimatorChain`1.GetOutputSchema(SchemaShape inputSchema)
at Microsoft.ML.Transforms.OneHotEncodingEstimator.GetOutputSchema(SchemaShape inputSchema)
at Microsoft.ML.Data.EstimatorChain`1.GetOutputSchema(SchemaShape inputSchema)
at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
at Microsoft.ML.AutoML.RegressionTrialRunner.RunAsync(TrialSettings settings, CancellationToken ct)
at Microsoft.ML.AutoML.AutoMLExperiment.RunAsync(CancellationToken ct)
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
at Microsoft.ML.AutoML.AutoMLExperiment.Run()
at Microsoft.ML.AutoML.RegressionExperiment.Execute(IDataView trainData, IDataView validationData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)
at Submission#25.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)

LittleLittleCloud commented 9 months ago

> Hi, thanks so much for the detailed response. Your code produces the following error: Error CS0103: The name 'transformedData' does not exist in the current context.

The last line is irrelevant to your case, so you can just remove it and run the example.

> I need to do clustering on an IDataView and then create separate IDataViews - one for each cluster, and then run ml on each of the new IDataViews.

You can use FilterByCustomPredicate. In that case you don't need to provide a complete data class; a class that defines only a subset of the columns also works.

Here's an updated version, hope this can help you

using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class DataPoint
{
    [VectorType(2)] public float[] Features;
    public string Label;
}

public class ClusteringPrediction
{
    // I can just create a data class with the column which I want to filter on
    [ColumnName("PredictedLabel")]
    public uint SelectedClusterId;
}

var data = new[]
{
    new DataPoint { Features = new float[] { 1, 1 }, Label = "a" },
    new DataPoint { Features = new float[] { 1, 2 }, Label = "a" },
    new DataPoint { Features = new float[] { 2, 1 }, Label = "a" },
    new DataPoint { Features = new float[] { 5, 5 }, Label = "b" },
    new DataPoint { Features = new float[] { 5, 6 }, Label = "b" },
    new DataPoint { Features = new float[] { 6, 5 }, Label = "b" },
    new DataPoint { Features = new float[] { 9, 9 }, Label = "c" },
    new DataPoint { Features = new float[] { 9, 10 }, Label = "c" },
    new DataPoint { Features = new float[] { 10, 9 }, Label = "c" },
};

var context = new MLContext();

var dataView = context.Data.LoadFromEnumerable(data);

var kmeans = context.Transforms.Conversion.MapValueToKey("Label")
    .Append(context.Transforms.NormalizeMinMax("Features"))
    .Append(context.Transforms.Concatenate("Features", "Features"))
    .Append(context.Transforms.NormalizeLpNorm("Features"))
    .Append(context.Clustering.Trainers.KMeans("Features", numberOfClusters: 3))
    .Append(context.Transforms.Conversion.MapKeyToValue("Label"));

var kmeansModel = kmeans.Fit(dataView );

var clusteredData = kmeansModel.Transform(dataView );

foreach (var selectedClusterId in new []{1,2,3})
{
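    // FilterByCustomPredicate drops the rows for which the predicate returns true,
    // so "!=" keeps only the rows that belong to selectedClusterId.
    var filteredData = context.Data.FilterByCustomPredicate<ClusteringPrediction>(clusteredData, (x)=> x.SelectedClusterId != selectedClusterId);
    // filteredData is an IDataView here. You can feed it to your other trainers.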

    // print cluster id && count
    Console.WriteLine($"Cluster {selectedClusterId} count: {context.Data.CreateEnumerable<DataPoint>(filteredData, reuseRowObject: true).Count()}");
}

superichmann commented 9 months ago

Thanks @LittleLittleCloud. It seems we are heading in the right direction, but there are still problems. After successfully filtering by the clustering result, I try to send the IDataView into an AutoML experiment, which returns the infamous error: System.NullReferenceException: 'Object reference not set to an instance of an object.'

My Code based on iris example:

namespace IrisFlowerClustering
{
    public class ClusteringPrediction
    {
        [ColumnName("PredictedLabel")]
        public uint SelectedClusterId;
    }

    class Program
    {
        static readonly string _dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "iris.data");
        static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "IrisClusteringModel.zip");

        static void Main(string[] args)
        {
            var mlContext = new MLContext(seed: 0);
            IDataView dataView = mlContext.Data.LoadFromTextFile<IrisData>(_dataPath, hasHeader: false, separatorChar: ',');
            string featuresColumnName = "Features";
            var pipeline = mlContext.Transforms
                .Concatenate(featuresColumnName, "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
                .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));
            var model = pipeline.Fit(dataView);
            var transformed = model.Transform(dataView);
            foreach (var selectedClusterId in new[] { 1, 2, 3 })
            {
                IDataView filteredData = mlContext.Data.FilterByCustomPredicate<ClusteringPrediction>(transformed, (x) => x.SelectedClusterId != selectedClusterId);
                Console.WriteLine(selectedClusterId + "\t" + filteredData.Preview());
                RegressionExperimentSettings expsettings = new RegressionExperimentSettings();
                expsettings.MaxExperimentTimeInSeconds = 10;
                var exp = mlContext.Auto().CreateRegressionExperiment(expsettings);
                var result = exp.Execute(filteredData, "PetalWidth"); // !!! ** ERROR ON THIS LINE ** !!!
                Console.WriteLine(result.BestRun.ValidationMetrics.RSquared);
            }
        }
    }
}

I tried your normalizers as well, but it didn't help. When debugging, the type of filteredData is Microsoft.ML.Transforms.CustomMappingFilter<IrisFlowerClustering.ClusteringPrediction> and not IDataView. Any suggestions?

LittleLittleCloud commented 9 months ago

Which version of ML.NET are you using?

superichmann commented 9 months ago

Oh, I was not on the latest alpha :p I updated to the prerelease and now it works! Thanks @LittleLittleCloud

If I want to set the PredictedLabel as a categorical column and send the entire clustered IDataView to an AutoML Experiment how do I do that without getting an error? (as shown here)

LittleLittleCloud commented 9 months ago

Can you share the complete code (with updated PredictedLabel) here?

superichmann commented 9 months ago

I will ask something instead.

Since KMeans produces PredictedLabel and Score, after feeding its output into a RegressionExperiment the final IDataView will have 3 Score columns.

In the example here I tried to add PredictedLabel as a categorical column in order to remove Score from the IDataView before the experiment, but got the "The item type must be valid for a key" error.

My question: What would be the best way to send a clustered IDataView to an experiment, and how should the multiple Score columns be handled?

Will the multiple Score interrupt the scoring process after the experiment?

superichmann commented 9 months ago

Anyway, I'm adding the code for reference, based on the iris example:

using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.AutoML;

namespace IrisFlowerClustering
{
    class Program
    {
        static readonly string _dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "iris.data");

        static void Main(string[] args)
        {
            var mlContext = new MLContext(1337);

            var expSettings = new RegressionExperimentSettings();
            expSettings.MaxModels = 1;
            var exp = mlContext.Auto().CreateRegressionExperiment(expSettings);
            ColumnInformation CI = new ColumnInformation();
            CI.LabelColumnName = "PetalWidth";
            string featuresColumnName = "Features";
            IDataView dataView = mlContext.Data.LoadFromTextFile<IrisData>(_dataPath, hasHeader: false, separatorChar: ',');
            foreach (var x in dataView.Schema)
            {
                Console.WriteLine(x.Name);
            }
            var result = exp.Execute(dataView, "PetalWidth");
            Console.WriteLine(result.BestRun.ValidationMetrics.RSquared);
            var pipeline1 = mlContext.Transforms.Concatenate(featuresColumnName, "SepalLength", "SepalWidth", "PetalLength");
            var concatenated = pipeline1.Fit(dataView).Transform(dataView);

            var result2 = exp.Execute(concatenated, CI);
            Console.WriteLine(result2.BestRun.ValidationMetrics.RSquared);

            var pipeline = mlContext.Transforms
                .Concatenate(featuresColumnName, "SepalLength", "SepalWidth", "PetalLength")
                .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));
            var model = pipeline.Fit(dataView);
            var clusteredIDataView = model.Transform(dataView);

            foreach (var x in clusteredIDataView.Schema)
            {
                Console.WriteLine(x.Name);
            }

            //var pipelineRemove = mlContext.Transforms.DropColumns(new[] { "Score", "PredictedLabel","Features" });
            //clusteredIDataView = pipelineRemove.Fit(clusteredIDataView).Transform(clusteredIDataView);
            //CI.IgnoredColumnNames.Add("Score");
            //CI.IgnoredColumnNames.Add("PredictedLabel");
            //CI.IgnoredColumnNames.Add("Features");
            var result3 = exp.Execute(clusteredIDataView, CI);
            Console.WriteLine(result3.BestRun.ValidationMetrics.RSquared);

            var transformed2 = result3.BestRun.Model.Transform(clusteredIDataView);
            foreach (var x in transformed2.Schema)
            {
                Console.WriteLine(x.Name); // 3 Score columns
            }

            var resultCheck = mlContext.Regression.Evaluate(transformed2, "PetalWidth", "Score"); // Can we trust this evaluation? Is the score different from the experiment's because of data splitting?
            Console.WriteLine(resultCheck.RSquared.ToString());
        }
    }
}

superichmann commented 8 months ago

@LittleLittleCloud Can you please look at my last two messages?

Also, do you know how I can use FilterByCustomPredicate with a column name that I only know at runtime?

LittleLittleCloud commented 8 months ago

Sorry for the late reply, it was buried in my notification list..

> My question: What would be the best way to send a clustered IDataView to an experiment, and how should the multiple Score columns be handled?

To avoid having multiple score columns, the simplest way, in my view, is to use different score / predicted label column names for different trainers. For example, for KMeans you can set the score column name to k_mean_score, and for regression, regression_score, etc.

> Will the multiple Score interrupt the scoring process after the experiment?

If multiple Score columns are present, the most recently written Score value will be used.
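As a rough sketch of the renaming idea: the simple `KMeans(featureColumnName, numberOfClusters)` overload used in this thread takes no output column names, so one option is to copy the trainer's outputs to new names and drop the originals. The names `KMeansScore` and `KMeansPredictedLabel`, the `Point` class, and the sample rows below are made up for illustration; `CopyColumns` and `DropColumns` are the standard ML.NET transforms.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input type, used only for this illustration.
public class Point
{
    [VectorType(2)] public float[] Features;
}

public static class RenameClusterOutputs
{
    public static void Main()
    {
        var ml = new MLContext(seed: 0);

        // Made-up sample data.
        var data = ml.Data.LoadFromEnumerable(new[]
        {
            new Point { Features = new float[] { 1, 1 } },
            new Point { Features = new float[] { 1, 2 } },
            new Point { Features = new float[] { 9, 9 } },
            new Point { Features = new float[] { 9, 10 } },
        });

        // Copy the clustering outputs to trainer-specific names, then drop
        // the generic "Score" and "PredictedLabel" columns so that a later
        // trainer can write its own without creating duplicates.
        var pipeline = ml.Clustering.Trainers.KMeans("Features", numberOfClusters: 2)
            .Append(ml.Transforms.CopyColumns("KMeansScore", "Score"))
            .Append(ml.Transforms.CopyColumns("KMeansPredictedLabel", "PredictedLabel"))
            .Append(ml.Transforms.DropColumns("Score", "PredictedLabel"));

        var clustered = pipeline.Fit(data).Transform(data);

        // Print the resulting column names to confirm the rename.
        foreach (var column in clustered.Schema)
        {
            Console.WriteLine(column.Name);
        }
    }
}
```

With the generic names gone, evaluating a later regression model against "Score" should no longer be ambiguous.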

superichmann commented 8 months ago

@LittleLittleCloud Thanks again. How do I one-hot encode the clustering result so it can be used as a categorical feature in regression?

LittleLittleCloud commented 8 months ago

@superichmann You can just concatenate the cluster result (which should be of key type) to the feature column. One-hot encoding first transforms a column to key type, which in your case probably won't be necessary.
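If concatenating the key column directly does not work in your setup, one alternative is to turn the key into an indicator vector first. The sketch below is an assumption on my part rather than something confirmed in this thread: it uses `MapKeyToVector` to expand `PredictedLabel` into a one-hot float vector and then concatenates it with the features. The `Point` class, the sample rows, and the output names `ClusterOneHot` and `FeaturesWithCluster` are hypothetical.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input type, used only for this illustration.
public class Point
{
    [VectorType(2)] public float[] Features;
}

public static class ClusterAsFeature
{
    public static void Main()
    {
        var ml = new MLContext(seed: 0);

        // Made-up sample data.
        var data = ml.Data.LoadFromEnumerable(new[]
        {
            new Point { Features = new float[] { 1, 1 } },
            new Point { Features = new float[] { 1, 2 } },
            new Point { Features = new float[] { 9, 9 } },
            new Point { Features = new float[] { 9, 10 } },
        });

        var pipeline = ml.Clustering.Trainers.KMeans("Features", numberOfClusters: 2)
            // Expand the key-typed cluster id into a one-hot float vector.
            .Append(ml.Transforms.Conversion.MapKeyToVector("ClusterOneHot", "PredictedLabel"))
            // Both inputs are float vectors now, so they can be concatenated.
            .Append(ml.Transforms.Concatenate("FeaturesWithCluster", "Features", "ClusterOneHot"));

        var transformed = pipeline.Fit(data).Transform(data);

        // Show the type of the combined feature column.
        Console.WriteLine(transformed.Schema["FeaturesWithCluster"].Type);
    }
}
```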

dotnet / machinelearning

filter by clustering result #6832