dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

CV + SamplingKey: How to drop it from being used as a training feature? #6786

Closed torronen closed 11 months ago

torronen commented 11 months ago

I am using cross-validation with a sampling key, because I know the sampling key column is a shortcut and will lead to overfitting.

I can also observe this in the global feature importance of the trained model:

    Sampling.1  1
    Sampling.4  0.85616875

My challenge now is how to drop the sampling key column so that it is not used as a training feature.

Simplified code below:

    experiment
        .SetPipeline(pipeline)
        .SetRegressionMetric(metric, labelColumn: columnInference.ColumnInformation.LabelColumnName)
        .SetTrainingTimeInSeconds(trainingTimeSeconds)
        .SetDataset(data, fold: NumFolds, samplingKeyColumnName: SamplingKeyColumn)
        .SetCostFrugalTuner();

    TrialResult experimentResults = await experiment.RunAsync(cts.Token);

If I drop the column as part of the pipeline, I get the error "failed with exception Could not find input column 'Sampling' (Parameter 'inputSchema')".

Removing the sampling key from ColumnInformation did not help:

    // Remove the sampling key column
    columnInference.ColumnInformation.CategoricalColumnNames.Remove(SamplingKeyColumn);

torronen commented 11 months ago

As a last resort, maybe the sampling key column could be dropped here: https://github.com/dotnet/machinelearning/blob/077a6b81966dc2c514572568917f36cb94e08ac4/src/Microsoft.ML.Data/TrainCatalog.cs#L104C1-L105C59 ?

That is the CrossValidate method, so the change would affect all existing implementations, but I do not think it should break anything. It would likely only improve the AutoML results. Are there cases where one would want to use the sampling key for CV and still train on it? I believe in nearly all cases the answer is no, but are there special cases that should still be supported?

In either case, I think having a way to drop the sampling key column for training while crossvalidating would be very important.

torronen commented 11 months ago

The code currently creates a splitting column and then drops it, but keeps the sampling key column. I would argue that, in many cases at least, the sampling key column should also be dropped. The user cannot do this at the moment in cross-validation, because the split is called from the CrossValidateTrain method. For TrainTestSplit, the user can drop the column after calling the split.

https://github.com/dotnet/machinelearning/blob/077a6b81966dc2c514572568917f36cb94e08ac4/src/Microsoft.ML.Data/DataLoadSave/DataOperationsCatalog.cs#L434

https://github.com/dotnet/machinelearning/blob/077a6b81966dc2c514572568917f36cb94e08ac4/src/Microsoft.ML.Data/DataLoadSave/DataOperationsCatalog.cs#L500C25-L500C25
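For the TrainTestSplit case mentioned above, dropping the key after the split could look roughly like the sketch below. It is only an illustration: the class name `SplitThenDropExample`, the method name `SplitAndDropKey`, and the 0.2 test fraction are made up for this example and are not part of ML.NET or of the linked commit.

```csharp
using Microsoft.ML;

public static class SplitThenDropExample
{
    // Hypothetical helper: split on the sampling key, then remove the key
    // column from both halves so a trainer can no longer see it.
    public static (IDataView Train, IDataView Test) SplitAndDropKey(
        MLContext ctx, IDataView data, string samplingKeyColumn)
    {
        // Rows sharing the same key value end up on the same side of the split.
        var split = ctx.Data.TrainTestSplit(
            data, testFraction: 0.2, samplingKeyColumnName: samplingKeyColumn);

        // DropColumns learns nothing from the data, so fitting on the
        // training set and reusing the transformer for the test set is fine.
        var dropper = ctx.Transforms.DropColumns(samplingKeyColumn).Fit(split.TrainSet);

        return (dropper.Transform(split.TrainSet), dropper.Transform(split.TestSet));
    }
}
```

The same trick is not available for cross-validation today, because the folds are created inside the CrossValidate call and the caller never gets the chance to drop the column between splitting and training.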

torronen commented 11 months ago

I am thinking of something like this: https://github.com/torronen/machinelearning/commit/076fd263541be48601b8ee49bbb45446399ba300 My main concern is whether there are cases where the sampling key column should be kept for training, and if so, whether there should be a switch to allow dropping the sampling key column.

If so, maybe here?

    experiment
        .SetPipeline(pipeline)
        .SetRegressionMetric(metric, labelColumn: columnInference.ColumnInformation.LabelColumnName)
        .SetTrainingTimeInSeconds(trainingTimeSeconds)
        .SetDataset(data, fold: NumFolds, samplingKeyColumnName: SamplingKeyColumn,
            dropSamplingKeyColumnAfterSplit: true) // proposed new parameter, does not exist yet
        .SetCostFrugalTuner();

LittleLittleCloud commented 11 months ago

I don't quite follow what "training feature" means here. Does it mean all the columns of the dataset except the label column, or just the "FeatureColumn" that is passed to a trainer? If it means the "FeatureColumn", maybe you can simply not concatenate the "SamplingKey" column into "FeatureColumn"?
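When building the feature vector by hand, the suggestion above amounts to leaving the key out of the list of columns passed to `Concatenate`. A minimal sketch, assuming all remaining columns share one numeric type so they can be concatenated; the class `ConcatWithoutKeyExample` and the method `FeatureColumnsExcluding` are hypothetical names, not ML.NET APIs:

```csharp
using System.Linq;
using Microsoft.ML;

public static class ConcatWithoutKeyExample
{
    // Hypothetical helper: every visible column name except the excluded
    // ones (typically the label and the sampling key).
    public static string[] FeatureColumnsExcluding(IDataView data, params string[] excluded)
    {
        return data.Schema
            .Where(column => !column.IsHidden && !excluded.Contains(column.Name))
            .Select(column => column.Name)
            .ToArray();
    }
}
```

The result can then be passed on, for example as `ctx.Transforms.Concatenate("Features", ConcatWithoutKeyExample.FeatureColumnsExcluding(data, "Label", "Sampling"))`. The key column stays in the data, so it remains available for splitting, but it never reaches the trainer.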

torronen commented 11 months ago

I think this is possible using the ignored columns. (I did not read the code; I only observed that the column is no longer shown in the global feature importance list.)

So, the solution is:

    // Mark the sampling key column as ignored so the featurizer skips it
    columnInference.ColumnInformation.IgnoredColumnNames.Add(SamplingKeyColumn);
    ....
    .Append(ctx.Auto().Featurizer(data, columnInformation: columnInference.ColumnInformation))

@LittleLittleCloud yes, you are correct. I think setting the column as ignored does what you proposed. It is kept in the data variable so that the sampling key can still be used for splitting, but it is not concatenated into "FeatureColumn" because it is in the ignored columns. I just did not remember that this ignored columns property existed.
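Put together, the bookkeeping on ColumnInformation might be wrapped up as in the sketch below. The helper class `IgnoreSamplingKeyExample` and its method are hypothetical, and whether column inference places the key in the categorical, numeric, or text list depends on the data, so the sketch clears all three as a precaution.

```csharp
using Microsoft.ML.AutoML;

public static class IgnoreSamplingKeyExample
{
    // Hypothetical helper: move a column out of the feature lists and into
    // the ignored list, leaving the column itself in the data.
    public static ColumnInformation IgnoreColumn(ColumnInformation info, string columnName)
    {
        // Remove returns false when the name is absent, so these calls are safe.
        info.CategoricalColumnNames.Remove(columnName);
        info.NumericColumnNames.Remove(columnName);
        info.TextColumnNames.Remove(columnName);

        if (!info.IgnoredColumnNames.Contains(columnName))
        {
            info.IgnoredColumnNames.Add(columnName);
        }

        return info;
    }
}
```

With this in place, `SetDataset(data, fold: NumFolds, samplingKeyColumnName: SamplingKeyColumn)` can still split on the key, while the featurizer built from the updated ColumnInformation should no longer include it.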

Thank you!