dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

More Parameterized Column Names for .CrossValidate() #5621

Open caitisgreat opened 3 years ago

caitisgreat commented 3 years ago

System information

.NET SDK (reflecting any global.json):
  Version: 5.0.100
  Commit: 5044b93829

Runtime Environment:
  OS Name: Windows
  OS Version: 10.0.18363
  OS Platform: Windows
  RID: win10-x64
  Base Path: C:\Program Files\dotnet\sdk\5.0.100\

Host (useful for support):
  Version: 5.0.0
  Commit: cf258a14b7

.NET SDKs installed:
  3.1.200 [C:\Program Files\dotnet\sdk]
  5.0.100 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.All 2.1.16 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.16 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 3.1.2 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 5.0.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.16 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 3.1.2 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 5.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.1.2 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 5.0.0 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

ML.NET Package Version: v1.5.2

Request

Would it be possible to expose the same column-name parameters in the CrossValidate method that the Evaluate method (Multiclass/Binary Classifiers) already has? I'm currently doing a lot of column manipulation to keep the outputs distinct in a sequenced ML pipeline (multiclass classification followed by sentiment analysis).

Source code / logs

[Image attached to the original issue]

justinormont commented 3 years ago

The additions seem reasonable.

The request is to add scoreColumnName and predictedLabelColumnName to the CrossValidate API. These would control the names of the output columns when one of the cross-validation folds' models is run.

Workaround

You can use the CopyColumns transform to give the output Score and PredictedLabel columns new, unique names. The runtime/memory cost is negligible.

Most often I use a Concatenate transform for this purpose, though it has the side effect of upgrading the column to a vector type.

Side note: unless you want the metrics from the cross validation run, you're better off just fitting your model on the full dataset.

Side task for ML.NET

We may want to add a full training run within the cross-validation.

Currently, for 5-fold cross-validation, we run five folds of 80/20 train/validate, so each returned model is trained on only 80% of the data. A user then has to pick one of these five models to use. We could train one extra model on 100% of the data and return it as the final model. This would let a user get the metrics from CV and also have a better-fit model.
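To make the proposal concrete, here is a rough, language-agnostic sketch in Python of "k-fold metrics plus one extra fit on all rows". It is not ML.NET code: the function names (`k_fold_indices`, `fit_mean_model`, `cross_validate_with_final_fit`) are hypothetical, and the toy "model" is simply the mean of the training targets, chosen only so the example stays self-contained.

```python
from statistics import mean


def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx); fold i holds out rows where index % n_folds == i."""
    for fold in range(n_folds):
        valid = [i for i in range(n_rows) if i % n_folds == fold]
        train = [i for i in range(n_rows) if i % n_folds != fold]
        yield train, valid


def fit_mean_model(targets):
    """Toy trainer (assumption for illustration): the model is the mean target."""
    return mean(targets)


def mean_absolute_error(model, targets):
    """Score the constant prediction `model` against the held-out targets."""
    return mean(abs(t - model) for t in targets)


def cross_validate_with_final_fit(targets, n_folds=5):
    """Return per-fold metrics plus one extra model fit on 100% of the rows."""
    fold_metrics = []
    for train_idx, valid_idx in k_fold_indices(len(targets), n_folds):
        # Each fold model sees only (n_folds - 1) / n_folds of the data.
        fold_model = fit_mean_model([targets[i] for i in train_idx])
        held_out = [targets[i] for i in valid_idx]
        fold_metrics.append(mean_absolute_error(fold_model, held_out))
    # The extra run: metrics come from the folds, the returned model from all rows.
    final_model = fit_mean_model(targets)
    return fold_metrics, final_model


targets = [float(v) for v in range(10)]
fold_metrics, final_model = cross_validate_with_final_fit(targets, n_folds=5)
```

With five folds this costs six training runs instead of five, in exchange for a model that has seen every row.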