dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

StratificationColumn in CrossValidation and TrainTestSplit #2536

Closed: rogancarr closed this issue 5 years ago

rogancarr commented 5 years ago

CrossValidation and TrainTestSplit have a parameter called StratificationColumn that is used to keep rows that share the same value in that column together in the same split (as discussed in #2487). This isn't actually stratification, so we should rename the parameter.

This is a forked sub-issue from #2487

Related to #1204
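To illustrate the behavior described above, here is a minimal sketch of a split that keeps every row with the same key value on the same side. It is not ML.NET's implementation; the function name `group_preserving_split`, the `user` key column, and the toy data are all hypothetical and exist only for this example. Note that nothing here balances label proportions between the two sides, which is what stratification would normally mean.

```python
import random
from collections import defaultdict


def group_preserving_split(rows, key, test_fraction=0.1, seed=None):
    """Split rows into (train, test) so that rows sharing rows[i][key]
    always land on the same side of the split.

    Hypothetical helper for illustration only, not an ML.NET API.
    """
    # Bucket rows by the value of the key column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Shuffle whole groups, not individual rows.
    keys = sorted(groups)  # sort first so a fixed seed gives a fixed order
    random.Random(seed).shuffle(keys)

    # Send whole groups to test until it holds roughly test_fraction of rows;
    # every remaining group goes to train.
    target = test_fraction * len(rows)
    train, test = [], []
    for k in keys:
        side = test if len(test) < target else train
        side.extend(groups[k])
    return train, test


# Toy data: 20 users with 5 rows each (assumed shape, for the example only).
rows = [{"user": u, "label": (u + i) % 2} for u in range(20) for i in range(5)]
train, test = group_preserving_split(rows, key="user", test_fraction=0.2, seed=42)

train_users = {r["user"] for r in train}
test_users = {r["user"] for r in test}
# No user appears on both sides of the split.
print(train_users.isdisjoint(test_users))
```

Because whole groups move together, the test side can only approximate the requested fraction when group sizes are uneven.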

Ivanidzo4ka commented 5 years ago

Do we have any idea what the new name should be?

rogancarr commented 5 years ago

@Ivanidzo4ka good question! In the above, I've made a suggestion for "IdColumn".

Ivanidzo4ka commented 5 years ago

Sorry, I guess you mentioned it in the other issue; I don't see it here. IdColumn feels blank and also doesn't reflect its purpose. Maybe ConsistencyColumn or RetentionColumn?

rogancarr commented 5 years ago

How about RowGroupPreservationColumn? GroupPreservationColumn? PreservationColumn?

RowSetPreservationColumn? Super explicit, and doesn't use the word "group".

Ivanidzo4ka commented 5 years ago

Row Set Preservation Society. That would be a good name for my second album. GroupPreservationColumn sounds best to me, but it would be nice to ask other people around.

justinormont commented 5 years ago

If I heard something was renamed to IdColumn, I would assume it was the Name column.

Is there another industry term for this? We can't be the first.

justinormont commented 5 years ago

Closest I see in scikit-learn is GroupShuffleSplit. Perhaps SplitGroup?

https://scikit-learn.org/stable/modules/cross_validation.html#group-shuffle-split

Another route is to rename the Group column to RankingGroup, which then frees up Stratification to move to Group (which seems to be the industry term).

justinormont commented 5 years ago

Speaking of renaming, @Dmitry-A was saying earlier today that Name might be better called RowID.

Ivanidzo4ka commented 5 years ago

public TrainTestData TrainTestSplit(IDataView data, double testFraction = 0.1, string stratificationColumn = null, uint? seed = null)
public CrossValidationResult<CalibratedBinaryClassificationMetrics>[] CrossValidate(IDataView data, IEstimator<ITransformer> estimator, int numFolds = 5, string labelColumn = DefaultColumnNames.Label, string stratificationColumn = null, uint? seed = null)

@justinormont what Name are you talking about?

justinormont commented 5 years ago

The column purpose of Name, which allows a user to identify the row of data. It's mainly used for debugging as it's printed to the .inst.txt file. It lets you match the input data row to the output score.

I'm unsure whether we have brought the concept to ML.NET.

Ivanidzo4ka commented 5 years ago

Ah, that Name. Do we even expose it anywhere in ML.NET? It's probably part of some commands, but I don't think we do anything with commands right now, since they are all hidden.

justinormont commented 5 years ago

I see it listed here: https://github.com/dotnet/machinelearning/blob/0c62e30b4d9eabb60322b2a3e75bc90e20007889/src/Microsoft.ML.Data/Commands/DefaultColumnNames.cs#L12

No idea if we utilize the concept though.

rogancarr commented 5 years ago

Let's keep this discussion on potential names for StratificationColumn. Any other naming issues, please open a separate issue. (Sorry to be strict, but I need to drive this to conclusion.)

rogancarr commented 5 years ago

So far we have

IdColumn: Too vague
Group: Group and its relatives feel too rank-y to some folks, but it is industry-standard language.
RowGroupPreservationColumn
GroupPreservationColumn
RowSetPreservationColumn
ConsistencyColumn
RetentionColumn

@TomFinley @shauheen @glebuk @yaeldekel Any thoughts?

rogancarr commented 5 years ago

I renamed it to GroupPreservationColumn in: https://github.com/dotnet/machinelearning/pull/2537

TomFinley commented 5 years ago

By itself, that is not an acceptable name, unless you somehow clarified the "group" column to mean something else. @justinormont's suggestion of RankingGroup is not my favorite, since we use this in contexts other than ranking (albeit lower-priority ones that haven't yet been migrated to the open source codebase).

Anyway, sklearn gets away with it there because it's very, very clear in context what "group" it's talking about since you're calling GroupShuffleSplit. If you were to just identify something divorced from that context and just call it a "group," then by itself is it clear what it's talking about? Not at all.

The problem is that what type of "group" is considered relevant is very context dependent. If you can make a case that "group" is used in other contexts to refer to this specifically, I could potentially change my mind. But as far as I can see, the case depends on a 5-character substring of a method name from Python, taken completely out of the context that made it clear what type of group you were talking about.

Maybe RowGroup column for what we now call a Group column, and SplitGroup or SplittingGroup column for what we call stratification. If we don't have the stomach to rename the "group" column at this time, which I could understand, maybe just call it SplitColumn. That suggests clearly enough to me that this has something to do with when a dataset is split, and I think we can easily explain it.

justinormont commented 5 years ago

I like @TomFinley's naming suggestions:

Ivanidzo4ka commented 5 years ago

https://github.com/dotnet/machinelearning/blob/3b9d407d9dc4f8c46fa85ab80575ef16d74df6df/src/Microsoft.ML.Data/TrainCatalog.cs#L208

https://github.com/dotnet/machinelearning/blob/3b9d407d9dc4f8c46fa85ab80575ef16d74df6df/src/Microsoft.ML.Data/TrainCatalog.cs#L214

It would be nice to make those names consistent as well. At least the last one.