Closed rogancarr closed 5 years ago
Do we have any idea what should be new name?
@Ivanidzo4ka good question! In the above, I've made a suggestion for "IdColumn".
Sorry, I guess you mentioned it in another issue; I don't see it here.
IdColumn feels blank and also doesn't reflect its purpose. Maybe ConsistencyColumn or RetentionColumn?
How about RowGroupPreservationColumn? GroupPreservationColumn? PreservationColumn?
RowSetPreservationColumn? Super explicit, and doesn't use the word "group".
Row Set Preservation Society. That would be a good name for my second album.
GroupPreservationColumn sounds best to me, but it would be nice to ask other people around.
If I heard something was renamed to IdColumn, I would assume it was the Name column.
Is there another industry term for this? We can't be the first.
Closest I see in scikit-learn is GroupShuffleSplit. Perhaps SplitGroup?
https://scikit-learn.org/stable/modules/cross_validation.html#group-shuffle-split
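To make the "group" behavior under discussion concrete, here is a minimal sketch in plain Python of a split that keeps all rows sharing a key on the same side. It is not scikit-learn's or ML.NET's implementation; the function name `group_preserving_split`, the `user` key, and the choice to apply `test_fraction` to groups rather than rows are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def group_preserving_split(rows, group_key, test_fraction=0.1, seed=None):
    """Split rows into (train, test) so rows sharing a group value stay together.

    Hypothetical helper: test_fraction is applied to the number of distinct
    groups, not to the number of rows.
    """
    # Bucket rows by their group value.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    # Shuffle the distinct group values deterministically for a given seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Send whole groups to the test side; at least one group is held out.
    n_test = max(1, round(len(keys) * test_fraction))
    test = [row for key in keys[:n_test] for row in groups[key]]
    train = [row for key in keys[n_test:] for row in groups[key]]
    return train, test


# 20 rows spread evenly over 5 users; hold out 20% of the users.
rows = [{"user": f"u{i % 5}", "value": i} for i in range(20)]
train, test = group_preserving_split(rows, "user", test_fraction=0.2, seed=0)
```

No user appears in both `train` and `test`, which is the property the column being renamed is meant to guarantee.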
Another route is to rename the Group column to RankingGroup, which then frees up Stratification to move to Group (which seems to be the industry term).
Speaking of renaming, @Dmitry-A was saying earlier today that Name may be better called RowID.
public TrainTestData TrainTestSplit(IDataView data, double testFraction = 0.1, string stratificationColumn = null, uint? seed = null)
public CrossValidationResult<CalibratedBinaryClassificationMetrics>[] CrossValidate(IDataView data, IEstimator<ITransformer> estimator, int numFolds = 5, string labelColumn = DefaultColumnNames.Label, string stratificationColumn = null, uint? seed = null)
@justinormont what Name are you talking about?
The column purpose of Name, which allows a user to identify the row of data. It's mainly used for debugging, as it's printed to the .inst.txt file. It lets you match the input data row to the output score.
I'm unsure we have brought the concept to ML.NET.
Ah, that Name. Do we even expose it anywhere in ML.NET? It's probably part of some commands, but I don't think we do anything with commands right now, since they're all hidden.
I see it listed here: https://github.com/dotnet/machinelearning/blob/0c62e30b4d9eabb60322b2a3e75bc90e20007889/src/Microsoft.ML.Data/Commands/DefaultColumnNames.cs#L12
No idea if we utilize the concept though.
Let's keep this discussion on potential names for StratificationColumn. Any other naming issues, please open a separate issue. (Sorry to be strict, but I need to drive this to conclusion.)
So far we have
IdColumn: Too vague
Group: Group and relatives feel too rank-y to some folks, but it is industry-standard language.
RowGroupPreservationColumn
GroupPreservationColumn
RowSetPreservationColumn
ConsistencyColumn
RetentionColumn
@TomFinley @shauheen @glebuk @yaeldekel Any thoughts?
I renamed it to GroupPreservationColumn in https://github.com/dotnet/machinelearning/pull/2537
By itself not an acceptable name. If you somehow clarified the "group" column to mean something else. @justinormont's suggestion of RankingGroup is not my favorite since we use this in other contexts other than ranking (albeit lower priority ones that haven't yet been migrated to the open source codebase).
Anyway, sklearn gets away with it there because it's very, very clear in context what "group" it's talking about since you're calling GroupShuffleSplit. If you were to just identify something divorced from that context and just call it a "group," then by itself is it clear what it's talking about? Not at all.
The problem is that what type of "group" is considered relevant is very context-dependent. If you can make a case that "group" is used in other contexts to refer to this specifically, I could potentially change my mind. But as far as I can see, the case depends on a 5-character substring of a method name from Python, taken completely out of the context that made it clear what type of group you were talking about.
Maybe RowGroup column for what we now call a Group column, and SplitGroup or SplittingGroup column for what we call stratification. If we don't have the stomach to rename the "group" column at this time, which I could understand, maybe just call it SplitColumn. That suggests clearly enough to me that this has something to do with when a dataset is split, and I think we can easily explain it.
I like @TomFinley's naming suggestions:
Group => RowGroup
Stratification => SplitGroup (or SplittingGroup/SplitColumn)
It would be nice to make those names consistent as well. At least the last one.
CrossValidation and TrainTestSplit have a parameter called StratificationColumn that is used to preserve groupings of columns across splits (as discussed in #2487). This isn't actually stratification, so we should rename the column.
This is a forked sub-issue from #2487
Related to #1204
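For contrast, stratification in the usual statistical sense means keeping class proportions roughly equal on both sides of a split, which is a different guarantee from keeping grouped rows together. The following is a minimal sketch in plain Python under that reading; `stratified_split` and the `label` key are hypothetical names used only for illustration, not part of ML.NET or scikit-learn.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_key, test_fraction=0.5, seed=None):
    """Split rows into (train, test), sampling each label separately.

    Hypothetical helper: each label contributes about test_fraction of its
    rows to the test side, so label proportions are roughly preserved.
    """
    rng = random.Random(seed)

    # Bucket rows by label value.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        members = list(by_label[label])
        rng.shuffle(members)
        # Take the same fraction from every label.
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Imbalanced data: 4 positive rows and 16 negative rows.
rows = [{"label": "pos" if i < 4 else "neg", "id": i} for i in range(20)]
train, test = stratified_split(rows, "label", test_fraction=0.5, seed=0)
test_counts = Counter(row["label"] for row in test)
```

Here rows with the same label are deliberately spread across both sides, which is the opposite of what the parameter in this issue does with its key column.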