Cross-Validation API for v1.0

rogancarr commented 5 years ago

Hi All,

As we approach v1.0, I thought it might be nice to look at the API for cross-validation. Currently, our cross-validation API takes the inputs:

IDataView data; // training data
IEstimator<ITransformer> estimator; // Model to fit
int numFolds; //Number of folds to make
string labelColumn; // The label
string stratificationColumn; // The column to stratify on
seed; // The seed

and returns an array of

RegressionMetrics metrics;
ITransformer model;
IDataView scoredTestData;

with one entry for each fold.

I have a few questions:

1) Are we happy with the outputs? I'm not overly concerned with these, but it will be hard to make this list smaller as we go. 2) Do we need to specify labelColumn? Isn't there a way to get the label from the model? Making this explicit means that we are allowing the learner and the CV metrics to utilize different labels. 3) Are we using the right terminology for stratification? Stratification usually means that ratios of classes are maintained across splits (see stratified sampling on wikipedia). Here, stratification means that items with the same value are clumped into the same split. The former makes sense if you want to maintain class ratios, especially with highly imbalanced classes, while the latter is useful for things like ranking (e.g. groupIds) or where leakage due to something like ordering may be a concern.

rogancarr commented 5 years ago

For item no. 3, I suggest the following:

Instead of Stratification, call it something like Grouping or RetainGroups, so CV would have the option GroupingColumn or RetainGroupsColumn.
Add support for ratio-balancing "stratification" in v1.1.

rogancarr commented 5 years ago

@shauheen @glebuk @TomFinley What do you think?

justinormont commented 5 years ago

For #1:

I'd like to see aggregate metrics returned.
Is scoredTestData lazy? Or cached from the metrics calculation? If cached, this reduces our scalability.

Another question:

Can we validate values in the `Stratification` column?
It's too easy for users to shoot themselves in the foot. -- also in v1.1

justinormont commented 5 years ago

More details for my added #4...

We should emit an error/warning when the user tries to use CV w/ a SamplingKeyColumn which includes numbers not between 0 & 1. I ran into the issue w/ a ranking dataset.

I think our current handling per datatype is:

text : Run thru hashing, then evenly divide the hashspace [0...2^hashbits] into the folds of [0...2^hashbits/k), ...
boolean : unknown to me
I4 : unknown to me
key : Evenly divide the key space [0..N] in to the folds of [0, N/k), [n/k, 2*N/k), ...
- Downside here, is if this was generated from a term transform on a column. **the semi-bad part** The larger/more-common groups are found sooner, and get placed as lower numbers in the key space, whereas uncommon groups are found last and are assigned the last key values. The user gets no warning of the produced imbalanced folds.
R4 : Blindly assume the space is uniformly distributed over [0..1] **the bad part**
- Downside here, is if the value is >1. The user gets no warning that all rows will go into one CV fold.

We may also want to warn when the CV fold sizes are rather unbalanced, or worse when one fold has no data.

Update: The Stratification column has been renamed to SamplingKeyColumn; fixed comment above with new terminology.

TomFinley commented 5 years ago

More specifically, perhaps, let's consider the actual signature:

https://github.com/dotnet/machinelearning/blob/b863ac2458f486834a9da3ecdd5a69f51f99db73/src/Microsoft.ML.Data/TrainCatalog.cs#L514-L516

An easily fixed problem is that the transformer type is not generic on the type of transformer returned. This has a practical effect... let's imagine you feed in SDCA or AP... surprise! You won't be able to know that the result is a linear model, because this method was not generic on the transformer type. If it had a generic parameter TTransformer with the restriction where TTransformer : ITransformer, then it would be right in the type, you'd get sensible intellisense, etc. (Practically this type will be auto after to be inferred I believe.) (Note that this also means that the return value will be generic on that same type.)

The most serious problem is the return value. 😄 Returning an array of value-tuples is not acceptable. Consider this comment from @justinormont ...

I'd like to see aggregate metrics returned.

Now, yeah, sure, we won't get around to this for v1, but by this being an array, it would be impossible to add it as post v1. Rather, the return value should be some single object out of which you can get the per fold results. We can always add more things (like aggregate metrics) to the object later, but if it is an array, we simply cannot.

Further, those per fold results should not be a value-tuple, for precisely the same reason that it is impossible to add more to them later without it being a breaking change.

(More generally: someone should followup and make sure we don't use value-tuples in our public surface of our API. They are currently quite problematic for F# and the tooling in VS is currently poor.)

Note that there are multiple of these methods, usually one per each of the major training catalogs. You mentioned the regression one.

Ivanidzo4ka commented 5 years ago

The most serious problem is the return value. 😄

Considering what we are heavily investing in documentation right now. We currently putting this code in our samples.

// Leave out 10% of data for testing
            var (trainData, testData) = mlContext.BinaryClassification.TrainTestSplit(data, testFraction: 0.1);

I don't mind to go through code and clean this stuff, but first we need to agree what we need to do

Shall we have

public struct TrainTestSplit
        {
            public readonly IDataView TrainSet;
            public readonly IDataView TestSet;
        }

or we prefer other options?

rogancarr commented 5 years ago

@Ivanidzo4ka Those look good to me.

rogancarr commented 5 years ago

@TomFinley What about Stratification? That parameter name is super confusing for me, as this isn't usually what people mean when they say "stratification".

TomFinley commented 5 years ago

Stratification as you point out is a bad name. Indeed what we do is somewhat the opposite of stratification, in a way. The suggestion of "group" though by itself is not the best, since we already use that to identify actual groups.

So what we're trying to do with this is identify sets of items where, if they were to be split into the training and test set, would represent a form of label leakage. (Certainly groups as we see them in ranker training represents one important case of this.) So maybe "co-dependency ID." 😄 No but seriously, not that, though I agree it should be renamed @rogancarr.

rogancarr commented 5 years ago

Okay, now that PR #2501 and #2536 are in, we just have one last question:

2. Do we need to specify labelColumn? Isn't there a way to get the label from the model? Making this explicit means that we are allowing the learner and the CV metrics to utilize different labels.

@TomFinley @Ivanidzo4ka Any thoughts on this last one?

dotnet / machinelearning

Cross-Validation API for v1.0 #2487