yaeldekel opened this issue 5 years ago
Changed to P1 as per @yaeldekel's recommendation, since this is specific to TLC.
I believe this problem isn't particular to Ranking and CV with a GroupId column; it applies in general when using a KeyDataViewType column as the SamplingKeyColumn in either TrainTestSplit or CrossValidation, and only in the case that the KDVT column has values that aren't evenly distributed across its range.
But in what cases would this occur? You point to this case:

> Assume that the values in the GroupId column are 10,11,...,19, and that they are loaded with TextLoader as a key type column

My understanding so far is that we can't actually load a column as KeyDataViewType using the TextLoader. I believe we can only specify the DataKind of a column using these DataKinds (which include unsigned integers, but not actually KeyDataViewType). So we could load the GroupId column you mention as an unsigned integer, but then, to actually make it a Key type column, we'd need to explicitly add a ValueToKeyTransformer (or a OneHotEncoding, or a HashingTransformer, etc.), and that would make the key range 0-10 instead of 0-20, so the keys would be balanced when doing the splits. Please correct me if I'm wrong.
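For concreteness, the workaround I have in mind looks roughly like this (a sketch; the file name and column layout are made up):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// Load GroupId as a plain unsigned integer (no key type at load time).
var loader = mlContext.Data.CreateTextLoader(new[]
{
    new TextLoader.Column("Label", DataKind.Single, 0),
    new TextLoader.Column("GroupId", DataKind.UInt32, 1),
});
var data = loader.Load("ranking_data.tsv");

// ValueToKeyTransformer assigns dense key ids to the 10 distinct values,
// so the resulting key type has cardinality 10 with no unused slots.
var keyed = mlContext.Transforms.Conversion
    .MapValueToKey("GroupIdKey", "GroupId")
    .Fit(data)
    .Transform(data);
```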
If I am right about this, is there any other case where we actually allow the user to create a KeyDataViewType column with values that are "unbalanced"? Using a ValueToKey/Hashing/OneHotEncoding transformer on top of any given column would result in keys that are evenly distributed, right?

The only other way I can think of (which I'm not even sure is possible, but I think it would be) is if the user loads a column and then applies a TypeConvertingTransformer to get a KeyDataViewType. I'm not sure about all the standard conversions we have, but I guess that converting an unsigned integer column into a KDVT in this manner could produce the problem you describe (if the original values aren't distributed evenly and this column is used as the SamplingKeyColumn for the splits). But, although this would still be somewhat of a bug, the simple workaround would be to suggest that the user use a ValueToKeyTransformer or a HashingTransformer instead of a TypeConverting one.

Are there other ways to get a KDVT with the characteristics you've mentioned?
@justinormont also brought up that queryIds in ranking are often reused, and thus popular queries will often be found earlier in the dataset, making the first fold over-sized; so the better approach is to hash the groupId.
@antoniovs1029 , the API for loading key types directly in the text loader is probably not very commonly used, but it does exist: https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs#L73-L83
Query IDs are often multi-digit numbers, all in a certain range that does not necessarily begin with 0. In previous, very old versions of ML.NET, the KeyDataViewType had not only a KeyCount, but an actual range, so you could define a key type with count=10 but values between 458 and 467, for example. I don't know how common this case is, but it might become more common if AutoML starts supporting ranking. Either way, I guess fixing this issue can be postponed until there is an actual use case where this is a problem.
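For reference, the loader-level key declaration looks roughly like this (a sketch based on the linked code; the KeyCount value is just illustrative):

```csharp
// Sketch: declaring a key-typed column directly at load time. With
// on-disk values 10..19, this yields a key type of cardinality 20,
// half of whose range is never used -- the situation described above.
var loader = mlContext.Data.CreateTextLoader(new[]
{
    new TextLoader.Column("GroupId", DataKind.UInt32, 0) { KeyCount = new KeyCount(20) },
});
```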
@Lynx1820 , I am not sure I understand why hashing the groupId would help in this case - if two queries have the same queryId, then the hash value would also be the same, wouldn't it?
Thanks for your reply, @yaeldekel . I wasn't aware of that possibility for TextLoader to also load KeyDataViewTypes directly. Good to know about it. I also don't know how common its usage is, so I'd agree that we can postpone fixing this issue until/if an actual user hits it.

I don't know if that TextLoader API is being used by the new AutoML API for Ranking. @Lynx1820 ? Maybe this wouldn't be much of an issue there?
@yaeldekel Assuming that queryIds are reused and thus more popular queryIds are found earlier in the dataset, the most popular queryIds will be mapped to lower key values. Thus, hashing would redistribute the queryIds with large sets of queries. Like you mentioned, not the actual queries, though.

The ranking experiment just added to AutoML does hash the queryIds, so I don't believe we would encounter this issue there.
I see. Thanks for the clarification @Lynx1820. In this case, hashing will make the first fold not be over-sized, but there will still be some fold (perhaps not the first) that is over-sized.
Using hashing as opposed to value-to-key is preferable for another reason, which is relevant for the actual group Id column, not for the column used for CV: if you use value-to-key and include that transformer in the model, then when you test the model on new data with query Ids that were not seen in the training set, their group Ids will all be mapped to 0, making the evaluator output wrong results (since all the queries that were not part of the training set will be considered as one group).

By the way, if I remember correctly, I think that LightGBM or FastTreeRanking (I don't remember which) checks that a group Id that has already been seen is not seen again later in the dataset, so reusing query Ids in the same dataset might not work - has this been tested?
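To illustrate the difference (a sketch; column names assumed):

```csharp
// (a) Value-to-key: query ids unseen during training all map to the
//     "missing" key (0), so the evaluator sees them as a single group.
var valueToKey = mlContext.Transforms.Conversion.MapValueToKey("GroupIdKey", "GroupId");

// (b) Hashing: any query id, seen during training or not, gets a
//     well-defined key, so new queries keep distinct group ids.
var hashed = mlContext.Transforms.Conversion.Hash("GroupIdKey", "GroupId", numberOfBits: 31);
```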
Hi, @yaeldekel

> By the way, if I remember correctly, I think that LightGBM or FastTreeRanking (I don't remember which) checks that a group Id that has already been seen is not seen again later in the dataset, so reusing query Ids in the same dataset might not work - has this been tested?
I'm not sure what you're referring to. Perhaps it's about the thing I noticed here: https://github.com/dotnet/machinelearning/issues/5022#issuecomment-630577977 (in the "Finally, ..." paragraph), which was later confirmed by @justinormont here: https://github.com/dotnet/machinelearning/issues/5022#issuecomment-630588314 . When training a LightGBM ranking model, samples with the same GroupId are grouped together only if they appear contiguously in the dataset... if they don't, we'll pretty much change the GroupId to another one, in order to comply with LightGBM's requirement that GroupIds be ordered in blocks (i.e., all samples with GroupId "1" should be contiguous, all with GroupId "2" should also be contiguous, and so on... this requirement comes from the LightGBM library, so our wrapper is only complying with it, and also for the reason raised by justin in the comment I linked to). So this doesn't throw exceptions or make LightGBM ignore samples, but it can output a not-so-correct model in some cases.

I'm not sure how that could be affected (or not) by the issue described here in #3711 .
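Roughly, the effect is this (a sketch of the behavior described above, not the actual wrapper code):

```csharp
using System.Collections.Generic;

// Group boundaries are derived from contiguous runs of the GroupId
// column, so a GroupId that reappears later in the data starts a *new*
// group rather than rejoining the earlier one.
static List<int> ContiguousGroupSizes(IReadOnlyList<uint> groupIds)
{
    var sizes = new List<int>();
    for (int i = 0; i < groupIds.Count; i++)
    {
        if (i == 0 || groupIds[i] != groupIds[i - 1])
            sizes.Add(0); // a new run, and thus a new group, starts here
        sizes[sizes.Count - 1]++;
    }
    return sizes;
}

// ContiguousGroupSizes(new uint[] { 1, 1, 2, 2, 1 }) -> [2, 2, 1]:
// the trailing GroupId "1" becomes a separate group of size 1.
```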
So a couple of NimbusML tests are failing because of this issue (#3711), even though we're hashing the GroupId column.

What happens is that the failing NimbusML tests use a dataset which has only 8 unique GroupIds, and run Cross Validation with 5 folds using the GroupId string column as the sampling key column (a.k.a. "stratification column"). The problem is that the returned 2nd fold has an empty test dataset, and this causes problems in the test (which expects data in the test sets of every fold).
The reason for this is that when running CV, if we have a string column as the samplingKeyColumn, we hash it and then add a RangeFilter on top of it, dividing the filter's range into slices of size 1/number_of_folds. The RangeFilter then takes the hashed GroupId, divides it by the number of possible hashes (i.e., the cardinality of the hashed key column, 1073741824 for 31-bit hashes such as the one used in this case), and decides whether the sample makes it into the filtered dataview (link to code).
So, for example, let's say that we have the string "GroupA" in the GroupId column, and that it hashes to 512345678. Then the range filter will assign it a value of 512345678/1073741824 = 0.477. If we run CV with 5 folds, the RangeFilter will put the samples in [0, .2) into the test set of the first fold, [.2, .4) into the second fold, [.4, .6) into the third fold, and so on... Since "GroupA" was assigned 0.477, it will be in the test set of the third fold.

Going back to the NimbusML tests, they fail because none of the 8 unique GroupIds hashes to a value that ends up in the test set of the 2nd fold. If my math is correct, the probability of the 2nd fold ending up with an empty test set is (4/5)^8 = 0.1677, so these tests had a ~17% chance of failing with an empty 2nd fold.
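In code, the fold assignment above works out like this (a sketch using the constants from the example; the real logic lives behind the "link to code" reference):

```csharp
const double hashCardinality = 1073741824; // number of possible hash values
const int numberOfFolds = 5;

uint hashedGroupId = 512345678;            // hash of "GroupA" in the example
double normalized = hashedGroupId / hashCardinality;  // = 0.477
int testFold = (int)(normalized * numberOfFolds);     // = 2, i.e. the third fold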
I guess the general advice for this issue, until we actually find a way to fix it, is to simply reduce the number of folds to reduce the probability of ending up with empty test sets. E.g., in the scenario I've mentioned, lowering the number of folds to 4 makes the problem go away (no empty test set). I'm still unsure how to calculate the probability of getting at least one empty test set given the number of unique GroupIds and the number of folds, so I'm unable to give users a general rule of thumb for avoiding this issue (given that they know their number of unique GroupIds), though under a uniform-hashing assumption the occupancy formula sketched below would give it.
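Assuming every unique GroupId is hashed into one of the k folds uniformly and independently (an idealization), inclusion-exclusion over the folds gives the probability of at least one empty fold; a sketch:

```csharp
using System;

// Occupancy (inclusion-exclusion) formula: probability that at least one
// of k folds receives none of the g unique groups, under uniform hashing.
static double ProbAtLeastOneEmptyFold(int k, int g)
{
    double p = 0;
    long binom = 1; // running C(k, i); the incremental update below is exact
    for (int i = 1; i < k; i++)
    {
        binom = binom * (k - i + 1) / i;
        double term = binom * Math.Pow((k - i) / (double)k, g);
        p += (i % 2 == 1) ? term : -term;
    }
    return p;
}

// ProbAtLeastOneEmptyFold(5, 8) ~= 0.677 -- with 8 unique GroupIds and
// 5 folds, *some* empty test set is more likely than not, while the
// chance that one specific fold (e.g. the 2nd) is empty is (4/5)^8 ~= 0.168.
```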
Anyway, I want to leave this for the record, so that we know this is still a problem even if we prehash the column (cc: @Lynx1820 , does the new Ranking API for AutoML run Cross Validation using the GroupId as the samplingKeyColumn? Would there be problems/exceptions if there's an empty fold? If so, we might want to keep this issue in mind, in case it causes trouble in the future).
For the record: as discussed offline, I proposed a solution to this problem that would involve redesigning how we split data in TrainTestSplit, CVSplit (and the other related methods) to not use a RangeFilter, but another filter (perhaps a whole new Filter would need to be created) that actually keeps track of how many samples are assigned to each fold, so as to keep the folds balanced while still respecting the SamplingKeyColumn (a.k.a. stratification column, i.e. GroupId in Ranking tasks). A rough sketch of that bookkeeping is below.
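This is hypothetical, not existing ML.NET code: route every distinct sampling-key value to whichever fold currently holds the fewest rows, so rows sharing a key stay together while folds stay balanced.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the balancing idea behind the proposed filter.
static int[] AssignFolds(IReadOnlyList<uint> samplingKeys, int numberOfFolds)
{
    var foldOfKey = new Dictionary<uint, int>();
    var foldSizes = new int[numberOfFolds];
    var assignment = new int[samplingKeys.Count];

    for (int row = 0; row < samplingKeys.Count; row++)
    {
        uint key = samplingKeys[row];
        if (!foldOfKey.TryGetValue(key, out int fold))
        {
            fold = 0; // a new key goes to the currently smallest fold
            for (int f = 1; f < numberOfFolds; f++)
                if (foldSizes[f] < foldSizes[fold]) fold = f;
            foldOfKey[key] = fold;
        }
        foldSizes[fold]++;
        assignment[row] = fold;
    }
    return assignment;
}
```

Note that, unlike the RangeFilter (which is stateless per row), this needs to carry state across rows, which is why it would probably require a new Filter rather than reusing the existing one.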
Still, as discussed with @harishsk , I'll change this issue from P1 to P2. Although it's a valid bug, we are unsure how much impact it will have on users (even with the new Ranking CV API and Ranking API for AutoML). Moreover, there are simple workarounds for this issue, namely:

- If they're getting this issue because the dataset has few unique GroupIds (as described in my previous comment), then simply lower the number of folds (this will avoid empty folds, though not necessarily produce balanced folds... but I guess it's better than getting empty folds).
- If they're getting this issue because they're loading the data as @yaeldekel described in her original post, then simply avoid loading the data in that manner: load the column as an int and then add a ValueToKey transformer on top of it.
The change in KeyDataViewType to disallow unknown-cardinality keys fixed a silent bug in CV that caused the group column not to be used for stratification (causing label leakage between the training and test sets), but now that it is fixed, there is a new bug. CV now uses the GroupId column for stratification, but the values in the GroupId column may not be distributed evenly, causing all the examples to be in one fold while the others are empty. For example: assume that the values in the GroupId column are 10,11,...,19, and that they are loaded with TextLoader as a key type column. The key type would be of cardinality 20, and if there are 2 folds, then CV will try to create one fold with all the examples with values 0,...,9 and the other fold with all the examples with values 10,...,19.
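A minimal repro sketch of this scenario (the file name and column indices are made up):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// GroupId values 10..19 on disk, declared as a key of cardinality 20.
var loader = mlContext.Data.CreateTextLoader(new[]
{
    new TextLoader.Column("Label", DataKind.Single, 0),
    new TextLoader.Column("GroupId", DataKind.UInt32, 1) { KeyCount = new KeyCount(20) },
});
var data = loader.Load("ranking.tsv");

var folds = mlContext.Data.CrossValidationSplit(
    data, numberOfFolds: 2, samplingKeyColumnName: "GroupId");

// Splitting over the key's *range*: values 10..19 all fall in the upper
// half, so one fold's test set ends up empty and the other gets all rows.
```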