yaeldekel opened this issue 5 years ago
Changed to P1 as per @yaeldekel's recommendation, since this is specific to TLC.
I believe this problem isn't particular to Ranking and CV with a GroupId column; it applies in general when using a KeyDataViewType column as the SamplingKeyColumn in either TrainTestSplit or CrossValidation, and only in the case that the KDVT column has values that aren't evenly distributed across its range.
But in what cases would this occur? You point to this case:

> Assume that the values in the GroupId column are 10,11,...,19, and that they are loaded with TextLoader as a key type column

My understanding so far is that we can't actually load a column as KeyDataViewType using the TextLoader. I believe we can only specify the DataKind of a column using these DataKinds (which include unsigned integers, but not actually KeyDataViewType). So we could load the GroupId column you mention as an unsigned integer, but then, to actually make it a Key type column, we'd need to explicitly add a ValueToKeyTransformer (or a OneHotEncoding, or a HashingTransformer, etc.), and that would make the key range 0-10 instead of 0-20, so the keys would be balanced when doing the splits. Please correct me if I'm wrong.
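For concreteness, the workaround I have in mind looks roughly like this (a sketch; the file name and column layout are made up):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// Load GroupId as a plain unsigned integer (no key type at load time).
var loader = mlContext.Data.CreateTextLoader(new[]
{
    new TextLoader.Column("Label", DataKind.Single, 0),
    new TextLoader.Column("GroupId", DataKind.UInt32, 1),
});
var data = loader.Load("ranking_data.tsv");

// ValueToKeyTransformer assigns dense key ids to the 10 distinct values,
// so the resulting key type has cardinality 10 with no unused slots.
var keyed = mlContext.Transforms.Conversion
    .MapValueToKey("GroupIdKey", "GroupId")
    .Fit(data)
    .Transform(data);
```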
If I am right about this, is there any other case where we actually allow the user to create a KeyDataViewType column with values that are "unbalanced"? Using a ValueToKey/Hashing/OneHotEncoding transformer on top of any given column would result in keys that are evenly distributed, right?

The only other way I can think of (which I'm not even sure is possible, but I think it would be) is if the user loads a column and then applies a TypeConvertingTransformer to get a KeyDataViewType. I'm not sure about all the standard conversions we have, but I guess that converting an unsigned integer column into a KDVT in this manner could produce the problem you describe (if the original values aren't distributed evenly and this column is used as the SamplingKeyColumn for the splits). But, although this would still be somewhat of a bug, the simple workaround would be to suggest that the user use a ValueToKeyTransformer or a HashingTransformer instead of a TypeConverting one.

Are there other ways to get a KDVT with the characteristics you've mentioned?
@justinormont also brought up that queryIds in ranking are often reused, and thus popular queries will often be found earlier in the dataset, making the first fold over-sized; so the better approach is to hash the groupId.
@antoniovs1029 , the API for loading key types directly in the text loader is probably not very commonly used, but it does exist: https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs#L73-L83
Query IDs are often multi-digit numbers, all in a certain range that does not necessarily begin with 0. In previous, very old versions of ML.NET, the KeyDataViewType had not only a KeyCount, but an actual range, so you could define a key type with count=10 but values between 458 and 467, for example. I don't know how common this case is, but it might become more common if AutoML starts supporting ranking. Either way, I guess fixing this issue can be postponed until there is an actual use case where this is a problem.
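For reference, the loader-level key declaration looks roughly like this (a sketch based on the linked code; the KeyCount value is just illustrative):

```csharp
// Sketch: declaring a key-typed column directly at load time. With
// on-disk values 10..19, this yields a key type of cardinality 20,
// half of whose range is never used -- the situation described above.
var loader = mlContext.Data.CreateTextLoader(new[]
{
    new TextLoader.Column("GroupId", DataKind.UInt32, 0) { KeyCount = new KeyCount(20) },
});
```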
@Lynx1820 , I am not sure I understand why hashing the groupId would help in this case - if two queries have the same queryId, then the hash value would also be the same, wouldn't it?
Thanks for your reply, @yaeldekel . I wasn't aware of that possibility for TextLoader to also load KeyDataViewTypes directly. Good to know about it. I also don't know how common its usage is, so I'd agree that we can postpone fixing this issue until/if an actual user hits it.

I don't know if that TextLoader API is being used by the new AutoML API for Ranking. @Lynx1820 ? Maybe this wouldn't be much of an issue there?
@yaeldekel Assuming that queryIds are reused and thus more popular queryIds are found earlier in the dataset, the most popular queryIds will be mapped to lower key values. Thus, hashing would redistribute the queryIds with large sets of queries. Like you mentioned, not the actual queries, though.

The ranking experiment just added to AutoML does hash the queryIds, so I don't believe we would encounter this issue there.
I see. Thanks for the clarification @Lynx1820. In this case, hashing will make the first fold not be over-sized, but there will still be some fold (perhaps not the first) that is over-sized.
Using hashing as opposed to value-to-key is preferable for another reason, which is relevant for the actual group Id column, not for the column used for CV: if you use value-to-key and include that transformer in the model, then when you test the model on new data with query Ids that were not seen in the training set, their group Ids will all be mapped to 0, making the evaluator output wrong results (since all the queries that were not part of the training set will be considered as one group).

By the way, if I remember correctly, I think that LightGBM or FastTreeRanking (I don't remember which) checks that a group Id that has already been seen is not seen again later in the dataset, so reusing query Ids in the same dataset might not work - has this been tested?
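To illustrate the difference (a sketch; column names assumed):

```csharp
// (a) Value-to-key: query ids unseen during training all map to the
//     "missing" key (0), so the evaluator sees them as a single group.
var valueToKey = mlContext.Transforms.Conversion.MapValueToKey("GroupIdKey", "GroupId");

// (b) Hashing: any query id, seen during training or not, gets a
//     well-defined key, so new queries keep distinct group ids.
var hashed = mlContext.Transforms.Conversion.Hash("GroupIdKey", "GroupId", numberOfBits: 31);
```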
Hi, @yaeldekel

> By the way, if I remember correctly, I think that LightGBM or FastTreeRanking (I don't remember which) checks that a group Id that has already been seen is not seen again later in the dataset, so reusing query Ids in the same dataset might not work - has this been tested?
I'm not sure what you're referring to. Perhaps it's about the thing I noticed here: https://github.com/dotnet/machinelearning/issues/5022#issuecomment-630577977 (in the "Finally, ..." paragraph), which was later confirmed by @justinormont here: https://github.com/dotnet/machinelearning/issues/5022#issuecomment-630588314 . When training a LightGBM ranking model, samples with the same GroupId are grouped together only if they appear contiguously in the dataset... if they don't, we'll pretty much change the GroupId to another one, in order to comply with LightGBM's requirement that GroupIds be ordered in blocks (i.e., all samples with GroupId "1" should be contiguous, all with GroupId "2" should also be contiguous, and so on... this requirement comes from the LightGBM library, so our wrapper is only complying with it, and also for the reason raised by justin in the comment I linked to). So this doesn't throw exceptions or make LightGBM ignore samples, but it can output a not-so-correct model in some cases.

I'm not sure how that could be affected (or not) by the issue described here in #3711 .
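Roughly, the effect is this (a sketch of the behavior described above, not the actual wrapper code):

```csharp
using System.Collections.Generic;

// Group boundaries are derived from contiguous runs of the GroupId
// column, so a GroupId that reappears later in the data starts a *new*
// group rather than rejoining the earlier one.
static List<int> ContiguousGroupSizes(IReadOnlyList<uint> groupIds)
{
    var sizes = new List<int>();
    for (int i = 0; i < groupIds.Count; i++)
    {
        if (i == 0 || groupIds[i] != groupIds[i - 1])
            sizes.Add(0); // a new run, and thus a new group, starts here
        sizes[sizes.Count - 1]++;
    }
    return sizes;
}

// ContiguousGroupSizes(new uint[] { 1, 1, 2, 2, 1 }) -> [2, 2, 1]:
// the trailing GroupId "1" becomes a separate group of size 1.
```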
So a couple of NimbusML tests are failing because of this issue (#3711), even though we're hashing the GroupId column.

What happens is that the failing NimbusML tests use a dataset which has only 8 unique GroupIds, and run Cross Validation with 5 folds using the GroupId string column as the sampling key column (a.k.a. "stratification column"). The problem is that the returned 2nd fold has an empty test dataset, and this causes problems in the test (which expects data in the test sets of every fold).
The reason for this is that when running CV, if we have a string column as the samplingKeyColumn, we hash it and then add a RangeFilter on top of it, dividing the filter's range into slices of size 1/number_of_folds. The RangeFilter then takes the hashed GroupId, divides it by the number of possible hashes (i.e., the cardinality of the hashed key column, 1073741824 for 31-bit hashes such as the one used in this case), and decides whether the sample makes it into the filtered dataview (link to code).
So, for example, let's say that we have the string "GroupA" in the GroupId column, and that it hashes to 512345678. Then the range filter will assign it a value of 512345678/1073741824 = 0.477. If we run CV with 5 folds, the RangeFilter will put the samples in [0, .2) into the test set of the first fold, [.2, .4) into the second fold, [.4, .6) into the third fold, and so on... Since "GroupA" was assigned 0.477, it will be in the test set of the third fold.

Going back to the NimbusML tests, they fail because none of the 8 unique GroupIds hashes to a value that ends up in the test set of the 2nd fold. If my math is correct, the probability of the 2nd fold ending up with an empty test set is (4/5)^8 = 0.1677, so these tests had a ~17% chance of failing with an empty 2nd fold.
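In code, the fold assignment above works out like this (a sketch using the constants from the example; the real logic lives behind the "link to code" reference):

```csharp
const double hashCardinality = 1073741824; // number of possible hash values
const int numberOfFolds = 5;

uint hashedGroupId = 512345678;            // hash of "GroupA" in the example
double normalized = hashedGroupId / hashCardinality;  // = 0.477
int testFold = (int)(normalized * numberOfFolds);     // = 2, i.e. the third fold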
I guess the general advice for this issue, until we actually find a way to fix it, is to simply reduce the number of folds to reduce the probability of ending up with empty test sets. E.g., in the scenario I've mentioned, lowering the number of folds to 4 makes the problem go away (no empty test set). I'm still unsure how to calculate the probability of getting at least one empty test set given the number of unique GroupIds and the number of folds, so I'm unable to give users a general rule of thumb for avoiding this issue (given that they know their number of unique GroupIds), though under a uniform-hashing assumption the occupancy formula sketched below would give it.
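Assuming every unique GroupId is hashed into one of the k folds uniformly and independently (an idealization), inclusion-exclusion over the folds gives the probability of at least one empty fold; a sketch:

```csharp
using System;

// Occupancy (inclusion-exclusion) formula: probability that at least one
// of k folds receives none of the g unique groups, under uniform hashing.
static double ProbAtLeastOneEmptyFold(int k, int g)
{
    double p = 0;
    long binom = 1; // running C(k, i); the incremental update below is exact
    for (int i = 1; i < k; i++)
    {
        binom = binom * (k - i + 1) / i;
        double term = binom * Math.Pow((k - i) / (double)k, g);
        p += (i % 2 == 1) ? term : -term;
    }
    return p;
}

// ProbAtLeastOneEmptyFold(5, 8) ~= 0.677 -- with 8 unique GroupIds and
// 5 folds, *some* empty test set is more likely than not, while the
// chance that one specific fold (e.g. the 2nd) is empty is (4/5)^8 ~= 0.168.
```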
Anyway, I want to leave this for the record, so that we know this is still a problem even if we prehash the column (cc: @Lynx1820 , does the new Ranking API for AutoML run Cross Validation using the GroupId as the samplingKeyColumn? Would there be problems/exceptions if there's an empty fold? If so, we might want to keep this issue in mind, in case it causes trouble in the future).
For the record: as discussed offline, I proposed a solution to this problem that would involve redesigning how we split data in TrainTestSplit, CVSplit (and the other related methods) to not use a RangeFilter, but another filter (perhaps a whole new Filter would need to be created) that actually keeps track of how many samples are assigned to each fold, so as to keep the folds balanced while still respecting the SamplingKeyColumn (a.k.a. stratification column, i.e. GroupId in Ranking tasks). A rough sketch of that bookkeeping is below.
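This is hypothetical, not existing ML.NET code: route every distinct sampling-key value to whichever fold currently holds the fewest rows, so rows sharing a key stay together while folds stay balanced.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the balancing idea behind the proposed filter.
static int[] AssignFolds(IReadOnlyList<uint> samplingKeys, int numberOfFolds)
{
    var foldOfKey = new Dictionary<uint, int>();
    var foldSizes = new int[numberOfFolds];
    var assignment = new int[samplingKeys.Count];

    for (int row = 0; row < samplingKeys.Count; row++)
    {
        uint key = samplingKeys[row];
        if (!foldOfKey.TryGetValue(key, out int fold))
        {
            fold = 0; // a new key goes to the currently smallest fold
            for (int f = 1; f < numberOfFolds; f++)
                if (foldSizes[f] < foldSizes[fold]) fold = f;
            foldOfKey[key] = fold;
        }
        foldSizes[fold]++;
        assignment[row] = fold;
    }
    return assignment;
}
```

Note that, unlike the RangeFilter (which is stateless per row), this needs to carry state across rows, which is why it would probably require a new Filter rather than reusing the existing one.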
Still, as discussed with @harishsk , I'll change this issue from P1 to P2. Although it's a valid bug, we are unsure how much impact it will have on users (even with the new Ranking CV API and Ranking API for AutoML). Moreover, there are simple workarounds for this issue, namely:

- If they're getting this issue because the dataset has few unique GroupIds (as described in my previous comment), then simply lower the number of folds (this will avoid empty folds, though not necessarily produce balanced folds... but I guess it's better than getting empty folds).
- If they're getting this issue because they're loading the data as @yaeldekel described in her original post, then simply avoid loading the data in that manner: load the column as an int and then add a ValueToKey transformer on top of it.
The change in KeyDataViewType to disallow unknown-cardinality keys fixed a silent bug in CV that caused the group column not to be used for stratification (causing label leakage between the training and test sets), but now that it is fixed, there is a new bug. CV now uses the GroupId column for stratification, but the values in the GroupId column may not be distributed evenly, causing all the examples to be in one fold while the others are empty. For example: assume that the values in the GroupId column are 10,11,...,19, and that they are loaded with TextLoader as a key type column. The key type would be of cardinality 20, and if there are 2 folds, then CV will try to create one fold with all the examples with values 0,...,9 and the other fold with all the examples with values 10,...,19.
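A minimal repro sketch of this scenario (the file name and column indices are made up):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// GroupId values 10..19 on disk, declared as a key of cardinality 20.
var loader = mlContext.Data.CreateTextLoader(new[]
{
    new TextLoader.Column("Label", DataKind.Single, 0),
    new TextLoader.Column("GroupId", DataKind.UInt32, 1) { KeyCount = new KeyCount(20) },
});
var data = loader.Load("ranking.tsv");

var folds = mlContext.Data.CrossValidationSplit(
    data, numberOfFolds: 2, samplingKeyColumnName: "GroupId");

// Splitting over the key's *range*: values 10..19 all fall in the upper
// half, so one fold's test set ends up empty and the other gets all rows.
```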