dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

No support for stratified split in dask_ml.model_selection.train_test_split #535

Open chauhankaranraj opened 5 years ago

chauhankaranraj commented 5 years ago

scikit-learn implementation of train test split (sklearn.model_selection.train_test_split) supports splitting data according to class labels (stratified split) by using the argument stratify. This is especially useful when datasets have high class imbalance. It would be really helpful to have this feature in dask_ml as well.
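For reference, the scikit-learn behavior being requested looks like this (a minimal sketch with made-up data; with stratify=y each split preserves the class ratio):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)   # 80/20 class imbalance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
# both splits preserve the 80/20 ratio
print(np.bincount(y_train), np.bincount(y_test))   # [4 1] [4 1]
```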

TomAugspurger commented 5 years ago

Agreed. Are you interested in working on this?

chauhankaranraj commented 5 years ago

Tempted to say yes, but I don't know the codebase/internals very well (specifically, I'm not sure how we can get a stratified split with blockwise=False not implemented for the ShuffleSplit class).

So it'd be faster if someone more knowledgeable could volunteer. If not then I'd be happy to give it a shot, but it might take some time.

TomAugspurger commented 5 years ago

That's great if you're willing to try. Let us know if you get stuck.

tiagofassoni commented 4 years ago

Hey, Tom, I'm thinking of picking this up. My doubt is:

Say we have a big csv file with 2 categories and two partitions of the data.

So file_0 has categories 0 and 1, while file_1 has only category 1.

My first thought was to just use the stratify parameter of scikit-learn, but in this case that wouldn't work. Another idea would be to compute all the categories beforehand and pass those to the stratify parameter, but seems overly complicated and prone to a ton of edge cases.

I'd be glad to pick this up, as it would help in some research I'm doing.

TomAugspurger commented 4 years ago

@tiagofassoni great! dask-ml's OneHotEncoder may be helpful here. It will use the Categorical dtype for pandas dataframes. Otherwise you can (or may need to) pass the categories manually as a list / array. Does that make sense?
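The point about the Categorical dtype can be sketched in plain pandas (a minimal illustration): declaring the categories up front means every partition agrees on the full set of classes, even when a class is absent from a given partition.

```python
import pandas as pd

# this partition never sees class 0, but the dtype still knows it exists
s = pd.Series([1, 1, 1], dtype=pd.CategoricalDtype(categories=[0, 1]))
print(list(s.cat.categories))   # [0, 1]
```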

In other places that just work with arrays, like Incremental, we require that the classes (groups in this case) be specified ahead of time.

jerrytim commented 4 years ago

Is there any luck with this feature request? In the case of a huge imbalanced dataset, the stratify argument in train_test_split is very useful.

TomAugspurger commented 4 years ago

I’m not aware of any progress. Perhaps Tiago can share a status update.

tiagofassoni commented 4 years ago

Hello, @TomAugspurger, @jerrytim. Got to try my hand at this just last week and... gotta say, I have no idea how to make it work. I don't know why OneHotEncoder would be helpful, if at all.

I was thinking of using something like Pandas' value_counts for the series results and then trying to make a shuffle, but I don't know if such an approach is feasible.

chauhankaranraj commented 4 years ago

@TomAugspurger I agree with @tiagofassoni - I'm not sure how OneHotEncoder can be used. But I also don't understand how value_counts can be used - @tiagofassoni could you please elaborate?

There are two things I wanted to bring into the discussion that might help us better decide how to implement this. IIUC splitting is handled differently for da.Array and dd.Series/dd.DataFrame, correct?

  1. For dd.Series/dd.DataFrame, the heavy lifting is done by random_split, but I couldn't find its source code, so I'm not 100% sure how to deal with that case.
  2. For da.Array, the heavy lifting is done by ShuffleSplit and _blockwise_slice. Could we take the parts of the input array that belong to a particular class, compute the chunks of this subarray, apply the same ShuffleSplit+_blockwise_slice strategy to it, repeat for all classes, and finally concatenate the results? This would be along the same lines as @tiagofassoni's comment:

    Another idea would be to compute all the categories beforehand and pass those to the stratify parameter, but seems overly complicated and prone to a ton of edge cases.

TomAugspurger commented 4 years ago

random_split but I couldn't find its source code. So I'm not 100% sure how to deal with that case.

That's in dask.dataframe.DataFrame.random_split

compute all the categories beforehand and pass those to the stratify parameter, but seems overly complicated and prone to a ton of edge cases.

In these cases we typically require the user to provide the set of classes up front. But in this case, do we need to make a pass over the data to determine the frequency of each class? Or can that be done lazily?

chauhankaranraj commented 4 years ago

That's in dask.dataframe.DataFrame.random_split

Gotcha, thanks! I'll take a look :)

In these cases we typically require the user to provide the set of classes up front. But in this case, do we need to make a pass over the data to determine the frequency of each class? Or can that be done lazily?

Yeah, I agree - having the classes up front would be ideal. We could still compute the classes (da.unique on the stratify array), but I don't think that can be done lazily, so it wouldn't be ideal.
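Concretely, da.unique builds a lazy graph, but the result has unknown shape, so actually knowing the classes requires a compute (toy data for illustration):

```python
import numpy as np
import dask.array as da

y = da.from_array(np.array([0, 1, 1, 0, 2]), chunks=2)
classes = da.unique(y)      # lazy, but the result's shape is unknown
print(classes.compute())    # forces a pass over the data: [0 1 2]
```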

Maybe I'm missing something here, but do we really need the frequencies? This might be a little far from optimal, but could we do something along these lines:

# sketch against dask_ml's train_test_split internals: `arrays`, `classes`,
# `_stratify`, `splitter` (a ShuffleSplit), and `_blockwise_slice` are
# assumed to be in scope, as in the existing implementation
train_test_pairs = []
for arr in arrays:

    # create subarrays for each class, apply split on subarrays individually
    arr_train_test_pairs = [[], []]
    for ci in classes:
        ci_arr = arr[_stratify == ci]
        ci_arr.compute_chunk_sizes()
        train_idx, test_idx = next(splitter.split(ci_arr))
        arr_train_test_pairs[0].append(_blockwise_slice(ci_arr, train_idx))
        arr_train_test_pairs[1].append(_blockwise_slice(ci_arr, test_idx))

    # concat all train subarr as 1 train arr, all test subarr as 1 test arr
    arr_train_test_pairs[0] = da.concatenate(arr_train_test_pairs[0])
    arr_train_test_pairs[1] = da.concatenate(arr_train_test_pairs[1])
    train_test_pairs.append(arr_train_test_pairs)

return list(itertools.chain.from_iterable(train_test_pairs))

trail-coffee commented 4 years ago

@chauhankaranraj does that split the class subarrays evenly? So it's a stratified train/test with 0.5 test and 0.5 train?

Note: I'm a data scientist, not a developer...

Sklearn uses np.bincount in class StratifiedShuffleSplit in sklearn.model_selection._split to get frequencies and split accordingly.
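The np.bincount step can be sketched like this (toy data; sklearn's actual per-class allocation is more careful about rounding remainders than the plain round shown here):

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 2])
counts = np.bincount(y)
print(counts)   # [3 2 1]

# rough per-class test allocation for a 1/3 test fraction
print(np.round(counts * (1 / 3)).astype(int))   # [1 1 0]
```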

chauhankaranraj commented 4 years ago

@chauhankaranraj does that split the class subarrays evenly? So it's a stratified train/test with 0.5 test and 0.5 train?

@ericbassett It should split it in whatever train/test ratio is provided as input. The splitter used here is the instance of ShuffleSplit that gets created here. IIUC it takes care of splitting by ratios provided.

I'll submit a WIP PR soon so this discussion becomes more concrete :)

trail-coffee commented 4 years ago

Very nice, makes sense.

chauhankaranraj commented 4 years ago

Hey folks,

I made an attempt to implement the stratified split here. I could do it lazily for dask Series and DataFrames, but not completely lazily for dask Array (calling compute_chunk_sizes()).

Does anyone have ideas to get around this? Would it be possible to "enforce" chunk size instead of computing it? [e.g. if chunk size for the whole array is (x, 10) then chunk size for the part of the array that belongs to a class with weight 15% should be (0.15x, 10)]
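For context, this is the unknown-chunks problem in miniature (toy data): boolean-mask indexing yields chunks of unknown size, and compute_chunk_sizes() resolves them with a pass over the data.

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(10), chunks=5)
sub = x[x % 2 == 0]        # boolean indexing -> unknown chunk sizes
print(sub.chunks)          # ((nan, nan),)
sub.compute_chunk_sizes()  # one pass over the data resolves them
print(sub.chunks)          # ((3, 2),)
```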

Any feedback in general would be highly appreciated :pray:

Also, if you feel this discussion should be moved to a WIP PR, I can open that too.

TomAugspurger commented 4 years ago

May be easiest to move to a PR. We might be able to do things lazily for dask array; we'll just probably end up with unknown chunk sizes.

chauhankaranraj commented 4 years ago

@TomAugspurger Sure thing. Opened this WIP PR yesterday

ashokrayal commented 2 years ago

Any progress on this task? :)

kennylids commented 2 years ago

I need the stratify feature in train_test_split as well for my imbalanced dataset. Any updates?

chauhankaranraj commented 2 years ago

Hey folks, sorry but I haven't had the chance to continue working on this. I did open a WIP PR (#635) so if anyone would like to fork off of it or just start from scratch, feel free to do so! Let me know if you'd like anything from me in doing so.