dmbee / seglearn

Python module for machine learning time series:
https://dmbee.github.io/seglearn/
BSD 3-Clause "New" or "Revised" License

Resampling with imbalanced-learn samplers #15

Closed qtux closed 5 years ago

qtux commented 5 years ago

Hi David,

I added the patch_sampler(imblearn_sampler_class) function, which derives a dynamically created (and picklable) sampler class that is compatible with Pype. The derived class implements a transform method that returns the data unchanged, while its fit_transform method calls the fit_resample method of the imbalanced-learn sampler to resample the data. This split ensures that resampling is applied only to training data and never to test data (the example shows that Pype.fit calls fit_transform, whereas score calls transform).

Cheers, Matthias
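
For readers following along, the fit_transform/transform split described above can be sketched as follows. This is a minimal illustration and not the code from the pull request: `ToyUnderSampler` is a hypothetical stand-in for an imbalanced-learn sampler (only `fit_resample` is modeled), `patch_sampler_sketch` is a hypothetical name, and the `(X, y, sample_weight)` return convention is an assumption about what Pype expects. Picklability of the derived class is not addressed here.

```python
import random
from collections import Counter


class ToyUnderSampler:
    """Hypothetical stand-in for an imbalanced-learn sampler (fit_resample only)."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y):
        # Keep as many samples per class as the smallest class has.
        rng = random.Random(self.random_state)
        n_min = min(Counter(y).values())
        keep = []
        for label in sorted(set(y)):
            idx = [i for i, v in enumerate(y) if v == label]
            keep.extend(sorted(rng.sample(idx, n_min)))
        return [X[i] for i in keep], [y[i] for i in keep]


def patch_sampler_sketch(sampler_class):
    """Derive a class that resamples in fit_transform and is a no-op in transform."""

    def fit_transform(self, X, y, sample_weight=None, **fit_params):
        # Training path: delegate to the wrapped sampler.
        Xt, yt = self.fit_resample(X, y)
        return Xt, yt, None  # assumption: sample weights are dropped in this sketch

    def transform(self, X, y=None, sample_weight=None):
        # Test/scoring path: return the data unchanged.
        return X, y, sample_weight

    name = "Patched" + sampler_class.__name__
    return type(name, (sampler_class,),
                {"fit_transform": fit_transform, "transform": transform})


X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

sampler = patch_sampler_sketch(ToyUnderSampler)()
X_train, y_train, _ = sampler.fit_transform(X, y)  # resampled
X_test, y_test, _ = sampler.transform(X, y)        # untouched
```

Because only fit_transform resamples, a pipeline that calls fit_transform during fitting and transform during scoring leaves the test data as it was.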

coveralls commented 5 years ago

Pull Request Test Coverage Report for Build 225


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| seglearn/pipe.py | 1 | 2 | 50.0% |
| seglearn/transform.py | 66 | 68 | 97.06% |
| seglearn/util.py | 15 | 18 | 83.33% |
| Total | 282 | 288 | 97.92% |

Totals Coverage Status
Change from base Build 197: 0.5%
Covered Lines: 1923
Relevant Lines: 2022

dmbee commented 5 years ago

Thanks Matthias. I'll have to look this over later this week. Thanks again for the contribution. David

qtux commented 5 years ago

I rebased the commits on the current development branch.

qtux commented 5 years ago

Hi David,

I added an option to shuffle the resampled results. The reason is that some samplers, e.g. RandomUnderSampler, seem to return the X/y arrays sorted by the class in y. This is problematic when fitting a Keras classifier on resampled and segmented data with a fixed validation_split, because Keras takes the validation samples from the end of the arrays before shuffling. Providing validation_data instead of using validation_split seems to be more complex.

Cheers, Matthias
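
The effect described above can be reproduced without Keras. The sketch below is illustrative only: `tail_split` is a hypothetical helper that mimics a validation split taken from the last fraction of the samples, and `shuffle_together` is a hypothetical helper that applies one permutation to X and y so the pairs stay aligned.

```python
import random


def tail_split(y, validation_split):
    """Mimic a split that uses the last fraction of the samples for validation."""
    split_at = int(len(y) * (1 - validation_split))
    return y[:split_at], y[split_at:]


def shuffle_together(X, y, seed=0):
    """Shuffle X and y with the same permutation so each pair stays aligned."""
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    return [X[i] for i in order], [y[i] for i in order]


# Resampled output sorted by class: all of class 0, then all of class 1.
X_sorted = [[i] for i in range(100)]
y_sorted = [0] * 50 + [1] * 50

# Without shuffling, the validation tail contains a single class.
train_sorted, val_sorted = tail_split(y_sorted, 0.2)

# After shuffling, the tail is drawn from both classes.
X_shuf, y_shuf = shuffle_together(X_sorted, y_sorted)
train_shuf, val_shuf = tail_split(y_shuf, 0.2)

print("classes in validation tail, sorted input:  ", sorted(set(val_sorted)))
print("classes in validation tail, shuffled input:", sorted(set(val_shuf)))
```

With class-sorted input the validation set never sees class 0, so validation metrics say nothing about that class; shuffling before the split avoids this.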

dmbee commented 5 years ago

Thanks Matthias - I have to spend some more time looking at this. I am also working with the scikit-learn project on how best to integrate resampling into their pipeline.

Here is the thread for the discussion: https://github.com/scikit-learn/scikit-learn/issues/3855

dmbee commented 5 years ago

> The reason for this feature is that e.g. the RandomUnderSampler seems to sort the X/y arrays by the class of y.

This is really good to know.

qtux commented 5 years ago

I rebased the commits on the current development branch.

qtux commented 5 years ago

I rebased the commits on the current development branch.

qtux commented 5 years ago

Hi David,

I rebased the resampling patches onto the master branch and squashed the commits so that they are easier to revert. What do you think about merging this patch set? It seems that scikit-learn needs some more time before they might provide this feature (cf. https://github.com/scikit-learn/scikit-learn/pull/13269).

Should I change this pull request from the dev to the master branch?

Cheers, Matthias

dmbee commented 5 years ago

Hi Matthias,

I really appreciate your work on this. I am pretty busy over the next two weeks but promise to look over this again soon. Last time I wasn't too keen on adding imblearn as a dependency.

Let me look it over again and then we can discuss.

David

qtux commented 5 years ago

Hi David,

any news?

Cheers, Matthias

dmbee commented 5 years ago

Matthias - I truly apologize for the delay; I am currently writing my thesis. This looks great. Can you please rebase onto the current master? I will merge and deploy as soon as that's done.

I appreciate all your work on this really useful patch.

David

qtux commented 5 years ago

Hi David,

no worries, all the best for your thesis :).

Cheers, Matthias

dmbee commented 5 years ago

Thanks Matthias