analyticalmindsltd / smote_variants

A collection of 85 minority oversampling techniques (SMOTE) for imbalanced learning with multi-class oversampling and model selection features
http://smote-variants.readthedocs.io
MIT License

add_new_oversampling_technique_symprod #38

Closed intouchkun closed 2 years ago

intouchkun commented 3 years ago

Hello György Kovács, I have added a new oversampling technique, 'A Synthetic Minority Based on Probabilistic Distribution (SyMProD)', which I implemented and published at https://ieeexplore.ieee.org/document/9119990. Could you please review it? If you find any errors or have any suggestions, please let me know or comment on this PR. Thank you.

gykovacs commented 3 years ago

Hi!

This is great! I'll look into it as soon as possible!

Best Regards, Gyuri Kovács

(Sent by email in reply to Intouch Kunakorntum's message of Sun, 7 Mar 2021, 17:08.)

gykovacs commented 3 years ago

As far as I can see in the CI logs (https://travis-ci.com/github/analyticalmindsltd/smote_variants/jobs/488846311), your implementation fails on some edge cases.

To ensure that the implemented oversamplers do not break the machine learning pipelines they are integrated into, the techniques are tested on a number of edge cases, such as very skewed datasets and datasets with only a couple of vectors. You can check all the tests here https://github.com/analyticalmindsltd/smote_variants/blob/master/tests/tests.py, or even run them on your own.

You should think about issues like this: you try to determine the 5 closest neighbors, but your data consists of only 3 vectors altogether.

Let me know if you need further help in identifying and fixing the issues in the implementation.
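
The neighbor-count edge case mentioned above can be illustrated with a minimal sketch. It is not the SyMProD code from this PR and not part of the package's test suite; the helper name `k_nearest_indices` is hypothetical, and plain Python is used only to keep the example dependency-free. The idea is to clamp the requested number of neighbors to the number of other points that actually exist.

```python
import math


def k_nearest_indices(points, query_index, k):
    """Return indices of up to k nearest neighbors of points[query_index].

    Hypothetical helper for illustration: the requested k is clamped to
    the number of other points available, so a tiny dataset yields fewer
    neighbors instead of raising an error.
    """
    n_others = len(points) - 1
    k_effective = max(0, min(k, n_others))
    query = points[query_index]
    # Distances to every other point, excluding the query itself.
    candidates = [
        (math.dist(query, p), i)
        for i, p in enumerate(points)
        if i != query_index
    ]
    candidates.sort()
    return [i for _, i in candidates[:k_effective]]


# Three vectors altogether, but five neighbors requested.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
neighbors = k_nearest_indices(minority, query_index=0, k=5)
```

With three vectors and `k=5`, the call returns the two neighbors that exist rather than failing. A dataset with a single vector returns an empty list, which the caller would still need to handle explicitly.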

intouchkun commented 3 years ago

Thank you for the information. I'll solve the problems and push again.

codecov[bot] commented 3 years ago

Codecov Report

Merging #38 (d8b2bcf) into master (dedbc3d) will not change coverage. The diff coverage is 0.00%.


@@           Coverage Diff           @@
##           master     #38    +/-   ##
=======================================
  Coverage    0.00%   0.00%            
=======================================
  Files           3       3            
  Lines        7413    7522   +109     
=======================================
- Misses       7413    7522   +109     

Impacted Files                      Coverage Δ
smote_variants/_smote_variants.py   0.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update dedbc3d...d8b2bcf.

gykovacs commented 3 years ago

Thank you! I can see that some tests are still failing, but this seems to be related to updates in packages I use. Let me take over from this point: I will look into it as soon as I can and let you know if there is anything else to do on your end!

Thank you for your contribution so far!

gykovacs commented 2 years ago

Hi @intouchkun, I have added your method SYMPROD to the package (in a separate PR). I made some changes to clean up the implementation. I also think the inverse transformation of the standard scaling was missing as the last step; it is needed to make the new samples comparable to the original ones. It would be great if you could check whether my changes are correct and work.
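
The role of the inverse transformation can be shown with a minimal sketch. It is an illustration under stated assumptions, not the code merged into the package: the function names are hypothetical, the synthetic sample is a simple midpoint standing in for the real sampling step, and plain Python is used only to keep the example dependency-free. A sample generated in the standardized space has to be mapped back with the same means and standard deviations before it can be compared to the original data.

```python
import statistics


def fit_standard_scaler(rows):
    """Compute the per-feature mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Constant features have zero deviation; fall back to 1.0 to avoid
    # dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize each feature: (x - mean) / std."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def inverse_transform(rows, means, stds):
    """Map standardized values back to the original units: x * std + mean."""
    return [[x * s + m for x, m, s in zip(row, means, stds)] for row in rows]


X = [[10.0, 200.0], [12.0, 220.0], [14.0, 240.0]]
means, stds = fit_standard_scaler(X)
X_scaled = transform(X, means, stds)

# Stand-in for a synthetic sample generated in the scaled space:
# the midpoint of the first two scaled samples.
synthetic_scaled = [[(a + b) / 2 for a, b in zip(X_scaled[0], X_scaled[1])]]

# Without this last step the sample would stay in standardized units.
synthetic = inverse_transform(synthetic_scaled, means, stds)
```

Here `synthetic` lands approximately at the midpoint of the first two original rows, whereas `synthetic_scaled` is expressed in units of standard deviations and cannot be appended to `X` directly.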