JuliaML / MLDataUtils.jl

Utility package for generating, loading, splitting, and processing Machine Learning datasets
http://mldatautilsjl.readthedocs.io/

Over and under sampling #26

Closed oxinabox closed 7 years ago

oxinabox commented 7 years ago

One thing I can't quite work out is how to fit in oversampling via synthesis. I think it might require its own method.

I kind of want a way to do it generically, so you could synthesize new examples by (for example) training a GMM on the examples you have of the class, and then sampling from that.
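
A rough sketch of what "generic" could mean here, written in Python with only the standard library. Everything in it is an assumption made for illustration: the names oversample_by_synthesis and fit_diag_gaussian are hypothetical, and an independent Gaussian per feature stands in for the GMM. The point is only that the caller passes in a fit function that returns a sampler, so any generative model could be swapped in.

```python
import random
import statistics
from collections import defaultdict


def fit_diag_gaussian(rows):
    """Fit one independent Gaussian per feature (a simple stand-in for a GMM).

    Returns a sampler: a function that draws one synthetic row from the fit.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]

    def sample(rng):
        return [rng.gauss(m, s) for m, s in zip(means, stds)]

    return sample


def oversample_by_synthesis(X, y, fit=fit_diag_gaussian, seed=0):
    """Pad every class up to the size of the largest class with synthetic rows.

    `fit` is any function mapping the rows of one class to a sampler,
    so the generative model is pluggable.
    """
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    target = max(len(rows) for rows in rows_by_class.values())
    X_out, y_out = list(X), list(y)  # keep the original examples untouched
    for label, rows in rows_by_class.items():
        sampler = fit(rows)
        for _ in range(target - len(rows)):
            X_out.append(sampler(rng))
            y_out.append(label)
    return X_out, y_out


# Toy data: class "a" has 4 examples, class "b" has 2.
X = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8], [10.0, 10.0], [11.0, 9.0]]
y = ["a", "a", "a", "a", "b", "b"]
X2, y2 = oversample_by_synthesis(X, y)
```

After the call, both classes have four examples: the originals come first and the two synthetic "b" rows are appended at the end.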

Evizero commented 7 years ago

To brainstorm the naming a bit: I think in general it is a good idea to choose the names that popular frameworks chose, if they are appropriate.

After looking into that a little just now, I think oversample and undersample are pretty good names. Maybe we could offer the aliases upsample and downsample as well.
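
For reference, a minimal sketch of the two operations being named, in Python with only the standard library. The signatures are assumptions for illustration and not the MLDataUtils API: both functions take just the labels and return indices into the data, and the helper _indices_by_class is hypothetical.

```python
import random
from collections import defaultdict


def _indices_by_class(y):
    """Group example indices by their label."""
    groups = defaultdict(list)
    for i, label in enumerate(y):
        groups[label].append(i)
    return groups


def oversample(y, seed=0):
    """Return indices where every class is repeated up to the largest class size."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = max(len(idx) for idx in groups.values())
    out = []
    for idx in groups.values():
        out.extend(idx)  # keep every original example
        out.extend(rng.choices(idx, k=target - len(idx)))  # draw extras with replacement
    return out


def undersample(y, seed=0):
    """Return indices where every class is cut down to the smallest class size."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = min(len(idx) for idx in groups.values())
    out = []
    for idx in groups.values():
        out.extend(rng.sample(idx, target))  # draw without replacement
    return sorted(out)


# Toy labels: five "a" and two "b".
y = ["a"] * 5 + ["b"] * 2
over = oversample(y)
under = undersample(y)
```

Here over selects ten examples (five per class) and under selects four (two per class).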

oxinabox commented 7 years ago

I guess the nice thing about using upsample and downsample is that it avoids confusion with oversampling and undersampling in signal processing. And ML is used a lot in signal processing (it is what everyone else in my lab is doing...).

It is probably worth looking at the literature. The SMOTE paper uses under-sampling and over-sampling: https://www.jair.org/media/953/live-953-2037-jair.pdf

It is my personal opinion that having multiple names for the same thing is confusing. (If I were to put in aliases, I would make them just throw errors saying "You may have intended ....") But I feel it is beyond the scope of this PR to make points on that level of design.
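
To make the "aliases that just throw" idea concrete, here is a small sketch in Python. The helper _did_you_mean is hypothetical, and the exact error type and wording are assumptions; the alias names exist only to point the caller at the canonical name.

```python
def _did_you_mean(alias, canonical):
    """Build a stub for `alias` that always raises, naming the canonical function."""
    def stub(*args, **kwargs):
        raise NameError(f"{alias} is not defined; you may have intended {canonical}")
    stub.__name__ = alias
    return stub


# The aliases are never real implementations, only signposts.
upsample = _did_you_mean("upsample", "oversample")
downsample = _did_you_mean("downsample", "undersample")

try:
    upsample([1, 2, 3])
except NameError as err:
    message = str(err)
```

This keeps one real name per operation while still helping someone who reaches for the other vocabulary.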

Evizero commented 7 years ago

Thanks! I'll merge as is. I am currently working on integrating MLLabelUtils, as well as on a standard way of accessing the "targets" of some data, so it is quite likely I'll adapt parts of your code. I'll ping you when I have pushed some changes.