oxinabox closed 7 years ago
To brainstorm the naming a bit: I think in general it is a good idea to choose names that popular frameworks also chose, if they are appropriate.
`upSample`, `downSample`
`over_sampling`, `under_sampling`
After looking into that a little just now, I think `oversample` and `undersample` are pretty good names. Maybe we could offer the aliases `upsample` and `downsample` as well.
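To make concrete what the two names would refer to, here is a minimal Python sketch of random over- and undersampling for class balancing. It is only an illustration, not the code in this PR: the signatures, the `rng` parameter, and the helper `_group_by_label` are assumptions made for the example.

```python
import random
from collections import defaultdict


def _group_by_label(labels):
    """Map each label to the list of indices where it occurs (hypothetical helper)."""
    groups = defaultdict(list)
    for i, y in enumerate(labels):
        groups[y].append(i)
    return groups


def oversample(data, labels, rng=None):
    """Repeat examples (drawn with replacement) until every class matches the largest one."""
    rng = rng or random.Random()
    groups = _group_by_label(labels)
    target = max(len(idx) for idx in groups.values())
    chosen = []
    for idx in groups.values():
        chosen.extend(idx)  # keep every original example
        chosen.extend(rng.choice(idx) for _ in range(target - len(idx)))
    rng.shuffle(chosen)
    return [data[i] for i in chosen], [labels[i] for i in chosen]


def undersample(data, labels, rng=None):
    """Drop examples (drawn without replacement) until every class matches the smallest one."""
    rng = rng or random.Random()
    groups = _group_by_label(labels)
    target = min(len(idx) for idx in groups.values())
    chosen = []
    for idx in groups.values():
        chosen.extend(rng.sample(idx, target))
    rng.shuffle(chosen)
    return [data[i] for i in chosen], [labels[i] for i in chosen]
```

With 7 examples of one class and 3 of another, `oversample` would return 7 of each and `undersample` would return 3 of each.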
I guess a nice thing about using `upsample` and `downsample` is that it avoids confusion with the `oversampling` and `undersampling` in signal processing. And ML is used a lot in signal processing (it is what everyone else in my lab is doing...).
It is probably worth looking to the literature. The SMOTE paper uses "under-sampling" and "over-sampling": https://www.jair.org/media/953/live-953-2037-jair.pdf
It is my personal opinion that having multiple names for the same thing is confusing. (If I were to put in aliases, I would make them just throw errors saying "You may have intended ....") But I feel it is beyond the scope of this PR to make points on that level of design.
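For what it's worth, the "alias that just throws" idea could look roughly like this in Python. This is a hypothetical sketch: `_rejected_alias` and the exact message wording are made up for illustration, and the choice of `NameError` is an assumption.

```python
def _rejected_alias(alias, intended):
    """Build a stub that always raises, pointing the caller at the canonical name."""
    def stub(*args, **kwargs):
        raise NameError(
            f"`{alias}` is not provided. You may have intended `{intended}`."
        )
    stub.__name__ = alias
    return stub


# The alternative spellings exist only to redirect users; they never do any work.
upsample = _rejected_alias("upsample", "oversample")
downsample = _rejected_alias("downsample", "undersample")
```

That way there is still exactly one working name per operation, but people arriving with the other vocabulary get a hint instead of a bare "not defined" error.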
Thanks! I'll merge as is. I am currently working on integrating MLLabelUtils as well as having a standard way of accessing "targets" of some data, so it is quite likely I'll adapt parts of your code. I'll ping you when I've pushed some changes.
One thing I can't quite work out is how to fit in oversampling via synthesis. I think it might require its own method.
I kind of want a way to do it generically, so you could synthesize new examples by (for example) training a GMM on the examples you have of the class, and then sampling from that.
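A rough Python sketch of what "generic" could mean here: the oversampler takes a `fit` function that turns a class's examples into a sampler, so the generative model is pluggable. Everything named below is hypothetical. To keep the example dependency-free it fits a single diagonal Gaussian per class as a stand-in for a GMM; a real GMM could be swapped in behind the same `fit` interface.

```python
import random
import statistics
from collections import Counter


def fit_diagonal_gaussian(examples):
    """Fit an independent Gaussian per feature and return a sampler.

    Stand-in for a GMM: any function of the shape examples -> sampler(rng) would work.
    """
    columns = list(zip(*examples))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]

    def sample(rng):
        return tuple(rng.gauss(m, s) for m, s in zip(means, stds))

    return sample


def oversample_by_synthesis(data, labels, fit=fit_diagonal_gaussian, rng=None):
    """Add synthetic examples until every class matches the largest class."""
    rng = rng or random.Random()
    counts = Counter(labels)
    target = max(counts.values())
    new_data, new_labels = list(data), list(labels)  # originals are kept as-is
    for label, count in counts.items():
        if count == target:
            continue
        members = [x for x, y in zip(data, labels) if y == label]
        sampler = fit(members)  # model trained only on this class
        for _ in range(target - count):
            new_data.append(sampler(rng))
            new_labels.append(label)
    return new_data, new_labels
```

Unlike plain resampling, this returns observations that are not in the original data, which is probably why it does not fit the index-based approach and might need its own method.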