geneura-papers / 2017-GPRuleRefinement

Repository for the GPRuleRefinement paper, to be sent to a journal.
Artistic License 2.0

Discussion about the imbalance in the dataset #14

Closed: unintendedbear closed this issue 7 years ago

unintendedbear commented 8 years ago

Dear @zeinebchelly, @JJ, and @fergunet,

The dataset we're working with has the following distribution of patterns per class:

45856 are GRANTED
3350 are STRONGDENY
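A quick check of those numbers (plain Python, just to make the split and the ratio explicit):

```python
total = 45856 + 3350    # 49206 patterns overall
print(45856 / total)    # 0.9319 -> ~93% GRANTED
print(3350 / total)     # 0.0681 -> ~7% STRONGDENY
print(45856 / 3350)     # 13.69 -> roughly a 1:14 class ratio
```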

This means a 93% - 7% split :( What do you think can be done? Does this affect GP in the same way it affects classification?

JJ commented 8 years ago

Training sets have to be balanced, so you'll have to sample randomly from the biggest class until it is as small as the smallest one.
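Something like this minimal pandas sketch (the DataFrame and the `Class` column name are hypothetical; adapt to however the patterns are actually stored):

```python
import pandas as pd

def undersample(df, label_col="Class", seed=42):
    """Randomly subsample every class down to the size of the smallest one."""
    smallest = df[label_col].value_counts().min()
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=smallest, random_state=seed)))
```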

unintendedbear commented 8 years ago

Ok, so we have to start the experiments all over again, great!

Anyway, what about the second question: does it affect GP in exactly the same way? How?

unintendedbear commented 8 years ago

BTW, if we do what you propose, @JJ, we will lose almost all of the information in the dataset...

unintendedbear commented 8 years ago

Hi all,

Based on the results of this paper: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6029340&tag=1, for the ratio of our data, which is 1:13 (almost 1:14), the best fitness function for dealing with such imbalance is WMW (Wilcoxon–Mann–Whitney). So I will implement this and see what changes :)
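For reference, a minimal sketch of the plain WMW statistic as a fitness value (this is the pairwise-ranking form, equivalent to an AUC estimate; the exact GP fitness formulation in the paper may differ):

```python
import numpy as np

def wmw_fitness(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs that the classifier's
    scores rank correctly; ties count half. Equivalent to an AUC estimate."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]  # shape (n_pos, 1)
    neg = np.asarray(scores_neg, dtype=float)[None, :]  # shape (1, n_neg)
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)
```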

zeinebchelly commented 8 years ago

:) Good


7ossam81 commented 7 years ago

Hi all,

I just came across this issue and I wanted to offer some advice :)

With such a highly imbalanced class distribution, I prefer to go with oversampling the data rather than undersampling. With oversampling you can increase the share of the rare class either randomly or by using an established oversampling technique like SMOTE. I prefer oversampling over undersampling because undersampling can throw information away. With SMOTE you can control the percentage of new samples to create for the small class.
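For instance, a minimal sketch using the imbalanced-learn library (API as in current imbalanced-learn), with toy stand-ins for your features and labels; only the shapes and the label names are assumed:

```python
from collections import Counter
import numpy as np
from imblearn.over_sampling import SMOTE

# Toy stand-in for the real data, mimicking the GRANTED/STRONGDENY imbalance.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array(["GRANTED"] * 930 + ["STRONGDENY"] * 70)

# sampling_strategy=0.5 synthesizes STRONGDENY samples until that class
# reaches half the size of GRANTED; 1.0 would fully balance the two.
X_res, y_res = SMOTE(sampling_strategy=0.5, random_state=42).fit_resample(X, y)
print(Counter(y_res))  # Counter({'GRANTED': 930, 'STRONGDENY': 465})
```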

Hope this helps with the issue. Please let me know if you need any help with this; I have good experience dealing with highly imbalanced datasets :)

unintendedbear commented 7 years ago

Thank you very much @7ossam81 :D

I also prefer oversampling. The thing is that we will have to set this problem aside for this paper, but I will implement it for the thesis.

Actually, the imbalance is not a problem when we encode 1 individual as 1 set of rules, but it is really noticeable when we make 1 individual = 1 rule (we have to run the algorithm once per class, and the results for the majority class are, obviously, far better than the results for the minority class). So I will definitely go for your suggestion once I've sent this paper off and move on to the next one/the thesis, where we address this problem.

What do you think? Also pinging @JJ and @fergunet.

JJ commented 7 years ago

It's a great suggestion for follow-up work, but for the time being it's best to finish this paper...