ai-se / SMOTE


EXP on ["_Naive", "_Smote", "_TunedLearner", "_TunedSmote"] #9

Closed WeiFoo closed 8 years ago

WeiFoo commented 8 years ago

Run

    from random import seed  # assuming seed() is random.seed; imported so the snippet runs

    learners = ['naive_bayes']
    methods = ["_Naive", "_Smote", "_TunedLearner", "_TunedSmote"]
    for feature_num in [100, 400, 700, 1000]:
        for l in learners:
            for m in methods:
                seed(1)  # reset the seed so every method sees the same folds
                # feature_num is presumably consumed inside the cross-validation setup
                ten_folds_cross_valuation(l, m)

"_TunedLearner" and"_TunedSmote" happened in each fold

SMOTE params:

  • neighbors: k = [2, 15]; the default is 5.
  • over-sampling size for each minority class (there are multiple minority classes): num = [10, max_num_majority]. In the SMOTE paper, num can be 2x~5x the original size; here I set the range from 10 up to the number of majority-class instances.
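For reference, a minimal sketch of the SMOTE step those two parameters control (the function below is my own illustration using numpy/scikit-learn, not the repo's code; k and num are the knobs from the ranges above):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote(minority, num, k=5, rng=None):
        """Create `num` synthetic samples from the n x d `minority` array by
        interpolating seed points toward one of their k nearest neighbors."""
        rng = rng or np.random.default_rng(1)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
        # each point's k nearest neighbors (column 0 is the point itself)
        nbrs = nn.kneighbors(minority, return_distance=False)[:, 1:]
        out = np.empty((num, minority.shape[1]))
        for i in range(num):
            s = rng.integers(len(minority))   # pick a random minority seed point
            n = rng.choice(nbrs[s])           # pick one of its k neighbors
            gap = rng.random()                # interpolation factor in [0, 1]
            out[i] = minority[s] + gap * (minority[n] - minority[s])
        return out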

NB params:

  • alpha: [0.0, 1.0], the additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing); default: 1.0.
  • fit_prior: [False, True], whether to learn class prior probabilities or not; if False, a uniform prior is used.
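These two ranges map directly onto scikit-learn's MultinomialNB constructor, assuming that is the naive_bayes implementation in use:

    from sklearn.naive_bayes import MultinomialNB

    # one candidate drawn from the tuning ranges above:
    # alpha sampled from [0.0, 1.0], fit_prior from {False, True}
    clf = MultinomialNB(alpha=0.5, fit_prior=False)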

DE:

anime.txt

Job submitted to the HPC.

WeiFoo commented 8 years ago

Why does tune_smote run so slowly?

Theoretically, we need to run 10 x 10 x 10 = 1000 rounds of SMOTE + fitting the learner + predicting (10 folds x a DE population of np = 10 x ~10 DE generations on average).

For HPC

Based on previous experience, a 4-core job starts running immediately, while an 8-core job waits in the queue so long that I have to kill it.

rahlk commented 8 years ago

Have you tried multiprocessing? It might be the answer to our time issues. Specifically:

1st: 10-fold cross evaluation (Zhe used 25); we need to tune SMOTE on each fold's training data, so 10 times in total.

Try to parallelize the 10 crossvals on 10 workers. Each crossval can run concurrently, so this will produce a massive speedup (if you use the HPCs, you may even get ~10x speedups).
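A minimal sketch of that idea, where run_one_fold is a hypothetical helper standing in for the real per-fold work (tune SMOTE on the fold's training data, fit the learner, return the fold's score):

    from multiprocessing import Pool

    def run_one_fold(fold_id):
        # hypothetical helper: tune SMOTE on this fold's training data,
        # fit the NB learner, predict, and return the fold's score
        return fold_id  # stand-in result so the sketch runs

    if __name__ == '__main__':
        with Pool(processes=10) as pool:  # one worker per fold
            scores = pool.map(run_one_fold, range(10))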

  • 2nd: in DE, we have num_population. If we follow the rule of np = 10 * num_variables, then with 28 variables here (k, plus the oversampling sizes of the 27 minority classes) we would have to use np = 280. That is not acceptable because evaluating each candidate means calling SMOTE() to generate new training data, fitting the NB learner, predicting labels, and finally scoring that candidate: a lot of work and a lot of time (30~60 seconds per evaluation, depending on data size). So I decided to use np = 10.
  • 3rd: if the frontier keeps improving, we have to run more than 5 generations, about 10 on average. So, theoretically, we need to run 10 x 10 x 10 = 1000 rounds of SMOTE + fitting the learner + predicting (10 folds x np = 10 x ~10 generations).

A parallel DE might be a better option; we have a working version of a parallel DE, and it is ridiculously fast on HPCs.
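A sketch of what parallel DE means here (my own simplified DE/rand/1/bin illustration, not the lab's implementation; evaluate is a stand-in for the real SMOTE + NB fitness, and the f/cr values are placeholders):

    import numpy as np
    from multiprocessing import Pool

    def evaluate(candidate):
        # stand-in fitness; the real version would run SMOTE with the
        # candidate's params, fit NB, predict, and return the F-score
        return -float(np.sum((candidate - 0.5) ** 2))

    def de_generation(pop, scores, pool, f=0.75, cr=0.3, lo=0.0, hi=1.0):
        """One DE generation, with all candidate evaluations fanned out
        across the pool; that is where the speedup comes from."""
        n, dim = pop.shape
        trials = np.empty_like(pop)
        for i in range(n):
            a, b, c = pop[np.random.choice(n, 3, replace=False)]
            mutant = np.clip(a + f * (b - c), lo, hi)
            cross = np.random.rand(dim) < cr
            trials[i] = np.where(cross, mutant, pop[i])
        trial_scores = pool.map(evaluate, list(trials))  # parallel evaluation
        for i in range(n):
            if trial_scores[i] > scores[i]:              # keep the better one
                pop[i], scores[i] = trials[i], trial_scores[i]
        return pop, scores

With np = 10 candidates and 10 worker processes, each generation then costs roughly one fitness evaluation of wall-clock time instead of ten.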

Based on previous experience, a 4-core job starts running immediately, while an 8-core job waits in the queue so long that I have to kill it.

There are several workarounds for this.

  1. Check out bqueues -u <unity-id>; it tells you the various queues you can use. See the attached screenshot (screen shot 2015-11-30 at 11 16 01 pm).
  2. Use reasonable wall-clock limits; -W 6000 is too high. See the example below.
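For instance, a more modest LSF submission might look like this (the queue name, resource numbers, and script name are placeholders):

    bsub -q standard -n 4 -W 600 -o smote.%J.out python run_exp.py

Here -n asks for 4 slots and -W caps the runtime at 600 minutes, in line with the 4-core observation above.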

Combine Multiprocessing with HPCs to get massive speedups.

WeiFoo commented 8 years ago

@rahlk Thanks a lot!!! I will switch to multiprocessing and modify my code accordingly. Very good suggestion! @rahlk One more thing: I already use the mpi4py module to parallelize my Python code. Would it cause trouble or low-level conflicts with the multiprocessing module? Any idea?

WeiFoo commented 8 years ago

Results

rank ,                                          name ,    med   ,  iqr 
----------------------------------------------------
   1 ,                    NB_Naive_100_mean_weighted ,      44  ,    13 (-*       --    |              ), 0.42,  0.43,  0.44,  0.56,  0.59
   2 ,                   NB_Naive_1000_mean_weighted ,      59  ,     7 (       -- *  --|-             ), 0.53,  0.57,  0.59,  0.63,  0.69
   2 ,                    NB_Naive_400_mean_weighted ,      60  ,     8 (        -  *  -|--            ), 0.54,  0.57,  0.60,  0.64,  0.70
   2 ,                    NB_Naive_700_mean_weighted ,      63  ,     3 (             *-|---           ), 0.59,  0.60,  0.63,  0.63,  0.71
   2 ,               NB_TunedSmote_100_mean_weighted ,      65  ,     8 (       ----    * -            ), 0.53,  0.60,  0.66,  0.69,  0.70
   2 ,                    NB_Smote_100_mean_weighted ,      65  ,     9 (     -------   *-             ), 0.50,  0.61,  0.66,  0.68,  0.69
   2 ,             NB_TunedLearner_100_mean_weighted ,      67  ,     7 (     --------- |*--           ), 0.50,  0.65,  0.67,  0.69,  0.72
   3 ,                    NB_Smote_400_mean_weighted ,      79  ,     2 (               |  ----- *     ), 0.70,  0.79,  0.80,  0.80,  0.81
   3 ,               NB_TunedSmote_400_mean_weighted ,      79  ,     3 (               |    --- *     ), 0.73,  0.79,  0.80,  0.80,  0.81
   3 ,             NB_TunedLearner_400_mean_weighted ,      80  ,     1 (               |   ----- *    ), 0.72,  0.80,  0.81,  0.82,  0.83
   4 ,               NB_TunedSmote_700_mean_weighted ,      82  ,     1 (               |       -- *   ), 0.78,  0.82,  0.83,  0.84,  0.85
   4 ,                    NB_Smote_700_mean_weighted ,      84  ,     2 (               |       -- *   ), 0.78,  0.82,  0.84,  0.84,  0.84
   4 ,              NB_TunedSmote_1000_mean_weighted ,      84  ,     2 (               |       --- *  ), 0.78,  0.83,  0.84,  0.85,  0.86
   4 ,             NB_TunedLearner_700_mean_weighted ,      85  ,     2 (               |        ---*- ), 0.80,  0.85,  0.85,  0.87,  0.87
   4 ,                   NB_Smote_1000_mean_weighted ,      84  ,     2 (               |        ---*- ), 0.79,  0.84,  0.85,  0.86,  0.87
   4 ,            NB_TunedLearner_1000_mean_weighted ,      86  ,     1 (               |       -----* ), 0.79,  0.86,  0.86,  0.88,  0.89

NB_TunedSmote_1000_mean_weighted means: the NB learner with tuned SMOTE, on 1000 features, scored by the weighted mean of the per-class F-measures.

Time: the whole run takes 26 hours!

Observation

azhe825 commented 8 years ago

0.8 is great. Two things: 1. I prefer the unweighted_mean of the F-measure; 2. do you have the results for the oversampling rates after tuning?
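For reference, the two aggregations being contrasted, sketched with scikit-learn's f1_score on toy labels:

    from sklearn.metrics import f1_score

    y_true = [0, 0, 0, 0, 1, 2]  # toy labels with a large majority class
    y_pred = [0, 0, 0, 1, 1, 2]

    # weighted mean: classes contribute in proportion to their support
    print(f1_score(y_true, y_pred, average='weighted'))
    # unweighted (macro) mean: every class counts equally, so minority
    # classes are not drowned out by the majority class
    print(f1_score(y_true, y_pred, average='macro'))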

rahlk commented 8 years ago

@WeiFoo With multiprocessing? Or without? Also, Parallel DE?

WeiFoo commented 8 years ago

I think we discussed this well, @rahlk @azhe825, and I will close it.