ai-se / Caret

compare Caret with DE

reproduce Canadian results---test on tuning data #24

Open WeiFoo opened 8 years ago

WeiFoo commented 8 years ago

Compare my results with theirs

I used 20 iterations instead of 100 to get a quick result, and the seed is different (I forgot to set the same seed as theirs).

Their C5.0 raw results with 100 iterations

##     optimize   default          system improvement
## 1  0.6812443 0.5276690             JM1  0.15357536
## 2  0.7718680 0.4831502             PC5  0.28871785
## 3  0.6817029 0.4994888       camel-1.2  0.18221415
## 4  0.8179451 0.4536583          prop-1  0.36428682
## 5  0.8545482 0.4823125          prop-2  0.37223573
## 6  0.7467892 0.4786966          prop-3  0.26809259
## 7  0.7624442 0.4425753          prop-4  0.31986892
## 8  0.7533058 0.4630926          prop-5  0.29021322
## 9  0.7626922 0.6653037       xalan-2.5  0.09738852
## 10 0.8443650 0.7655026       xalan-2.6  0.07886231
## 11 0.8401523 0.4243956     eclipse-2.0  0.41575663
## 12 0.7895530 0.4386999     eclipse-2.1  0.35085306
## 13 0.8010375 0.5310208     eclipse-3.0  0.27001672
## 14 0.8002733 0.5397900 eclipse34_debug  0.26048335
## 15 0.9685388 0.9149021   eclipse34_swt  0.05363675
## 16 0.8139885 0.5447301             jdt  0.26925838
## 17 0.7917020 0.4136083           mylyn  0.37809369
## 18 0.7260917 0.4600913             pde  0.26600047

My C5.0 raw results with 20 iterations (note: the row position of camel is different)

       tuned   default   data_set improvement
1  0.6808698 0.4719743        jm1  0.20889548
2  0.7836656 0.5204229        pc5  0.26324270
3  0.8196309 0.4762283      prop1  0.34340262
4  0.8582901 0.5075686      prop2  0.35072151
5  0.7490999 0.4790891      prop3  0.27001083
6  0.7612573 0.4313464      prop4  0.32991097
7  0.7497757 0.4708872      prop5  0.27888844
8  0.6750136 0.5221968      camel  0.15281680
9  0.7741023 0.6605863    xalan25  0.11351602
10 0.8485274 0.7654055    xalan26  0.08312183
11 0.8130751 0.5425995  platform2  0.27047558
12 0.7465420 0.4759050 platform21  0.27063708
13 0.7643491 0.5623480  platfrom3  0.20200103
14 0.8043490 0.5504841    debug34  0.25386495
15 0.9672185 0.9151623      swt34  0.05205627
16 0.8057795 0.5890076        jdt  0.21677182
17 0.7861859 0.4349807      mylyn  0.35120518
18 0.7288896 0.4620439        pde  0.26684567

Their boxplots

(boxplot image attachment)

My box plots

(boxplot image attachment)

WeiFoo commented 8 years ago

@timm do you think these results are similar enough? Or should I repeat 100 times as they did, and set the same seed?

timm commented 8 years ago

so this is repeating their method, where we test on the training data?

if yes, then we need 3 box plots, side by side

1) their method's results <== reusing the same seed for training and test
2) our repeat, testing on training <== reusing the same seed for training and test
3) our repeat, testing on hold out <== different seeds training and test

t
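(A minimal sketch of the seed distinction between 2) and 3), assuming a toy shuffle-and-split helper; the data, split ratios, and seed values below are all illustrative, not from the actual experiment:)

import random

def split(rows, seed):
    # Illustrative shuffle-and-split: 60% train, 20% tune, 20% test.
    rnd = random.Random(seed)
    rows = rows[:]
    rnd.shuffle(rows)
    n = len(rows)
    return rows[:int(0.6 * n)], rows[int(0.6 * n):int(0.8 * n)], rows[int(0.8 * n):]

rows = list(range(100))                        # stand-in for one defect data set

# 2) reuse the same seed, so every run sees the same split
train_a, tune_a, test_a = split(rows, seed=1)

# 3) a different seed for the hold-out run, so the test rows change
train_b, tune_b, test_b = split(rows, seed=2)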

timm commented 8 years ago

and you don't need big chunky box plots. my sideways ascii box plots will suffice
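(This is not timm's actual code, just a rough sketch of the kind of sideways ASCII box plot meant here, assuming a list of scores in [0, 1]:)

def ascii_boxplot(xs, width=40):
    # Crude sideways box plot for scores in [0, 1]:
    # '-' spans min..max, '=' spans the inter-quartile range, '|' marks the median.
    xs = sorted(xs)
    n = len(xs)
    q1, q2, q3 = xs[n // 4], xs[n // 2], xs[(3 * n) // 4]
    def pos(v):
        return int(round(v * (width - 1)))
    row = [' '] * width
    for i in range(pos(xs[0]), pos(xs[-1]) + 1):
        row[i] = '-'
    for i in range(pos(q1), pos(q3) + 1):
        row[i] = '='
    row[pos(q2)] = '|'
    return ''.join(row)

print(ascii_boxplot([0.68, 0.77, 0.68, 0.82, 0.85, 0.75, 0.76, 0.75, 0.76, 0.84]))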

WeiFoo commented 8 years ago

sure, do you think I need to repeat 100 times, or is 20 enough?

timm commented 8 years ago

how many repeats do they do?

WeiFoo commented 8 years ago

they repeat 100 times.

timm commented 8 years ago

1) their method's results <== reusing the same seed for training and test
2) our repeat, testing on training <== reusing the same seed for training and test
3) our repeat, testing on hold out <== different seeds training and test

t

WeiFoo commented 8 years ago

1) their method's results <== reusing the same seed for training and test

This result can be obtained from their appendix; they already ran that ==> result link

2) our repeat, testing on training <== reusing the same seed for training and test

My understanding is that this means using my own code to reproduce their results.

3) our repeat, testing on hold out <== different seeds training and test

This is to do tuning in the right way...

So my concern is still how many repeats I have to run: 100 or 20?

WeiFoo commented 8 years ago

Ignore that. I will go with 100 to do exactly the same as they did.

WeiFoo commented 8 years ago

@timm

My 3rd experiment looks like this:

  1. tune(train_data, tune_data), repeat 100 times, get the best parameters.
  2. predict(train_data, test_data), repeat 100 times, report results (median or maximum???)

Since the train_data, tune_data, and test_data are sampled from the original data on each repeat, their scheme repeats 100 times. I think I have to follow a similar approach.

Question: according to their R code, max(optimize$results$ROC), they picked the maximum value from the 100 repeats as the best result for that data set. For me, when I run my 3rd experiment (mentioned above), do I pick the median or the maximum for that data set? I prefer to use the median; any comments?
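(To make the two options concrete, a tiny illustrative snippet; the ROC values are made up:)

from statistics import median

roc_scores = [0.68, 0.71, 0.66, 0.73, 0.69]   # made-up ROC scores from repeated tuning runs

best_roc   = max(roc_scores)      # their scheme: keep the single best repeat
median_roc = median(roc_scores)   # the alternative: keep the typical repeat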

timm commented 8 years ago

they picked the maximum value from the 100 repeats as the best result for that data set.

if "best" does not reference the test set that i would say its valid to use best.

does that mean your DE results (in the journal paper) could actually get BETTER results?

WeiFoo commented 8 years ago

if "best" does not reference the test set that i would say its valid to use best.

here, you mean the "best" in the tuning process; that makes sense.

does that mean your DE results (in the journal paper) could actually get BETTER results?

I used the "best" for DE, but the problem is we didn't repeat tuning several times for the same data set, then DE would suffer from randomness(that's true due to the randomly initialized popultions and evolutions afterwards), that means our previous tuning scheme could be improved by some technique, to make the optimized parameters returned from DE have more stable performance.

timm commented 8 years ago

what news?

WeiFoo commented 8 years ago

0310 results

Here, following their paper, I report the improvement of each learner from tuning over the 18 data sets.

Their results (by running their code)

 Min.   1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05444 0.18450 0.27000 0.25320 0.33750 0.39700

My reproduced results (by running my code)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05425 0.18060 0.26130 0.23730 0.27700 0.37010 
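(For reference, an illustrative snippet of how the same six-number summary can be computed outside R; the improvement values here are placeholders, not the numbers above:)

import numpy as np

improvements = [0.15, 0.29, 0.18, 0.36, 0.27, 0.05]   # placeholder per-data-set improvements

for name, value in [("Min.", np.min(improvements)),
                    ("1st Qu.", np.percentile(improvements, 25)),
                    ("Median", np.median(improvements)),
                    ("Mean", np.mean(improvements)),
                    ("3rd Qu.", np.percentile(improvements, 75)),
                    ("Max.", np.max(improvements))]:
    print(name, round(float(value), 5))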

Test on hold-out data

For each data set, I run the following experiment:

def run_experiment(data):
    tuning_scores, default_scores = [], []
    for _ in range(10):
        # split the data into test / train / tune portions for this repeat
        test_data, train_data, tune_data = generate_data(data)
        # tune 10 times and keep the best tuning found
        tunings = [tune(train_data, tune_data) for _ in range(10)]
        best_tuning = max(tunings)
        # score tuned vs. default parameters on the hold-out test data
        tuning_scores.append(test(train_data, test_data, best_tuning))
        default_scores.append(test(train_data, test_data, default_tuning))
    improve_scores = [t - d for t, d in zip(tuning_scores, default_scores)]
    return tuning_scores, default_scores, improve_scores
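(Usage sketch only; the name jm1_data is mine, standing in for the loaded JM1 rows:)

tuned_scores, default_scores, improve_scores = run_experiment(jm1_data)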
e.g. returned values for the JM1 data set:

       tuned   default improvement
1  0.6050048 0.6509117 -0.04590682
2  0.6110150 0.6383667 -0.02735172
3  0.3946786 0.3670972  0.02758144
4  0.6126379 0.3585911  0.25404672
5  0.6094692 0.3481918  0.26127737
6  0.6354691 0.6306821  0.00478704
7  0.5779321 0.6216878 -0.04375572
8  0.5883462 0.3805968  0.20774937
9  0.5950603 0.6274107 -0.03235038
10 0.6087160 0.4196369  0.18907909

There are several ways to present the results, and each seems reasonable here:

Version A

(for each data set, final improvement over the 10 repeats = median(tuned) - median(default); please refer to the example above)

    Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.032630  0.008301  0.092270  0.094890  0.156100  0.295400 

Version B

(for each data set, final improvement over the 10 repeats = median(improvement); please refer to the example above)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.033060  0.009489  0.083960  0.079480  0.123200  0.236500 

Version C, D, ...

(for each data set, final improvement over the 10 repeats = max(tuned) - max(default), and so on)

But the improvement would be even worse than in Versions A and B. If you need it, I will calculate that.
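(A short illustrative snippet of the three aggregations, using rounded values from the JM1 example above:)

from statistics import median

tuned   = [0.605, 0.611, 0.395, 0.613, 0.609, 0.635, 0.578, 0.588, 0.595, 0.609]
default = [0.651, 0.638, 0.367, 0.359, 0.348, 0.631, 0.622, 0.381, 0.627, 0.420]

version_a = median(tuned) - median(default)                  # Version A
version_b = median(t - d for t, d in zip(tuned, default))    # Version B
version_c = max(tuned) - max(default)                        # Version C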

WeiFoo commented 8 years ago

@timm, my comment is that their results are gone. When testing on hold-out data, the results I got here are similar to my previous results from 10 days back, before contacting them, as below:

C5.0:

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.025170  0.002164  0.071170  0.109000  0.218600  0.290000