ai-se / Caret

compare Caret with DE

DE tune R learner #22

Open WeiFoo opened 8 years ago

WeiFoo commented 8 years ago

Experiment Setting

*Note:

WeiFoo commented 8 years ago

Results (===>Results for each data set)

DE Improvements over 18 datasets

********** auc **********
rank ,                 name ,    med   ,  iqr 
----------------------------------------------------
   1 ,               avnnet ,       1  ,     4 (         - *---|-             ),-0.01,  0.01,  0.02,  0.04,  0.14
   1 ,                  C50 ,       2  ,    15 (        -  *   |  ---------   ),-0.03,  0.00,  0.03,  0.15,  0.30
   1 ,                 CART ,       2  ,    21 (  -------  *   |    ---       ),-0.12,  0.00,  0.02,  0.18,  0.24
End time :2016-03-05 07:13:53

My Reproduced Caret Results

C5.0:

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.025170  0.002164  0.071170  0.109000  0.218600  0.290000 

AVNNet:

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.009183  0.016080  0.033660  0.044440  0.053520  0.149100 

CART:

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.04478  0.01746  0.05063  0.10520  0.20120  0.30740 

Their Results


WeiFoo commented 8 years ago

Observation

WeiFoo commented 8 years ago

Why

WeiFoo commented 8 years ago

Takeaway

Over the past month, I did several experiments: R caret, DE tuning R learners, and DE with new tuning data sets. Compared with my first tuning work, which showed large improvements on some data sets, DE does not always perform well here. Why? The data sets, or more specifically, the data distribution, are different.

Even though the Canada paper got "very good" performance, look at their data: first, they only used the data sets with EPV >= 10 (again, EPV!!!). Second, their bootstrap sampling also changes the data distribution, even though they probably won't interpret their results in this way.

I have a strong intuition that instead of devoting all our effort to tuning the learner, we have to "tune" the data as well: choose the right data for tuning.

In terms of accuracy, what's the main problem of tuning? Overfitting. The parameters we get from tuning are, to some extent, overfitted to the tuning data. Two extreme examples: if the tuning data is totally different from the actual testing data, then you will likely get negative results, meaning tuning decreases performance and you won't trust tuning any more. If the tuning data is exactly the same as the testing data, then you get better results.

People do tuning under the assumption that the tuning data has the same distribution as the testing data. I assumed a lot during the past days, but I never checked. How do we measure whether the distributions are the same, or even whether the tuning and testing data are similar? Euclidean distance is the easiest measure that comes to mind, as in the sketch below, but I'm not sure it's a good one (need to check references).
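A rough R sketch of that check, assuming `tune_x` and `test_x` are data frames holding only the independent variables of the tuning and testing data (the names and toy values are placeholders, and this is just one naive way to score "similar distribution"):

```r
# Sketch: quantify how "similar" the tuning and testing data are.
# tune_x / test_x hold independent variables only; the toy values are placeholders.
tune_x <- data.frame(loc = rnorm(100, 50, 10), cbo = rnorm(100, 5, 2))
test_x <- data.frame(loc = rnorm(40, 80, 10),  cbo = rnorm(40, 9, 2))

# Scale both sets together so the distances are comparable across columns.
scaled <- scale(rbind(tune_x, test_x))
tune_s <- scaled[seq_len(nrow(tune_x)), , drop = FALSE]
test_s <- scaled[-seq_len(nrow(tune_x)), , drop = FALSE]

# Euclidean distance between the two centroids: the crude "same distribution?" score.
sqrt(sum((colMeans(tune_s) - colMeans(test_s))^2))

# Per-column Kolmogorov-Smirnov tests as a second opinion.
sapply(colnames(tune_s), function(col) ks.test(tune_s[, col], test_s[, col])$p.value)
```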

That's my takeaway: before tuning, look at your data, choose the right tuning data, and then apply your technique.

WeiFoo commented 8 years ago

Next Step

timm commented 8 years ago

if u think there is a methodological flaw in the canadian paper, then that is a statement

do you?

timm commented 8 years ago

also, be great to see the above as an improvement graph

image

WeiFoo commented 8 years ago

Q1:

For a data set A of size N, the training data will be a sample of size N drawn randomly with replacement from A, and the testing data will be the rows not appearing in the training data. Theoretically, 36.8% of the original data will not appear in the training data; those rows will be the testing data.

I think this method seems OK, but it also supports my idea: this way, their training and testing data somehow end up with a similar distribution.
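For the record, the 36.8% follows from each row being missed by one draw with probability (1 - 1/N), so after N draws with replacement P(never picked) = (1 - 1/N)^N, which tends to exp(-1) ≈ 0.368. A quick R simulation (a sketch, not their code) confirms it:

```r
# Sketch: check the ~36.8% out-of-bag rate of bootstrap sampling.
# Each row is missed by one draw with probability (1 - 1/N); after N draws
# with replacement, P(never picked) = (1 - 1/N)^N, which tends to exp(-1).
set.seed(1)
N <- 1000
oob_rate <- replicate(200, {
  picked <- sample(N, N, replace = TRUE)    # the bootstrap "training" sample
  length(setdiff(seq_len(N), picked)) / N   # fraction of rows never picked
})
mean(oob_rate)  # ~0.368
exp(-1)         # 0.3678794, the theoretical limit
```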

Q2:

yes, this is exactly one idea, but you mentioned it as training data;
I think it could just as well be tuning data (validation data), whatever... the idea is right.

another idea is:

one criticism may be that we don't have testing data when we do tuning. But it makes sense that we could have limited testing data ready before tuning, and this process can be modified and adjusted as more and more testing data comes in. @timm

timm commented 8 years ago

cluster both training and testing data as a whole (but we could have an extra column to differentiate them).

not big on that

choose those data sitting close to the testing data as training (or tuning) data.

yes. cluster training and test and use tunings from training clusters near test data to select what tunings to apply. note: that definition of near must not use dependent variables in testing.
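e.g. something like this rough R sketch (the `train_x` / `test_x` data frames of independent variables, k-means as the clusterer, and k = 5 are all placeholder assumptions, not a fixed recipe):

```r
# Sketch: cluster training and testing rows together on independent variables only,
# then keep the training rows that share a cluster with at least one test row.
# train_x / test_x are placeholder data frames; kmeans and k = 5 are arbitrary choices.
pick_tuning_rows <- function(train_x, test_x, k = 5) {
  both      <- scale(rbind(train_x, test_x))     # shared scaling, no dependent variables
  km        <- kmeans(both, centers = k, nstart = 10)
  train_lab <- km$cluster[seq_len(nrow(train_x))]
  test_lab  <- km$cluster[-seq_len(nrow(train_x))]
  near      <- train_lab %in% unique(test_lab)   # training rows "near" the test data
  train_x[near, , drop = FALSE]                  # only these rows are used to pick tunings
}
```

note that "near" here is computed from the independent variables only, so the test dependent variables are never consulted.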

one criticism may be that we don't have testing data when we do tuning.

so pretend that you are tuning monday, tuesday, wed, then wait for thurs, fri to test. no downside, just as long as no information from tuning goes back to training.
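e.g. a tiny sketch of that protocol (the `day` column and the toy data are placeholders):

```r
# Sketch: a time-ordered split, so nothing from the test period leaks into tuning.
# The 'day' column and the toy data are placeholders.
d <- data.frame(day = rep(c("mon", "tue", "wed", "thu", "fri"), each = 20),
                x = rnorm(100), y = rnorm(100))

tune_days <- c("mon", "tue", "wed")          # what we have "so far"
tune_data <- d[d$day %in% tune_days, ]       # tune / validate on this only
test_data <- d[!(d$day %in% tune_days), ]    # arrives later, never seen during tuning
```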