HuyTu7 opened this issue 6 years ago
- why an eval budget of 50, 100? what do the smac people say about that?
This is to show how the effectiveness of the optimizer depends on the number of evaluations.
- what is DTC?
Decision trees
- results should show raw values, not ranks, like Fig. 4 of https://arxiv.org/pdf/1804.00626.pdf
Roger that.
- results should not be presented across N data sets; need specifics per data set.
ok
- when do we get to see FLASH results?
Friday.
- need to see the runtimes (or #evals) of each method... so we can assess the performance gains versus the computational effort
Will do. Ken is on top of it.
the above sounds great
but you dodged one question about your eval budget:
what do the smac people say about that?
The SMAC people say that the number of evaluations is not a good stopping criterion in a real-world setting. Instead, the stopping rule should be defined in terms of time. This is why an evaluation-based stopping criterion is not even documented.
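For context, here is a minimal sketch of the two stopping rules side by side (plain random search with hypothetical `objective` and `sample_config` callables; this is not the SMAC API, just an illustration of the criteria):

```python
import time

def tune(objective, sample_config, eval_budget=None, time_budget_s=None):
    """Maximize objective(config) under either an evaluation-count budget
    (what these experiments use) or a wall-clock budget (what the SMAC
    authors recommend for real-world settings)."""
    start, evals = time.time(), 0
    best_cfg, best_score = None, float("-inf")
    while True:
        if eval_budget is not None and evals >= eval_budget:
            break
        if time_budget_s is not None and time.time() - start >= time_budget_s:
            break
        cfg = sample_config()       # draw one candidate configuration
        score = objective(cfg)      # one "evaluation" = build + test one model
        evals += 1
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score, evals

# tune(obj, sampler, eval_budget=50)      # fixed number of evaluations
# tune(obj, sampler, time_budget_s=3600)  # fixed wall-clock budget
```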
Each image file in the folder for an n_evaluations result is the Scott-Knott test chart for F1, precision, and time, respectively (a simplified sketch of the ranking procedure is below).
25 evaluations results
50 evaluations results
Summary of the results: consolidated results.
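For reference, a simplified sketch of the Scott-Knott ranking behind those charts (not the exact code that produced them): treatments are sorted by median and recursively split at the point that maximizes the between-group sum of squares; the real procedure also applies a statistical check (e.g. bootstrap plus an effect-size test) before accepting a split, which is reduced here to a simple `epsilon` threshold on the gain.

```python
import statistics

def scott_knott(treatments, epsilon=0.01):
    """treatments: {name: [scores...]}. Returns {name: rank}; rank 0 is the
    lowest-median group (reverse the sort for metrics where higher is better)."""
    ordered = sorted(treatments.items(), key=lambda kv: statistics.median(kv[1]))
    ranks, next_rank = {}, [0]

    def best_split(groups):
        values = [v for _, vs in groups for v in vs]
        mu, n = statistics.mean(values), len(values)
        best_gain, best_cut = 0.0, None
        for cut in range(1, len(groups)):
            left = [v for _, vs in groups[:cut] for v in vs]
            right = [v for _, vs in groups[cut:] for v in vs]
            gain = (len(left) * (statistics.mean(left) - mu) ** 2 +
                    len(right) * (statistics.mean(right) - mu) ** 2) / n
            if gain > best_gain:
                best_gain, best_cut = gain, cut
        return best_cut if best_gain > epsilon else None

    def recurse(groups):
        cut = best_split(groups) if len(groups) > 1 else None
        if cut is None:
            for name, _ in groups:          # whole cluster shares one rank
                ranks[name] = next_rank[0]
            next_rank[0] += 1
        else:
            recurse(groups[:cut])
            recurse(groups[cut:])

    recurse(ordered)
    return ranks
```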
Notes:
FFT? FLASH?
The y-axis represents the median and the variance of the Scott-Knott ranks (lower is better). This is computed across the 10 defect-prediction projects.
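Concretely, the plotted values can be summarized like this (a sketch only; the per-project ranking function is assumed, e.g. the `scott_knott` sketch above, and the layout of `per_project_scores` is hypothetical):

```python
import statistics

def summarize_ranks(per_project_scores, rank_fn):
    """per_project_scores: {project: {method: [scores...]}}; rank_fn: e.g. scott_knott.
    Returns {method: (median_rank, std_rank)} aggregated over the 10 projects."""
    ranks = {}
    for scores_by_method in per_project_scores.values():
        for method, rank in rank_fn(scores_by_method).items():
            ranks.setdefault(method, []).append(rank)
    return {m: (statistics.median(rs), statistics.pstdev(rs))
            for m, rs in ranks.items()}
```

Since the bars are then drawn as median ± one standard deviation, they can dip below zero even though ranks are non-negative, which matches the note further down.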
For 25 evals: (1) the default's median F1 is similar to or better than that of the rest of the optimizers, (2) SMAC is an exception in all cases, but its IQR is high, (3) the precision of the default is worse than that of the optimized learners, and (4) SVM should not be used in defect prediction, whereas random forest performs the best.
Concern: the IQR of the ranks is worryingly large.
For 50 evals: (1) the default setting is still better than or similar to the other optimized learners, and the behavior is similar to 25 evals, (2) the difference in precision is more pronounced, (3) random forest is definitely better in precision, whereas DTC and RF are equally good in F1, and (4) interestingly, SMAC is a better optimizer for F1 (not precision). Might this indicate some bug in autosklearn?
For 100 evals: (1) the variance of the ranks has clearly decreased, and so have the differences in optimizer performance. With 25 and 50 evals, SMAC was the frontrunner, whereas with 100 evals all the optimizers perform the same.
Note:
The variance is plotted as a standard deviation (which is symmetric about the median), hence the bars dip below zero in a few cases.
In some SVM cases SMAC did not terminate and we could not finish those runs, hence the good performance of SMAC there is questionable.
Precision is easier to optimize than F1---which is consistent with Fu et al. (IST 2016).