HuyTu7 opened this issue 6 years ago
- why an eval budget of 50, 100? what do the smac people say about that?
This is to show how the effectiveness of the optimizer depends on the number of evaluations.
- what is DTC?
Decision trees
- results should show raw values, not ranks, like Fig. 4 of https://arxiv.org/pdf/1804.00626.pdf
Roger that.
- results should not be presented across N data sets; need specifics per data set.
ok
- when do we get to see FLASH results?
Friday.
- need to see the runtimes (or #evals) of each method... so we can assess the performance gains versus the computational effort
Will do. Ken is on top of it.
the above sounds great
but you dodged one question about your eval budget:
what do the smac people say about that?
The SMAC people say that the number of evaluations is not a good stopping criterion in a real-world setting. Instead, the stopping rule should be defined in terms of time. This is why an evaluation-based stopping criterion is not even documented.
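For context, here is a minimal sketch of the two stopping rules side by side (plain random search with hypothetical `objective` and `sample_config` callables; this is not the SMAC API, just an illustration of the criteria):

```python
import time

def tune(objective, sample_config, eval_budget=None, time_budget_s=None):
    """Maximize objective(config) under either an evaluation-count budget
    (what these experiments use) or a wall-clock budget (what the SMAC
    authors recommend for real-world settings)."""
    start, evals = time.time(), 0
    best_cfg, best_score = None, float("-inf")
    while True:
        if eval_budget is not None and evals >= eval_budget:
            break
        if time_budget_s is not None and time.time() - start >= time_budget_s:
            break
        cfg = sample_config()       # draw one candidate configuration
        score = objective(cfg)      # one "evaluation" = build + test one model
        evals += 1
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score, evals

# tune(obj, sampler, eval_budget=50)      # fixed number of evaluations
# tune(obj, sampler, time_budget_s=3600)  # fixed wall-clock budget
```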
Each image file in the folder for an n_evaluations result is the Scott-Knott test chart for F1, precision, and time, respectively (a simplified sketch of the ranking procedure is below).
25 evaluations results
50 evaluations results
Summary of the results: consolidated results.
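For reference, a simplified sketch of the Scott-Knott ranking behind those charts (not the exact code that produced them): treatments are sorted by median and recursively split at the point that maximizes the between-group sum of squares; the real procedure also applies a statistical check (e.g. bootstrap plus an effect-size test) before accepting a split, which is reduced here to a simple `epsilon` threshold on the gain.

```python
import statistics

def scott_knott(treatments, epsilon=0.01):
    """treatments: {name: [scores...]}. Returns {name: rank}; rank 0 is the
    lowest-median group (reverse the sort for metrics where higher is better)."""
    ordered = sorted(treatments.items(), key=lambda kv: statistics.median(kv[1]))
    ranks, next_rank = {}, [0]

    def best_split(groups):
        values = [v for _, vs in groups for v in vs]
        mu, n = statistics.mean(values), len(values)
        best_gain, best_cut = 0.0, None
        for cut in range(1, len(groups)):
            left = [v for _, vs in groups[:cut] for v in vs]
            right = [v for _, vs in groups[cut:] for v in vs]
            gain = (len(left) * (statistics.mean(left) - mu) ** 2 +
                    len(right) * (statistics.mean(right) - mu) ** 2) / n
            if gain > best_gain:
                best_gain, best_cut = gain, cut
        return best_cut if best_gain > epsilon else None

    def recurse(groups):
        cut = best_split(groups) if len(groups) > 1 else None
        if cut is None:
            for name, _ in groups:          # whole cluster shares one rank
                ranks[name] = next_rank[0]
            next_rank[0] += 1
        else:
            recurse(groups[:cut])
            recurse(groups[cut:])

    recurse(ordered)
    return ranks
```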
Notes:
FFT? FLASH?
The y-axis represents the median and the variance of the Scott-Knott ranks (lower is better). This is computed across the 10 defect-prediction projects.
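Concretely, the plotted values can be summarized like this (a sketch only; the per-project ranking function is assumed, e.g. the `scott_knott` sketch above, and the layout of `per_project_scores` is hypothetical):

```python
import statistics

def summarize_ranks(per_project_scores, rank_fn):
    """per_project_scores: {project: {method: [scores...]}}; rank_fn: e.g. scott_knott.
    Returns {method: (median_rank, std_rank)} aggregated over the 10 projects."""
    ranks = {}
    for scores_by_method in per_project_scores.values():
        for method, rank in rank_fn(scores_by_method).items():
            ranks.setdefault(method, []).append(rank)
    return {m: (statistics.median(rs), statistics.pstdev(rs))
            for m, rs in ranks.items()}
```

Since the bars are then drawn as median ± one standard deviation, they can dip below zero even though ranks are non-negative, which matches the note further down.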
For 25 evals: (1) the default's median F1 is similar to or better than that of the rest of the optimizers, (2) SMAC is an exception in all cases, but its IQR is high, (3) the precision of the default is worse than that of the optimized learners, and (4) SVM should not be used in defect prediction, whereas random forest performs the best.
Concern: the IQR of the ranks is worryingly large.
For 50 evals: (1) the default setting is still better than or similar to the other optimized learners, and the behavior is similar to 25 evals, (2) the difference in precision is more pronounced, (3) random forest is definitely better in precision, whereas DTC and RF are equally good in F1, and (4) interestingly, SMAC is a better optimizer for F1 (not precision). Might this indicate some bug in autosklearn?
For 100 evals: (1) the variance of the ranks has clearly decreased, and so have the differences in optimizer performance. With 25 and 50 evals, SMAC was the frontrunner, whereas with 100 evals all the optimizers perform the same.
Note:
The variance is plotted as a standard deviation (which is symmetric about the median), hence the bars dip below zero in a few cases.
In some SVM cases SMAC did not terminate and we could not finish those runs, hence the good performance of SMAC there is questionable.
Precision is easier to optimize than F1---which is consistent with Fu et al. (IST 2016).