Addresses #64. evaluate_btb.py now runs 10 independent trials using the specified tuner/selector/dataset combination. It then compares the mean, minimum, and standard deviation of the AUC over those 10 trials to the baseline, which is also calculated from 10 independent trials.
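The comparison described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in evaluate_btb.py: the function names `summarize_auc` and `compare_to_baseline` are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
import statistics


def summarize_auc(aucs):
    """Return the mean, minimum, and standard deviation of a list of AUC scores.

    Hypothetical helper; assumes the sample standard deviation is wanted.
    """
    return {
        "mean": statistics.mean(aucs),
        "min": min(aucs),
        "std": statistics.stdev(aucs),  # needs at least two scores
    }


def compare_to_baseline(trial_aucs, baseline_aucs):
    """Return the difference (trials minus baseline) for each summary statistic."""
    trial = summarize_auc(trial_aucs)
    baseline = summarize_auc(baseline_aucs)
    return {name: trial[name] - baseline[name] for name in trial}
```

A positive difference in mean or minimum would indicate that the trials scored higher than the baseline, while a positive difference in standard deviation would indicate more variation between trials.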
Test output:
If run IDs are passed in, the script compares the baseline to each specified run.
Test output for specified run IDs: