brandontrabucco / design-bench

Benchmarks for Model-Based Optimization

Assessing uncertainty quantification quality metrics using the `design-bench` benchmarks #7

Open sgbaird opened 2 years ago

sgbaird commented 2 years ago

I'm considering using this for some simple tests of a few different uncertainty quantification quality metrics, to see which ones better predict how successful an adaptive design scheme will be.

From a very black-box standpoint, what this requires is y_true, y_pred, and sigma (the true values, predicted values, and predicted uncertainties, respectively), plus a "notion of best" for the adaptive design task.
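For concreteness, here is a minimal sketch of the kind of metric I have in mind, assuming Gaussian predictive distributions; the arrays below are placeholders rather than output from any real model:

import numpy as np

def gaussian_nll(y_true, y_pred, sigma):
    # average negative log likelihood of y_true under N(y_pred, sigma**2)
    sigma = np.clip(sigma, 1e-12, None)  # guard against zero uncertainty
    return np.mean(0.5 * np.log(2 * np.pi * sigma ** 2)
                   + 0.5 * ((y_true - y_pred) / sigma) ** 2)

# placeholder values standing in for a real surrogate's predictions
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 2.5])
sigma = np.array([0.2, 0.3, 0.5])
print(gaussian_nll(y_true, y_pred, sigma))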

Does that seem like something feasible/easy to implement with this repository? Or do you think it would be better to look elsewhere or start from scratch?

brandontrabucco commented 2 years ago

Hi sgbaird,

Great question! For most of the benchmarking tasks in design-bench (see the note below), we use a procedure that collects a dataset of design values whose y_true spans both high-performing and low-performing values.

When we apply offline model-based optimization algorithms (released in our https://github.com/brandontrabucco/design-baselines repository) to these tasks, we typically subsample the original task dataset to hide some fraction of the high-performing designs from the optimizer. This ensures that there is headroom in the task objective function y_true(x) against which the optimizer can be evaluated.

For your purposes, you may find task.y helpful: it contains y_true(x) for every design value x in the subsampled task dataset that is typically passed to an optimizer.
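For example (a minimal sketch, assuming the default dataset settings for this task):

import design_bench

# load a task with its default, subsampled dataset
task = design_bench.make("Superconductor-RandomForest-v0")

# task.x holds the design values and task.y their ground-truth scores y_true(x)
print(task.x.shape, task.y.shape)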

In addition, if you would like to obtain the highest-performing designs and their y_true values, you can modify the dataset subsampling hyperparameters used when loading the task dataset. Below, I'll modify the parameters used to load the "Superconductor-RandomForest-v0" dataset, which would otherwise use a min_percentile of 0 and a max_percentile of 40; this effectively retrieves the held-out set of highest-performing designs:

import design_bench

# raise the percentile range from the default (0, 40) to (40, 100)
# to load the held-out set of highest-performing designs
max_percentile = 100
min_percentile = 40
task = design_bench.make(
    "Superconductor-RandomForest-v0",
    dataset_kwargs=dict(
        max_percentile=max_percentile,
        min_percentile=min_percentile))

In terms of getting y_pred and sigma, this may depend on the optimization algorithm you are using. Several of the baselines implemented in https://github.com/brandontrabucco/design-baselines include a probabilistic neural network that fits a distribution to the objective function y_true(x), which can be used to obtain y_pred and sigma for each task.
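As a rough sketch (using scikit-learn's GaussianProcessRegressor as a stand-in rather than the design-baselines models themselves), any probabilistic surrogate that predicts a mean and a standard deviation can supply y_pred and sigma, for example fit on the subsampled data and evaluated on the held-out high-performing split:

import design_bench
from sklearn.gaussian_process import GaussianProcessRegressor

# fit a simple probabilistic surrogate on the default (subsampled) dataset,
# using only a slice of the data to keep the GP fit fast
train_task = design_bench.make("Superconductor-RandomForest-v0")
model = GaussianProcessRegressor()
model.fit(train_task.x[:1000], train_task.y[:1000].ravel())

# evaluate the surrogate on the held-out high-performing designs
test_task = design_bench.make(
    "Superconductor-RandomForest-v0",
    dataset_kwargs=dict(min_percentile=40, max_percentile=100))
y_pred, sigma = model.predict(test_task.x, return_std=True)
y_true = test_task.y.ravel()

These three arrays are exactly the y_true, y_pred, and sigma you described, so any uncertainty quality metric can be computed directly from them.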

Let me know if you have any other questions!

-Brandon

NOTE:

Our HopperController suite of tasks does not use subsampling; if an estimate of optimal performance is desired for this task, one can use the performance of standard RL baselines on the Hopper-v2 MuJoCo task as a reference.

sgbaird commented 2 years ago

Fantastic! Thank you for the thorough reply. This is great.