automl / jahs_bench_201

The first collection of surrogate benchmarks for Joint Architecture and Hyperparameter Search.
https://automl.github.io/jahs_bench_201/
MIT License

Issue with evaluating multiobjective benchmarks #19

Open thchang opened 1 year ago

thchang commented 1 year ago

Hi all -- I am interested in running the multiobjective variation of this benchmark and encountered the following issues; please advise:

  1. For the hypervolume metric to be comparable across different solvers, everyone needs to use the same reference point. Therefore, https://automl.github.io/jahs_bench_201/evaluation_protocol should provide a recommended reference point for everyone to measure hypervolume against. The reference point must be a finite value for each objective that no configuration can do worse than.

    I looked up the reference point used for your reported results with the multiobjective random sampling approach: in lines 19-20 of https://github.com/automl/jahs_bench_201_experiments/blob/master/jahs_bench_201_experiments/analysis/leaderboard.py it looks like you are calculating the reference point as the minimum of all observed values. Is that correct? Such a policy would indirectly reward methods that sample very bad configurations and penalize methods that never take a "bad" evaluation (see the sketch at the end of this comment).

  2. Also in the same file, it looks like the two objectives for the multiobjective variation of the benchmarks are to maximize validation accuracy and minimize latency? It would be nice to clarify this in https://automl.github.io/jahs_bench_201/evaluation_protocol as well, as it was difficult to find.
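To make the comparability concern in point 1 concrete, here is a minimal sketch of a 2D hypervolume computation under a minimization convention; the helper function and the sample fronts below are hypothetical, not taken from the benchmark code:

```python
import numpy as np


def hypervolume_2d(points: np.ndarray, ref_point: np.ndarray) -> float:
    """Hypervolume dominated by `points` relative to `ref_point`; both objectives are minimized."""
    # Keep only points that are strictly better than the reference point in both objectives.
    pts = points[np.all(points < ref_point, axis=1)]
    if len(pts) == 0:
        return 0.0
    # Sweep along the first objective and accumulate the newly dominated rectangles.
    pts = pts[np.argsort(pts[:, 0])]
    hv, best_f2 = 0.0, ref_point[1]
    for f1, f2 in pts:
        if f2 < best_f2:
            hv += (ref_point[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return hv


# Two hypothetical Pareto fronts in minimization form: (-validation accuracy, latency).
front_a = np.array([[-95.0, 2.0], [-93.0, 1.5]])
front_b = np.array([[-94.0, 1.8]])

# With a shared, fixed reference point the two hypervolumes are directly comparable ...
shared_ref = np.array([0.0, 10.0])
print(hypervolume_2d(front_a, shared_ref), hypervolume_2d(front_b, shared_ref))

# ... whereas a reference point taken from each solver's own worst observations
# measures the two fronts against different baselines and rewards bad samples.
```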

eddiebergman commented 1 year ago

Hi @thchang, you are indeed correct that this is not the proper way to calculate hypervolume; it should instead be computed against the fixed boundaries of the metrics, not their observed minima or maxima. That requires the hypervolume calculation function to know which metrics are used and what their boundaries are.
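Roughly, I imagine something along these lines (purely a sketch with made-up bounds and helper names, not existing benchmark code), where the reference point is derived from declared metric boundaries and optimization directions instead of observed values:

```python
import numpy as np

# Hypothetical per-metric bounds and optimization directions, known up front,
# so the reference point no longer depends on what any particular solver observed.
metric_bounds = {
    "valid-acc": (0.0, 100.0),  # accuracy in percent
    "latency": (0.0, 10.0),     # upper bound chosen for illustration only
}
directions = {"valid-acc": "max", "latency": "min"}


def reference_point(bounds: dict, directions: dict) -> np.ndarray:
    """Worst attainable value per metric, expressed in a minimization convention."""
    ref = []
    for name, (lo, hi) in bounds.items():
        # Maximized metrics are negated so that every objective is minimized;
        # their worst case is -lo, while minimized metrics keep hi as the worst case.
        ref.append(-lo if directions[name] == "max" else hi)
    return np.array(ref)


print(reference_point(metric_bounds, directions))  # worst case per objective
```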

@NeoChaos12 are you still up for maintaining this?

CC: @DaStoll

thchang commented 1 year ago

Thanks for the reply!

In case it helps, for now I've been using the following as my reference point to get started:

```python
val_acc_min = 0.0
latency_max = 10.0
```

but I've only tried this with the CIFAR-10 benchmark and haven't investigated the others yet.
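Concretely, this is how I've been applying that interim reference point. The query snippet follows the README example, but the epoch-keyed result structure, the metric keys, and the sign convention are my own assumptions:

```python
import numpy as np
import jahs_bench
from pymoo.indicators.hv import HV  # assumes pymoo >= 0.6 is installed

# Query a few random configurations on CIFAR-10 (API as shown in the README).
benchmark = jahs_bench.Benchmark(task="cifar10", download=True)
observations = []
for _ in range(10):
    config = benchmark.sample_config()
    results = benchmark(config, nepochs=200)
    metrics = results[200]  # assumption: results are keyed by epoch
    # Minimization convention: (-validation accuracy, latency).
    observations.append([-metrics["valid-acc"], metrics["latency"]])

# Reference point derived from val_acc_min = 0.0 and latency_max = 10.0 above.
ref_point = np.array([-0.0, 10.0])
print(HV(ref_point=ref_point)(np.array(observations)))
```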

NeoChaos12 commented 1 year ago

Hi @thchang,

Thank you for bringing this to our attention! This sounds reasonable, though I would like to hear @worstseed's opinion on this issue.

@eddiebergman I may not be able to step back into an active maintainer role for a while yet, maybe a couple of weeks, though I do still have a long to-do list for the repo.

Best, Archit

worstseed commented 1 year ago

Hello @thchang,

I want to start by apologizing for the delay in responding to your questions. We value your interest and feedback, and we're sorry for any inconvenience this has caused.

Your insights regarding the reference point for the hypervolume calculation in our benchmark are quite perceptive. Currently, we are using the minimum of all observed values, which is a common practice when there isn't an explicit worst-case scenario. However, your point about the potential for bias is well-taken. Ideally, we would indeed use worst-case scenarios from the benchmark as reference points. This approach would prevent methods that sample poor configurations from potentially inflating their hypervolume scores. We are grateful for your suggestion and will seriously consider revising our approach to provide a more accurate comparison of different solvers.

Regarding your second point about the objectives in the multi-objective benchmark, you are correct. The objectives are to maximize validation accuracy and minimize latency. We acknowledge that this could have been clearer in our evaluation protocol. Thank you for bringing this to our attention; we'll make sure to clarify this in our documentation.

We appreciate your valuable feedback and your interest in our work. Please do not hesitate to reach out if you have any further questions or suggestions.

Best regards, Maciej