cavalab / srbench

A living benchmark framework for symbolic regression
https://cavalab.org/srbench/
GNU General Public License v3.0

Wall clock time benchmark #67

Closed · MilesCranmer closed this 2 years ago

MilesCranmer commented 2 years ago

As discussed in #62 with @lacava (and discussed a bit in #24 by others last year), I think a wall clock time benchmark would be a really nice complement to comparing over a fixed number of evaluations.

I think fixing the number of evaluations is only one way of enforcing a level playing field. One could also fix:

- wall-clock time,
- memory usage,
- total FLOPs,

or any other way of bottlenecking the internal processes of a search code. Measuring against only one of these artificially biases a comparison against algorithms which, for whatever reason, make heavier use of the others.

Some algorithms are sample intensive, while other algorithms do a lot of work between steps. Comparing algorithms based only on # of evaluations artificially biases the comparison against sample-intensive algorithms.

An algorithm and its implementation are not easily separable, so I would argue that you really need to measure wall-clock time to see the whole picture. Not only is this much more helpful from a user's point of view, but it lets algorithms that are intrinsically designed for parallelism actually demonstrate their performance gains. The same can be said for other algorithmic sacrifices, like those required for rigid data structures, data batching, etc.

Of course, there is no single best solution, and every type of benchmark provides additional information. So I think this wall-clock benchmark should be included alongside the normal fixed-evaluation benchmark, using a separate set of tuned parameters, which would give a more complete picture of performance.

Finally, I note I have a conflict of interest since PySR/SymbolicRegression.jl are designed for parallelism and fast evaluation, but hopefully my above points are not too influenced by this!

Eager to hear others' thoughts

MilesCranmer commented 2 years ago

One other point:

Number of evaluations becomes tricky to measure for deep learning models. How do I know a neural network has not learned how to do an internal evaluation of an equation? Perhaps some RL model is internally evaluating possible equations in the latent space before outputting an updated model.

Clock time is a universal way to bottleneck the performance of deep networks or GAs in a way that corresponds to what a user is actually after - the time to get an equation.
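For concreteness, here is a minimal sketch of what enforcing a wall-clock budget around an arbitrary regressor's `fit` call could look like (the process-based timeout, the helper names, and the one-hour default are my own illustrative assumptions, not anything srbench currently does):

```python
import multiprocessing as mp
from queue import Empty

def _fit_worker(est, X, y, out_queue):
    # Run the fit in a child process so it can be killed at the time limit.
    est.fit(X, y)
    out_queue.put(est)

def fit_with_time_budget(est, X, y, budget_s=3600):
    """Illustrative only: fit `est`, giving up after `budget_s` seconds.

    Assumes the estimator is picklable; a real harness would also want to
    recover the best model found so far instead of discarding the run.
    """
    out_queue = mp.Queue()
    proc = mp.Process(target=_fit_worker, args=(est, X, y, out_queue))
    proc.start()
    try:
        fitted = out_queue.get(timeout=budget_s)  # wait up to the budget
    except Empty:
        fitted = None                             # budget exhausted
    finally:
        if proc.is_alive():
            proc.terminate()
        proc.join()
    return fitted
```

In practice most methods would want a cooperative time limit (so they can return their best-so-far model) rather than a hard kill, but this shows the basic idea.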

lacava commented 2 years ago

I'm definitely open to running a benchmark where methods just get a fixed budget of wall-clock time and memory. It has its own problems (measurement variability, cluster heterogeneity, bias towards particular implementations), but agreed that it is closer to the real-world use case.

In fact, we're hosting a competition at GECCO 2022, based around srbench but with unseen datasets, and we will discuss using runtime limits there. (You should participate, there will be cash prizes!)

I have more thoughts, but here are just a couple:

Some algorithms are sample intensive, while other algorithms do a lot of work between steps. Comparing algorithms based only on # of evaluations artificially biases the comparison against sample-intensive algorithms.

Yes, this is true. Important to note that if the "work between steps" involves evaluating individuals on data (e.g. gradient descent or other local search), that is counted towards evaluations.

An algorithm and its implementation are not easily separable, so I would argue that you really need to measure wall-clock time to see the whole picture. Not only is this much more helpful from a user's point of view, but it lets algorithms that are intrinsically designed for parallelism actually demonstrate their performance gains. The same can be said for other algorithmic sacrifices, like those required for rigid data structures, data batching, etc.

It's important to note that we are currently benchmarking algorithms on a single core per dataset and trial. Why? Because it is faster when you have 252 problems, 10 trials, and ~20 rounds of hyperparameter tuning (CV) on a limited set of resources (a cluster with ~1000 CPU cores, in our case).

That's about 50k training instances per algorithm. So even if an algorithm runs 1000 times faster per instance on 1000 cores, we would get virtually no speedup when running the whole experiment: Speedup = 1 / ((1 - p) + p/s) ≈ 1.00, where p is the portion of work sped up (1/50k) and s is the speedup of that portion (1000). Whereas we could get a speedup of ~1000 by parallelizing over problems, trials, and CV (Amdahl's law).
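Spelling out that arithmetic (nothing new here, just the numbers from the paragraph above):

```python
def amdahl_speedup(p, s):
    # Amdahl's law: overall speedup when a fraction p of the work
    # is accelerated by a factor s.
    return 1.0 / ((1.0 - p) + p / s)

# Giving 1000 cores to a single training instance out of ~50k:
print(amdahl_speedup(p=1 / 50_000, s=1000))  # ~1.00002, i.e. no real speedup

# Parallelizing across problems, trials, and CV folds instead,
# so essentially all of the work runs 1000-wide:
print(amdahl_speedup(p=1.0, s=1000))         # 1000.0
```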

Just another note: we've taken mini-batches into account when scaling the # of evaluations, i.e. methods that only use 1/2 the data each round get 1M individual evaluations.
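As a concrete example of that scaling rule (sketch only; the 500k full-data budget is an assumption, chosen because it is consistent with the 1M half-batch figure above):

```python
# Assumed full-data evaluation budget (implied by the 1M half-batch figure).
FULL_DATA_BUDGET = 500_000

def scaled_eval_budget(batch_fraction, base=FULL_DATA_BUDGET):
    # A method evaluating individuals on only `batch_fraction` of the data
    # each round gets proportionally more individual evaluations, so the
    # total amount of (individual x sample) work stays fixed.
    return int(base / batch_fraction)

print(scaled_eval_budget(1.0))  # 500000
print(scaled_eval_budget(0.5))  # 1000000 -> "1M individual evaluations"
```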

MilesCranmer commented 2 years ago

Thanks for the detailed explanations! The competition is a great idea.

It's important to note that we are currently benchmarking algorithms on a single core per dataset and trial. Why? Because it is faster when you have 252 problems, 10 trials, and ~20 rounds of hyperparameter tuning (CV) on a limited set of resources (a cluster with ~1000 CPU cores, in our case).

This makes sense for implementing the benchmark, but I think showing information about scalability is very important for real-world use cases. As a user, I want an algorithm that can use all the resources available to find the best equation in the shortest amount of time. This is not just an implementation detail, as some algorithms are intrinsically serial or parallel. And, for example, some algorithms might be able to exploit GPUs (e.g., a dictionary-based method like SINDy) or SIMD instructions by making sacrifices in other areas - and this should definitely be shown.

How should one benchmark this? I think the benchmark should allow an algorithm to use up to some number m of cores (or a GPU, if an algorithm can use it). If an algorithm can't use all of the available cores and is stuck on one core, then those cores will remain idle and its recorded performance should take a hit as a result. Running the benchmark this way might leave cores idle, but I think it's definitely important to measure. Even 8 cores per algorithm would be good in my opinion - the worst case is 8x the compute to run it, but I'm happy to contribute cluster hours if needed.
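To make the "up to m cores" idea concrete, something as simple as the following could cap the thread count for methods that respect the standard threading environment variables (a rough sketch; the function name and the 8-core figure are just illustrative, and a stricter harness would also pin processes with e.g. `taskset` on Linux):

```python
import os

def limit_to_m_cores(m):
    # Ask the common threading backends (OpenMP, MKL, OpenBLAS, numexpr)
    # to use at most m threads. Must be set before the libraries that
    # read these variables are imported/initialized.
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
                "OPENBLAS_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
        os.environ[var] = str(m)

limit_to_m_cores(8)  # e.g. an 8-core budget per algorithm
```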

To explain why I think this is so important: the entire reason deep learning is considered successful is that it exploits modern hardware so well, not that it can out-perform classic algorithms on a single CPU core with a fixed # of FLOPs. Therefore I really think multicore performance (+ hardware accelerators) should be a core aspect of any benchmark. See this interesting paper on the "hardware lottery" for some discussion of this. Deep learning is not so much the best algorithm out there as it is the winner of the hardware lottery, and its popularity is a result of this! So I think a benchmark should show who wins the hardware lottery rather than trying to avoid it, since the hardware lottery has been (and will continue to be) central to the history of ML.

marcovirgolin commented 2 years ago

I agree that including results w.r.t. a maximum time limit is interesting and can be very important in practice.

I think the best way forward is to attempt to include wall-clock-time-based experiments. I do not think it should replace what we have now, because evaluations have their own pros and cons, just like time does.

The thing that might be an issue is the aforementioned need to keep each run serial and parallelize across runs (inter-run) rather than within a run (intra-run), given the size of SRBench. So, what if we reduce the number of problems considered for the time-based comparison, and focus only on "large" data sets and relatively "fast" algorithms? I say this because efforts to make an algorithm faster are most meaningful when the problem is large, aren't they?

MilesCranmer commented 2 years ago

I do not think it should replace what we have now, because evaluations have their own pros and cons, just like time does.

Definitely! There is no perfect metric, and I think wall-clock time is just another data point; it should not replace anything. I simply think that sample efficiency alone is not enough to see the full picture, so wall-clock time should be reported as another feature.

So, what if we reduce the number of problems considered for the time-based comparison, and focus only on "large" data sets and relatively "fast" algorithms?

I think reducing the # of problems makes sense if there is a computational limit, definitely. I'm not sure about dataset size, though - even small datasets should be enough of a constraint on a complex equation, so long as there's no noise.

lacava commented 2 years ago

just thought i'd mention (in case anyone hadn't seen them) that there are additional comparison plots of runtime versus dataset size in https://github.com/cavalab/srbench/blob/master/postprocessing/blackbox_results.ipynb

e.g., the runtime vs. dataset size plot from that notebook.

I like the idea of doing a controlled comparison on a small number of datasets with different numbers of samples (and maybe different complexities of generating function) on parallel hardware.