egraphs-good / extraction-gym

benchmarking e-graph extraction

Line graph of benchmark results #33

TrevorHansen opened this issue 6 months ago

TrevorHansen commented 6 months ago

This produces a graph of the benchmark results, e.g.:

[plot: dag_cost_improvement]

For each extractor it shows the improvement vs. time relative to a reference extractor. For this run I gave each extractor a 2000 second timeout; after a timeout, the result from faster-greedy-dag was returned instead. Since faster-greedy-dag is also the reference, the timeouts just show time being spent with no improvement.

I've got some ideas about how to make it more informative.

I'm keen to hear what others would find helpful.

mwillsey commented 6 months ago

I love the idea of better graphics, but I'm afraid I don't really understand what's going on with this plot atm.

Bastacyclop commented 6 months ago

I can imagine plotting, for every dataset and for all of them combined, a figure with the extractor on the x-axis and time or cost quantiles on the y-axis.
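
As a rough matplotlib sketch of that idea (the extractor names and cost numbers here are made up, and pooling all datasets stands in for the per-dataset figures):

```python
import matplotlib.pyplot as plt

# Hypothetical data: extraction costs per extractor, pooled over benchmarks.
costs = {
    "bottom-up": [120, 95, 300, 88, 140],
    "faster-greedy-dag": [110, 90, 280, 85, 130],
    "global-greedy-dag": [105, 89, 250, 84, 126],
}

# A box plot puts one extractor at each x position and shows the
# cost quantiles (median, quartiles, whiskers) on the y-axis.
fig, ax = plt.subplots()
ax.boxplot(list(costs.values()), labels=list(costs.keys()))
ax.set_xlabel("extractor")
ax.set_ylabel("DAG cost")
plt.show()
```

The same plot with extraction times on the y-axis would cover the time-quantile variant.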

Bastacyclop commented 6 months ago

I also like the general idea of Trevor's graph for incremental extractors: because you can decide how much time to spend on extraction, there is a trade-off between the time allowed and the actual cost reduction achieved. Maybe incremental extractors could have a different "tag" or trait to produce these additional plots, where cost can be reported over time.
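
Since this is just an interface discussion, here is a loose Python sketch of the shape such a trait could take (the real extractors are Rust; everything below is hypothetical):

```python
from abc import ABC, abstractmethod
from typing import Iterator, Tuple

class IncrementalExtractor(ABC):
    """Hypothetical interface for extractors that can report cost over time."""

    @abstractmethod
    def extract_incremental(self, egraph) -> Iterator[Tuple[float, float]]:
        """Yield (elapsed CPU seconds, best cost found so far) pairs
        as extraction proceeds, until the caller stops iterating."""
```

The benchmark harness could then sample the iterator to build cost-over-time curves only for extractors tagged this way.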

TrevorHansen commented 6 months ago

Thanks @Bastacyclop & @mwillsey for the input.

I've not compared optimisation solvers before, so I don't know how it's commonly done. From an updated comment in the plot.py file, here's what this graph shows:

This assumes an extractor is run on all the egraph benchmarks at the same time. So given 500 egraphs, each receives 1/500th of a second of the first second's CPU time. Say 10 egraphs finish extraction in less than 1/500th of a second of CPU time, i.e. they have a runtime of less than 2ms. Then for the 2nd second of CPU time, each of the remaining egraphs gets 1/490th of a second.

Continuing the example: if those 10 egraphs that finished within the first 1/500th of a second improved on the reference implementation's cost by a combined 20, then the graph plots an improvement of 20 at 1 second.

At 2 seconds, the improvement will be the sum of the improvements of all the egraphs that finished in less than 1/500th + 1/490th of a second, i.e. with a total runtime of less than about 4.04ms.

This will continue until the timeout on the extractor is reached.
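
A sketch of how that accounting could be computed offline from per-benchmark runtimes (the function name and the equal-slice step model are my own, not extraction-gym's plot.py):

```python
def improvement_curve(runtimes, improvements, horizon, step=1.0):
    """Simulate running all benchmarks concurrently on one CPU.

    runtimes[i]     -- CPU seconds benchmark i needs to finish extraction
    improvements[i] -- cost improvement vs. the reference once it finishes
    Returns (time, cumulative improvement) points up to `horizon` seconds.
    """
    remaining = list(runtimes)
    done = [False] * len(runtimes)
    total, curve, t = 0.0, [], 0.0
    while t < horizon and not all(done):
        active = [i for i, d in enumerate(done) if not d]
        share = step / len(active)  # each unfinished benchmark gets an equal slice
        for i in active:
            remaining[i] -= share
            if remaining[i] <= 0:
                done[i] = True
                total += improvements[i]
        t += step
        curve.append((t, total))
    return curve

# With 500 benchmarks, 10 of which finish within their 1/500 s slice,
# those 10 contribute their improvement at t = 1 s, and the other 490
# share the second second at 1/490 s each, as described above.
```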

I don't claim this is the best way to compare extractors graphically, just one that makes sense to me.

mwillsey commented 6 months ago

I think that might be a little too complicated. Perhaps another idea is to just scatter plot time vs cost. Each point is a single extraction of a single egraph, x = time, y = cost, color/shape = extractor. I think your idea of normalizing the cost is a good one and might be needed in this scheme as well. But yes, this seems like a tricky visualization problem.
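
A minimal version of that scatter, with made-up records and costs normalized to the reference extractor (as suggested):

```python
import matplotlib.pyplot as plt

# Hypothetical records: (extractor, extraction time in seconds,
# cost normalized to the reference extractor's cost on that egraph).
records = [
    ("bottom-up", 0.02, 1.00), ("bottom-up", 0.04, 0.99),
    ("faster-greedy-dag", 0.05, 0.93), ("faster-greedy-dag", 0.09, 0.95),
    ("global-greedy-dag", 1.40, 0.88), ("global-greedy-dag", 2.10, 0.90),
]

fig, ax = plt.subplots()
for name in sorted({r[0] for r in records}):
    xs = [t for n, t, _ in records if n == name]
    ys = [c for n, _, c in records if n == name]
    ax.scatter(xs, ys, label=name)  # one colour per extractor
ax.set_xlabel("extraction time (s)")
ax.set_ylabel("cost / reference cost")
ax.legend()
plt.show()
```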

TrevorHansen commented 5 months ago

I've made this a bit more confusing, but more useful. It now plots cumulative percentage improvement. So 1% means that the arithmetic-average improvement in the DAG-cost across all the benchmarks is 1%.
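
As a hypothetical illustration of the metric (numbers invented so the average lands at 1%):

```python
# Per-benchmark percentage improvement in DAG cost vs. the reference,
# then arithmetic-averaged across all benchmarks.
ref_costs = [100.0, 250.0, 80.0]   # reference extractor's DAG costs
new_costs = [ 99.0, 247.5, 79.2]   # extractor under test

pct = [100.0 * (r - n) / r for r, n in zip(ref_costs, new_costs)]
avg = sum(pct) / len(pct)
print(f"average improvement: {avg:.2f}%")  # prints 1.00%
```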

I tried log-scale for time, but didn't find it as helpful (see below).

Next, I'll do a separate plot which shows the number of timeouts. I think we need a dashboard with a few graphs to show what's happening.

I tried the scatter plot, but found it too crowded to be helpful.

[plots: dag_cost_improvement, linear and log time-scale variants]