faster sharing-aware cost extraction

TrevorHansen commented 10 months ago

This: (a) speeds up the current greedy extraction, and (b) adds a faster extraction based on the code of @Bastacyclop & @mwillsey

In particular, changing the hashmap library to the rust standard collection reduces the runtime of the greedy extraction from 195 seconds to 47 seconds on my machine using all the test cases in main along with those in the open PRs: https://github.com/TrevorHansen/extraction-gym/tree/main/data

This PR also adds a new extractor that takes about 9 seconds. It's based on the greedy extractor in #8 combined with the simple extractor in main. This doesn't produce the same results though as the current greedy extractor:

Loaded 422 jsons.
extractors: ['faster-greedy-dag', 'greedy-dag']
data/tensat/nasneta_acyclic.json 16023614866670.953 16023517195541.84
data/tensat/nasrnn.json 1158.6070294405508 930.8220652824966
data/egg/math_simplify_add.json 3 7
Cummulative time for faster-greedy-dag: 8839ms
Cummulative time for greedy-dag: 47155ms
faster-greedy-dag / greedy-dag
geo mean
tree: 0.9970
dag: 1.0002
micros: 0.9750
quantiles
tree:   0.4286, 1.0000, 1.0000, 1.0000, 1.2447
dag:    0.9957, 1.0000, 1.0000, 1.0000, 1.0225
micros: 0.0910, 0.5888, 1.0630, 1.5546, 3.0305

I think you get different answers depending on the order in which you visit the nodes.

I'm happy to clean this up as required.

mwillsey commented 10 months ago

Oh interesting. I think this is implying that there are a couple heavy hitters in terms of time spent in extraction. The geometric mean reports that this code is about as fast, but in absolute time it's much faster. Any idea what's going on there?

TrevorHansen commented 10 months ago

Oh interesting. I think this is implying that there are a couple heavy hitters in terms of time spent in extraction. The geometric mean reports that this code is about as fast, but in absolute time it's much faster. Any idea what's going on there?

Yes, Max, you're right about that. With the 211 inputs, these 4 problems provide about 95% of the time improvement:

file	faster greedy dag extractor (ms)	old extractor with std hashmap (ms)
data/tensat/nasrnn_acyclic.json	312	2,560
data/tensat/nasneta.json	1,392	15,290
data/tensat/nasneta_acyclic.json	2,602	17,613
data/tensat/nasrnn.json	3,163	8,904
total	7,469	44,367

The other 207 problems take 1.3 seconds to be solved by the new extractor and 2.8 seconds for the old extractor. So there are a lot more easy problems than hard problems, which hides the improvement in the geometric average.

mwillsey commented 10 months ago

Awesome, thanks for the PR! Do you happen to have additional benchmarking data as well? As it is, it's quite hard to actually see the benefit of cost-sharing extraction, since few of the benchmarks seem to have it (or at least not that we have found).

TrevorHansen commented 10 months ago

Awesome, thanks for the PR! Do you happen to have additional benchmarking data as well? As it is, it's quite hard to actually see the benefit of cost-sharing extraction, since few of the benchmarks seem to have it (or at least not that we have found).

Yes, it makes sense that you'd like more benchmarks that show the benefit of sharing-aware extraction given that it's so expensive to undertake. I'll make some, but will be slow, probably 6 weeks before I can them ready.

egraphs-good / extraction-gym

faster sharing-aware cost extraction #10