fitzgen opened this issue 6 years ago
We've never really systematized this, unfortunately. It should be done. There are some questions about which kinds of test cases matter most. I've usually operated on the assumption that the huge C++ inputs that can take a couple of days to reduce are the problem we should be attacking, but I'm not even sure that's correct. Getting a collection of these difficult reductions would be nice.
Thanks for the reply! :)
I've been timing how long one particular bindgen bug took to reduce, snapshotting the reductions along the way so I don't have to run the whole days-long reduction every time I want to test a change. Obviously, a partially reduced test case of size N isn't directly comparable to a fresh test case of size N, but I don't have days to wait for every little iteration ;)
Additionally, bindgen's inputs are header files, and any function definitions we see are ignored, so I'm not sure that our test cases are more generally representative. It is, however, exactly what I personally care about ;)
All that said, I've got a little something up at https://github.com/fitzgen/reduction-benchmarks The benchmark measures time spent reducing as more parallel jobs are used. It's Linux-only at the moment because of dumb differences in `time` on Linux and macOS; I haven't touched Windows.
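One way to sidestep the `time` portability problem is to measure wall-clock time from a Python harness instead of the shell. A minimal sketch of the parallel-jobs sweep, assuming a creduce binary on `$PATH` and placeholder `interestingness.sh` / `input.h` paths (`--n` is C-Reduce's flag for the number of parallel interestingness tests):

```python
import subprocess
import time

def time_reduction(jobs, script="interestingness.sh", testcase="input.h",
                   creduce="creduce"):
    """Time one C-Reduce run with `jobs` parallel workers, using Python's
    monotonic clock rather than /usr/bin/time, whose flags and output
    format differ between GNU (Linux) and BSD (macOS)."""
    start = time.monotonic()
    subprocess.run([creduce, "--n", str(jobs), script, testcase], check=True)
    return time.monotonic() - start

def sweep(max_jobs=8):
    # Wall-clock reduction time as parallelism increases; remember to
    # restore the test case between runs, since C-Reduce mutates it.
    return {j: time_reduction(j) for j in range(1, max_jobs + 1)}
```

This is only a sketch; a real harness would also restore the input file and reset any build artifacts between runs so each job count starts from the same state.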
I've been exploring parallel reduction with https://github.com/fitzgen/preduce but it's still very much a WIP and I don't have any breakthroughs to report yet ;)
Cool!
Now that I've thought about this a bit more, we do have something: a collection of open source compiler bug triggers that I gathered a while ago for the "evaluation" part of a C-Reduce journal paper that has unfortunately gotten stalled out.
@eeide do you recall where those are? Can we make them public, if they aren't already?
Ah, @mpflanzer reminded me that the crash reductions are here: https://github.com/regehr/compiler-crashes There's very wide variation in how fast they reduce. My idea was to optimize for total, not average, reduction time, since speeding up short reductions is irrelevant.
Never mind what I said about total vs. average, bleh.
There is also the corpus that we used for the PLDI paper about C-Reduce, which I don't think we have ever released as a packaged thing. I was using this corpus for the C-Reduce journal paper, too. We should release that corpus as well, but I won't have time to package it until at least August.
FWIW, another metric is the size of the final output.
IIRC, when I ran the PLDI corpus with current C-Reduce, some of the reduced outputs were larger than they were back in the day. They were all still quite small, as I recall, but one or two might have ended up ~100 chars larger (?). I would have to look this up.
Anyway, I at least wanted to mention "quality of reduction" as an important metric :-).
Well, one of the experiments I was going to run for the paper in question is (for some of the faster reductions) running them thousands of times with passes in randomized order, to get a sense of how often the search space branches without coming back together.
My sense is that in some cases we'll end up at more or less the same final output regardless of the path taken, while in others there will be significant variation in what the program looks like when C-Reduce reaches a fixpoint.
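Once the randomized runs have been collected (the randomized-pass-order driver itself isn't shown here, and I don't believe stock C-Reduce exposes it as a flag), summarizing the branching is straightforward: group the final outputs by content hash and see how much of the probability mass lands on the modal fixpoint. A minimal sketch:

```python
import collections
import hashlib

def tally_outputs(finals):
    """Group reduced outputs by content hash, so byte-identical
    fixpoints from different reduction paths collapse into one bucket.
    `finals` is a list of final test-case texts from repeated runs."""
    return collections.Counter(
        hashlib.sha256(text.encode()).hexdigest()[:12] for text in finals
    )

def branching_summary(finals):
    """Return (number of distinct fixpoints, fraction of runs that
    reached the most common one)."""
    counts = tally_outputs(finals)
    modal_share = counts.most_common(1)[0][1] / len(finals)
    return len(counts), modal_share
```

A summary of `(1, 1.0)` would mean every path converged to the same output; many distinct hashes with a low modal share would indicate the search space branches without coming back together.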
How do you benchmark potential performance optimizations, tweaks to priority heuristics, etc.? Is there a suite of benchmarks somewhere? Do you time how long the tests take to run? Something else?
Thanks!