Implement Z3 perf regression tracking

nunoplopes commented 4 years ago

Z3 4.8.9 regressed perf compared to 4.8.8 in a few benchmarks (e.g., just see CI bots, both linux & Mac). It would be nice to have an automated way of diffing new timeouts in unit tests & LLVM test suite and then produce Z3 logs using Z3_open_log and friends. Similarly, we could have an automated way of running John's perl script to detect score degradations with new Z3 releases.

@NikolajBjorner says he'll fix any reported regression in 24 hours 😁

aqjune commented 4 years ago

Would it be great if I build a small webpage that tracks this?

nunoplopes commented 4 years ago

yes, I think that would be very useful, but it's not urgent 😀 It has a bit that is not obvious is that it needs to account for SMT random seeds. Z3 doesn't consider perf bugs if changing the seed makes it solve the problem. So we would need to sample a few seeds for each test and version.

NikolajBjorner commented 4 years ago

If the average run-time over several random seeds is higher, then it is still a perf regression.

aqjune commented 4 years ago

Okay, maybe I can pick a few random seeds (well, randomly) and show the individual results as well as the average result.

NikolajBjorner commented 4 years ago

As Nuno wrote use a few samples. The seeds don't need to be random, just numbers 1-10 will do the trick.

for i in range(10):
     os.system("z3.exe smt.random_seed=%d %s -st" % (i, file))

aqjune commented 4 years ago

I made this page: http://147.46.242.54:15243 It tracks the performance of Alive2 w.r.t Z3 when running LLVM unit tests. The detailed page shows in which tests timeout ratios are high (http://147.46.242.54:15243/show/8985651-af3789a-6cc52e0). Currently, it has only one slot filled but I'm going to run a batch of tests & fill the remaining slots. Also, there isn't diff two results page yet; I'll make the page, eh, maybe after the preparation of Q&A for my talk :)

aqjune commented 4 years ago

Okay, diff is now working! Dashboard is here. Log dump can be downloaded from the statistics page for the specific configuration (e.g., here). I'll run experiment with the fresh Z3 1~2 times per week.

regehr commented 4 years ago

this is very cool!

NikolajBjorner commented 4 years ago

Is there a way to (@nuno, I talking to you with the addition of Z3_open_log..)

extract self-contained repros for queries that are infeasible with all versions?
extract self-contained repros for queries that show large variations. There is a long tail of very easy queries and what appears to be a short head of difficult ones. So it seems the main areas to improve is to understand what makes the harder queries and this way flatten the curve.

nunoplopes commented 4 years ago

that's the plan. We've already added support for dumping Z3 logs. So it's just a matter of running Alive2 with the right arguments and then spam your email while you are on vacations 😁

@aqjune I really hope that code is PHP 🙄 Would be nice to upstream that.

aqjune commented 4 years ago

I can make the log dumps contain the generated VCs as well; the smt-verbose option, in addition to the z3 open log, will work. Or, to make the log dump more lightweight, I can add a page that lists the tests that has been timeout for a long period first and run on those only. It will make the log dump smaller (currently the log dump is already ~400 megabytes per configuration).

The webpage is written in Python using django. No special reason, just my PHP experience is almost nothing (to make a really simple homepage about >10yrs ago) :)

aqjune commented 4 years ago

Seems like Alive2 has a performance boost between https://github.com/Z3Prover/z3/commit/6cc52e04c3ea7e2534644a285d231bdaaafd8714 and https://github.com/Z3Prover/z3/commit/ae7d76767b614dd8f77a692e00d6d157df31e03d ! Is this an expected change?

nunoplopes commented 4 years ago

are you averaging over how many random seeds?

aqjune commented 4 years ago

I'm currently giving 5 seeds.

nunoplopes commented 4 years ago

I don't think there's any reason for a perf change judging from the Z3 commits msgs. One think to keep in mind is that Z3 is memory bound; it has a lot of cache misses. That means that if you have a lot of processes they start contending to access the memory and thus results won't be very reliable. You need to assess what's the max number of threads in your computer before perf starts to become unstable. No hyperthreading is a start, but you may need to go even lower depending on the memory BW of your CPU.

aqjune commented 4 years ago

Thank you for the advice! I was using hyperthreading enabled because using more cores gave additional speedup. For performance comparison, I'll disable it.

aqjune commented 4 years ago

Lowered the number of cores to 8 (from 24), and filling slots with the new configuration

AliveToolkit / alive2

Implement Z3 perf regression tracking #492