Z3Prover / z3

The Z3 Theorem Prover
Other
10.38k stars 1.48k forks source link

Unexpected performance regression on a simple-looking query #4702

Open zvonimir opened 4 years ago

zvonimir commented 4 years ago

Z3 4.8.8 solves this query instantly (as well as all previous releases as far as I can remember), while the new Z3 4.8.9 times out: pero.zip

This is pretty unexpected since the query is small and (to the best of my knowledge) does not contain features where we would expect to see performance regressions when updating releases (non-linear arithmetic, quantifiers, etc.).

Any suggestions? What are we maybe missing?

Thanks!

zvonimir commented 4 years ago

Ok, this seems to be due to switching the default arithmetic solver to lra instead of simplex. When I revert to simplex it terminates fast. Is this expected or is this maybe a performance bug?

Could you maybe comment on lra vs simplex trade-offs a little bit, and also what your suggestion would be we should do?

I am sure you had good reason to switch to lra, but it seems that on our benchmarks, and even on simple one, simplex performs better. Thanks!

NikolajBjorner commented 4 years ago

Some notes for diagnosis:

levnach commented 4 years ago

@zvonimir , it seems fixed in the current release. Can you give it a shot?

NikolajBjorner commented 4 years ago

It appears to be a bit too early to test the change, but once we have figured this out it will be good with a more comprehensive validation, especially on small examples like this one that is more feasible to analyze.

zvonimir commented 4 years ago

I can test it on our regressions as soon as you think it is ready. Is it ready btw (sorry, the two messages above confused me)? Also, do you still have a nightly release (can't find it any more) or should I compile it from scratch on my own? Thanks!

NikolajBjorner commented 4 years ago

https://github.com/Z3Prover/z3/releases/tag/Nightly

NikolajBjorner commented 4 years ago

I tried the latest change. It still diverges depending on which random seed is used. So the change doesn't address the main cause and I posted a head's up that you will run into similar regressions. The change seems for the better (it avoids some heuristics when there are big-nums).

zvonimir commented 4 years ago

Got it. So we'll wait until you stabilize this a bit more. Thx!

levnach commented 4 years ago

Now it seems really fixed.

NikolajBjorner commented 4 years ago

Thanks @levnach! For @zvonimir: if you get a chance to try other regression tests with the default arithmetic solver setting it would be great to know about other places that are meaningful to your uses. We want to get the new arithmetic solver used in mainstream so weeding out issues that are relevant to users is a priority.

zvonimir commented 4 years ago

Thanks! I tried the new nightly build, and it is for sure working better in terms of performance/timeouts. However, there are still two benchmarks that are timing out, and they are terminating fine with the old arithmetic decision procedure. I am attaching the smaller one here: pero.txt Note that this one is a bit less straightforward in terms of what is causing this performance degradation since it contains quantifiers, etc. Please let me know what you think. Thanks!

zvonimir commented 4 years ago

Quick update. I run some more extensive experiments beyond just our small regressions. I am noticing quite a bit more timeouts with the new arithmetic solver on the same (or very similar) kinds of problems, meaning problems that generated queries similar to the one I posted just above.

NikolajBjorner commented 4 years ago

Ok, thanks. We will carve out some time to go through this. If you have a chance to assemble a few more samples it could help to prepare for this examination.

NikolajBjorner commented 4 years ago

I pushed a change that improves the picture somewhat on the attached example. It possibly doesn't address the root cause, but debugging the file revealed a way the solver could end up generating useless lemmas.

zvonimir commented 4 years ago

Your changes helped for our regressions. Since then, I run the nightly version on thousands of SVCOMP benchmarks, and we are seeing around 5% or so more timeouts now. I don't see a clear pattern among the newly introduced timeouts to be able to pick good representative queries. However, I am attaching here this one: pero.txt As is the case with the previous queries, the old Z3 solves it super fast, while the new on chokes on it for some reason. Maybe you can take a look at it and see if you can spot anything obvious.

Oh, another weird thing about this one is that setting smt.relevancy to 0 or not makes a difference, which I thought should not be the case since the query contains no quantifiers. Any hints as to why that is so?

Thanks!

zvonimir commented 3 years ago

So I finally run our large-scale experiments and did a comparison between old arithmetic solver and the new one in the nightly build. Right now the differences seem to be minimal. Here is an example: http://node0.smackbenchwebserver.smack.emulab.net/results.py?runset=runs%2Fexec_2020.12.19_13.19.02.447419_SoftwareSystems-DeviceDriversLinux64-ReachSafety&runset=runs%2Fexec_2020.12.15_10.44.41.142974_SoftwareSystems-DeviceDriversLinux64-ReachSafety#/

In terms of benchmarks that did not time out, the result is 1367 (new solver) vs 1372 (old solver). So the old solver is still a bit better, but the differences are pretty negligible it seems.

So we will make the switch I hope to the new solver as soon as you make your next release.

Thanks for your help!

NikolajBjorner commented 3 years ago

Nice benchmarking infrastructure! The scatter plot looks a little too uniform to be believable. So there is a chance that some parameters got ignored or overwritten or omitted. I typically look for samples where the differences are starkest and work from these. It is true I have yet to dig into your newer examples on this thread.

zvonimir commented 3 years ago

Oh, you are right. Let me double-check this. It indeed looks weird that they are so similar. I'll get back to you.

NikolajBjorner commented 3 years ago

If you have a way to retrieve SMT2 files it also helps. For the particular case of the last example, it matters a lot which random seed I use. It often suggests that there is something to improve with regards to fairness. Across a larger set of benchmarks, there will be enough samples where the fairness or similar issues poke out.

zvonimir commented 3 years ago

Is there debug info that Z3 prints out that tells me which arithmetic solver it is running?

NikolajBjorner commented 3 years ago

I use statistics and look for characteristics.

New solver (look for propagations-cheap)

(:added-eqs                       2632
 :arith-eq-adapter                163
 :arith-assume-eqs                1
 :arith-bound-propagations-cheap  1149
 :arith-bound-propagations-lp     995
 :arith-cheap-eqs                 11
 :arith-conflicts                 1980
 :arith-diseq                     238
 :arith-grobner-calls             9
 :arith-horner-calls              249
 :arith-horner-cross-nested-forms 243049
 :arith-lower                     2334
 :arith-make-feasible             3863
 :arith-max-columns               2372
 :arith-max-rows                  2006
 :arith-nla-explanations          1975
 :arith-nla-lemmas                1976
 :arith-propagations              1149
 :arith-upper                     2262

old solver (look for arith-tableau-max-columns and similar):

(:added-eqs                 113457
 :arith-assert-diseq        7122
 :arith-assert-lower        12130
 :arith-assert-upper        14889
 :arith-assume-eqs          3
 :arith-bound-prop          32
 :arith-branch-var          1
 :arith-conflicts           413
 :arith-eq-adapter          224
 :arith-fixed-eqs           1924
 :arith-gcd-tests           2
 :arith-grobner             416
 :arith-ineq-splits         1
 :arith-max-min             8238
 :arith-nonlinear-bounds    428
 :arith-nonlinear-horner    8359
 :arith-num-rows            45
 :arith-offset-eqs          366
 :arith-patches             1
 :arith-pivots              3707
 :arith-pseudo-nonlinear    292
 :arith-row-summations      56690
 :arith-tableau-max-columns 411
 :arith-tableau-max-rows    45
zvonimir commented 3 years ago

Ok, I got it. So scatter plot generation is buggy for some reason. We haven't been using it that much recently, and I guess some of our recent updates broke it. Not sure. But you should totally disregard the scatter plots.

If you look at raw results in the table for this set: http://node0.smackbenchwebserver.smack.emulab.net/results.py?runset=runs%2Fexec_2020.12.19_13.15.50.232242_SoftwareSystems-AWS-C-Common-ReachSafety&runset=runs%2Fexec_2020.12.15_10.17.38.505765_SoftwareSystems-AWS-C-Common-ReachSafety#/table

you can notice that there are noticeable differences in runtimes, especially for the longer running queries.

I also did a sanity check based on your suggestion above, and we are indeed running the new arithmetic solver.

NikolajBjorner commented 3 years ago

Sometimes taking shorter running examples that show average regressions is easier. For example these ones:

aws_byte_buf_append_dynamic_harness.ymltrue true 18.0  19.1  99.8 true 11.0  12.7  95.2 aws_priority_queue_remove_harness.ymltrue true 26.9  27.9  178   true 10.6  12.2  140   aws_priority_queue_push_harness.ymltrue true 24.4  25.3  166   true 10.5  12.0  145   aws_byte_buf_reset_harness.ymltrue true 18.5  19.6  92.1 true 10.5  12.4  92.1 aws_priority_queue_push_ref_harness.ymltrue true 20.8  21.9  160  

zvonimir commented 3 years ago

Got it. If there is a particular one you are maybe interested in (or a few of them), I can generate the query for you.

NikolajBjorner commented 3 years ago

Yes please, exactly these ones or 3-6 of them, such as:

aws_priority_queue_push_ref_harness.yml aws_byte_buf_reset_harness.yml aws_priority_queue_push_harness.yml

as SMTLIB2 commands, ....

NikolajBjorner commented 3 years ago

For your piro.txt case: it is indeed fairness. Generally the new solver is much faster, but then gets trapped in some cases during non-linear lemma generation. It seems to indicate that we are not using inference rules with sufficient fairness. The example goes through if I remove the particular case for lemma generation manually in the code. (@levnach it is hitting the order_lemma_on_binomial_sign case, which uses the values of variables in the new lemmas. Since the values change between calls it may end up generating a large number of lemmas without yielding the floor to other procedures, such as tangent and nlsat).

NikolajBjorner commented 3 years ago

@zvonimir. A different report exposed places where the new solver was missing propagation. Fixes to this helped. If you can provide perf regression repros then great, otherwise I would like to close this issue.

zvonimir commented 3 years ago

Sorry for the delay. Let me run our regressions on the latest nightly release.

zvonimir commented 3 years ago

@NikolajBjorner Quick question. I don't see a nightly release for Ubuntu. I am looking here: https://github.com/Z3Prover/z3/releases/tag/Nightly Am I missing something? Should I build it from source on Ubuntu!? Thanks!

NikolajBjorner commented 3 years ago

it gets named glibc these days.

zvonimir commented 3 years ago

Ok, so I tried the nightly version. At this point I see very few difference between the two solvers. However, there are still some left. I attached one such query (tmp.zip), which gets solved using the old solver (meaning, when (set-option :smt.arith.solver 2) is set), while Z3 doesn't come back when the new solver is used.

Maybe it is a relatively easy fix. Thanks!!!

zvonimir commented 3 years ago

Here is another one with the same issue: jain.zip It is smaller and so it might be easier to debug.

NikolajBjorner commented 3 years ago

Thanks, this is definitely much more approachable. It is still not too easy to diagnose these so it is a longer process. But given that you are not blocked (you can still use the old solver) I am taking a long-term perspective on this.

levnach commented 3 years ago

Jain.smt2 was solved fast for me in the new solver last version f4127bd6f3bd19079143ab73c4ab3779038a0cea. From the other side, tmp.smt2 was not.

NikolajBjorner commented 3 years ago

It depends on smt.random_seed value. Some seeds are quick, others slow for (set-option :smt.arith.solver 6). Note also that the option is set inside the file. At least it is a much simpler test case than tmp.smt2.

zvonimir commented 3 years ago

Not sure if this is helpful or not, but this is the C code of the jain benchmark: https://github.com/smackers/smack/blob/master/test/c/basic/jain_2_true.c Maybe you can get a sense for what kind of operations there are in the benchmark. There are 3 other jain benchmarks (see https://github.com/smackers/smack/tree/master/test/c/basic), which are all very similar, and yet those are not timing out. Thanks!

NikolajBjorner commented 3 years ago

@levnach For Jain.smt2, the new solver ends up with an infeasible tableau

x + y + 1 + 1/2 z = 0
u + v + 1/2 + 1/2z = 0

where x, y, z, u, v are integers. It performs cuts on z and branch on x, y, u, v.

The tableau is infeasible since the first row implies that z has even parity, the second that z has odd parity.

NikolajBjorner commented 3 years ago

So far it looks like the problem becomes solvable if the lp solver pivots into z. This happens on the successful run I see, but not for the unsuccessful random seeds. Determining whether adding fairness to the LP solver helps could be one avenue. A different is to revisit the notes https://github.com/Z3Prover/z3/blob/ba56bfa65640ff98133e1a04edf647c8b69bea71/src/math/lp/int_gcd_test.cpp#L32 on the accumulative gcd test. This would take care of scenarios like these.